Publication on Machine Learning to Predict Hardware Failure Recognized as Best of SELSE
A paper presented at the Silicon Errors in Logic – System Effects (SELSE 2020) workshop in February has earned best paper recognition for Assistant Professor of Electrical and Computer Engineering Dr. Xun Jiao and his PhD student Dongning Ma. As one of three “Best of SELSE” papers, “An Input-aware Learning-based Error Model of Voltage-Scaled Functional Units” will be presented virtually during a special session of the Dependable Systems and Networks 2020 conference in late June. The paper has also been accepted for publication in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
Dr. Jiao explains in simple terms that this paper “uses machine learning, one form of artificial intelligence, to predict hardware failure under extreme operating conditions.”
From the abstract:
As Moore’s Law comes to an end and transistor scaling increasingly falls short in improving energy efficiency, alternative computing paradigms are direly needed. This need is further highlighted by the overwhelming increase in computing demand posed by emerging applications such as multimedia and data analysis. Fortunately, such driving workloads also present new opportunities since, thanks to their inherent error tolerance, they do not require completely accurate computations. Thus, by trading off accuracy for better performance or improved efficiency, approximate computing promises tremendous growth for future computing. Various approximation methods demonstrate the effectiveness of voltage scaling in functional units (FUs) for exploring this energy-error trade-off. Yet, while an accurate error model is critical for assessing the error behavior of voltage-scaled FUs and its effects on application quality, existing error models of voltage-scaled FUs overlook the effects of input data and error rate disparity among different bits. To tackle this challenge, we propose LEVAX, an input-aware learning-based error model of voltage-scaled FUs that can predict the timing error rate (TER) for each output bit. This model is trained using random forest methods, with input features and output labels extracted from gate-level simulations. To validate its effectiveness and demonstrate its prediction accuracy, we use LEVAX on various FUs. Across all bit positions, voltage levels, and FUs, LEVAX achieves, on average, a relative error of 1.20%. LEVAX also achieves an average per-voltage Root Mean Square Error (RMSE) of 1.03% and per-bit RMSE of 1.17%. Exposing this error rate even up to the application level, LEVAX can estimate the quality of four image processing applications under voltage scaling with an average accuracy of 97.9%. To the best of our knowledge, LEVAX is the first voltage scaling error model of FUs that can incorporate the effects of input data.
This is Dr. Jiao’s third best paper award in the past 12 months. “Uncertainty Theory Based Reliability-Centric Cyber-Physical System Design” won the Best Paper award at the 2019 IEEE International Conference on Cyber Physical and Social Computing, and the Association for Computing Machinery’s International Conference on Embedded Software selected his paper “Polar: Function Code Aware Fuzz Testing of ICS Protocol” as a candidate for best paper, competing against one paper from Duke University and another from Cambridge University.