| Ahmed A. Eltawil, Michael Engel, Bibiche Geuskens, Amin Khajeh Djahromi, Fadi J Kurdahi, Peter Marwedel, Smail Niar and Mazen A. R. Saghir. A Survey of Cross-Layer Power-Performance-Reliability in Multi and Many Core Systems-on-Chip. Embedded Hardware Design: Microprocessors and Microsystems 2013, Invited paper
@article { kurdahi:2013:EHD,
author = {Eltawil, Ahmed A. and Engel, Michael and Geuskens, Bibiche and Djahromi, Amin Khajeh and Kurdahi, Fadi J and Marwedel, Peter and Niar, Smail and Saghir, Mazen A. R.},
title = {A Survey of Cross-Layer Power-Performance-Reliability in Multi and Many Core Systems-on-Chip},
journal = {Embedded Hardware Design: Microprocessors and Microsystems},
year = {2013},
note = {Invited paper},
url = {http://www.sciencedirect.com/science/article/pii/S0141933113000987},
keywords = {ders},
confidential = {n},
abstract = {As systems-on-chip increase in complexity, the underlying technology presents us with significant challenges due to increased power consumption as well as decreased reliability. Today, designers must consider building systems that achieve the requisite functionality and performance using components that may be unreliable. In order to do so, it is crucial to understand the close interplay between the different layers of a system: technology, platform, and application. This will enable the most general tradeoff exploration, reaping the most benefits in power, performance and reliability. This paper surveys various cross-layer techniques and approaches for power, performance, and reliability tradeoffs at the technology, circuit, architecture and application layers.},
}
|
| Andreas Heinig, Ingo Korb, Florian Schmoll, Peter Marwedel and Michael Engel. Fast and Low-Cost Instruction-Aware Fault Injection. In Proc. of SOBRES 2013
@inproceedings { heinig:2013:sobres,
author = {Heinig, Andreas and Korb, Ingo and Schmoll, Florian and Marwedel, Peter and Engel, Michael},
title = {Fast and Low-Cost Instruction-Aware Fault Injection},
booktitle = {Proc. of SOBRES 2013},
year = {2013},
url = {http://danceos.org/sobres/2013/papers/SOBRES-640-Heinig.pdf},
keywords = {ders},
confidential = {n},
abstract = {In order to assess the robustness of software-based fault-tolerance methods, extensive tests have to be performed that inject faults, such as bit flips, into hardware components of a running system. Fault injection commonly either uses system simulations, resulting in execution times orders of magnitude longer than on real systems, or exposes a real system to error sources like radiation. The latter can take place in real time, but it allows only very coarse-grained control over the affected system component.
A solution combining the best characteristics of both approaches should achieve precise fault injection in real hardware systems. The approach presented in this paper uses the JTAG background debug facility of a CPU to inject faults into main memory and registers of a running system. Compared to similar earlier approaches, our solution is able to achieve rapid fault injection using a low-cost microcontroller instead of a complex FPGA. Consequently, our injection software is much more flexible. It allows error injection to be restricted to the execution of a set of predefined components, resulting in more precise control of the injection, and also emulates error reporting, which enables the evaluation of different error detection approaches in addition to robustness evaluation.},
}
|
| Horst Schirmeier, Ingo Korb, Olaf Spinczyk and Michael Engel. Efficient Online Memory Error Assessment and Circumvention for Linux with RAMpage. International Journal of Critical Computer-Based Systems Special Issue on PRDC 2011 Dependable Architecture and Analysis 2013
@article { schirmeier:2013:ijccbs,
author = {Schirmeier, Horst and Korb, Ingo and Spinczyk, Olaf and Engel, Michael},
title = {Efficient Online Memory Error Assessment and Circumvention for Linux with RAMpage},
journal = {International Journal of Critical Computer-Based Systems},
year = {2013},
volume = {Special Issue on PRDC 2011 Dependable Architecture and Analysis},
url = {http://www.inderscience.com/info/ingeneral/forthcoming.php?jcode=ijccbs},
keywords = {ders},
confidential = {n},
abstract = {Memory errors are a major source of reliability problems in computer systems. Undetected errors may result in program termination or, even worse, silent data corruption. Recent studies have shown that the frequency of permanent memory errors is an order of magnitude higher than previously assumed and regularly affects everyday operation. To reduce the impact of memory errors, we designed RAMpage, a purely software-based infrastructure to assess and circumvent permanent memory errors in a running commodity x86-64 Linux-based system. We briefly describe the design and implementation of RAMpage and present new results from an extensive qualitative and quantitative evaluation. These results show the efficiency of our approach -- RAMpage is able to provide a smooth graceful degradation in case of permanent memory errors while requiring only a small overhead in terms of CPU time, energy, and memory space.
},
}
|
| Florian Schmoll, Andreas Heinig, Peter Marwedel and Michael Engel. Improving the Fault Resilience of an H.264 Decoder using Static Analysis Methods. ACM Transactions on Embedded Computing Systems (TECS) 13 1s, pages 31:1--31:27 December 2013
@article { schmoll:2013:tecs,
author = {Schmoll, Florian and Heinig, Andreas and Marwedel, Peter and Engel, Michael},
title = {Improving the Fault Resilience of an H.264 Decoder using Static Analysis Methods},
journal = {ACM Transactions on Embedded Computing Systems (TECS)},
year = {2013},
volume = {13},
number = {1s},
pages = {31:1--31:27},
month = {dec},
url = {http://doi.acm.org/10.1145/2536747.2536753},
keywords = {ders},
confidential = {n},
abstract = {Fault tolerance rapidly evolves into one of the most significant design objectives for embedded systems due to reduced semiconductor structures and supply voltages. However, resource-constrained systems cannot afford traditional error correction for overhead and cost reasons. New methods are required to sustain acceptable service quality in case of errors while avoiding crashes.
We present a flexible fault-tolerance approach that is able to select correction actions depending on error semantics using application annotations and static analysis approaches. We verify the validity of our approach by analyzing the vulnerability and improving the reliability of an H.264 decoder using flexible error handling.},
}
|
| A. Herkersdorf, M. Engel, M. Glaß, J. Henkel, V.B. Kleeberger, M.A. Kochte, J.M. Kühn, S.R. Nassif, H. Rauchfuss, W. Rosenstiel, U. Schlichtmann, M. Shafique, M.B. Tahoori, J. Teich, N. Wehn, C. Weis and H.-J. Wunderlich. Cross-Layer Dependability Modeling and Abstraction in Systems on Chip. In Proceedings of the Workshop on Silicon Errors in Logic System Effects (SELSE) March 2013
@inproceedings { herkersdorf:2013:selse,
author = {Herkersdorf, A. and Engel, M. and Gla{\ss}, M. and Henkel, J. and Kleeberger, V.B. and Kochte, M.A. and K\"uhn, J.M. and Nassif, S.R. and Rauchfuss, H. and Rosenstiel, W. and Schlichtmann, U. and Shafique, M. and Tahoori, M.B. and Teich, J. and Wehn, N. and Weis, C. and Wunderlich, H.-J.},
title = {Cross-Layer Dependability Modeling and Abstraction in Systems on Chip},
booktitle = {Proceedings of the Workshop on Silicon Errors in Logic System Effects (SELSE)},
year = {2013},
month = {March},
keywords = {ders},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2013-selse-herkersdorf.pdf},
confidential = {n},
abstract = {The Resilience Articulation Point (RAP) model aims at provisioning researchers and developers with a probabilistic fault abstraction and error propagation framework covering all hardware/software layers of a System on Chip. RAP assumes that physically induced faults at the technology or CMOS device layer will eventually manifest themselves as a single or multiple bit flip(s). When probabilistic error functions for specific fault origins are known at the bit or signal level, knowledge about the unit of design and its environment allows the transformation of the bit-related error functions into characteristic higher layer representations, such as error functions for data words, Finite State Machine (FSM) state, macro interfaces or software variables. Thus, design concerns at higher abstraction layers can be investigated without the necessity to further consider the full details of lower levels of design. This paper introduces the ideas of RAP based on examples of radiation induced soft errors in SRAM cells and sequential CMOS logic. It shows by example how probabilistic bit flips are systematically abstracted and propagated towards higher abstraction levels up to the application software layer, and how RAP can be used to parameterize architecture level resilience methods.},
}
|
| Björn Döbel, Horst Schirmeier and Michael Engel. Investigating the Limitations of PVF for Realistic Program Vulnerability Assessment. In Proceedings of the 5th Workshop on Design for Reliability (DFR) January 2013, - Best Poster Award -
@inproceedings { doebel:2013:dfr,
author = {D\"obel, Bj\"orn and Schirmeier, Horst and Engel, Michael},
title = {Investigating the Limitations of PVF for Realistic Program Vulnerability Assessment},
booktitle = {Proceedings of the 5th Workshop on Design for Reliability (DFR)},
year = {2013},
month = {January},
note = {- Best Poster Award -},
keywords = {ders},
file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2013-dfr-doebel.pdf},
confidential = {n},
abstract = {From a software developer's perspective, fault injection (FI) is the most complete way of evaluating the sensitivity of a program against hardware errors. Unfortunately, FI campaigns require a substantial investment of both time and computing resources, making their application infeasible in many cases.
Program Vulnerability Factor (PVF) analysis has been proposed as an alternative for estimating software vulnerability. In this paper we present PVF/x86, a tool for computing the PVF for x86 programs. We validate the use of PVF analysis by running PVF/x86 on an image decoder application and compare the results to those obtained with a state-of-the-art FI framework. We identify weak spots of PVF analysis and outline ideas for addressing those points.},
}
|