LS12 Publications on Dependable Embedded Systems

2014

Andreas Herkersdorf, Hananeh Aliee, Michael Engel, Michael Glaß, Christina Gimmler-Dumont, Jörg Henkel, Veit B Kleeberger, Michael A Kochte, Johannes M Kühn, Daniel Mueller-Gritschneder, Sani R Nassif, Holm Rauchfuss, Wolfgang Rosenstiel, Ulf Schlichtmann, Muhammad Shafique, Mehdi B Tahoori, Jürgen Teich, Norbert Wehn, Christian Weis and Hans-Joachim Wunderlich.
Resilience Articulation Point (RAP): Cross-layer dependability modeling for nanometer system-on-chip resilience.
Microelectronics Reliability
2014

@article { Herkersdorf:2014,
  author = {Herkersdorf, Andreas and Aliee, Hananeh and Engel, Michael and Gla{\ss}, Michael and Gimmler-Dumont, Christina and Henkel, J\"org and Kleeberger, Veit B and Kochte, Michael A and K\"uhn, Johannes M and Mueller-Gritschneder, Daniel and Nassif, Sani R and Rauchfuss, Holm and Rosenstiel, Wolfgang and Schlichtmann, Ulf and Shafique, Muhammad and Tahoori, Mehdi B and Teich, J\"urgen and Wehn, Norbert and Weis, Christian and Wunderlich, Hans-Joachim},
  title = {Resilience Articulation Point (RAP): Cross-layer dependability modeling for nanometer system-on-chip resilience},
  journal = {Microelectronics Reliability},
  year = {2014},
  url = {http://dx.doi.org/10.1016/j.microrel.2013.12.012},
  keywords = {daes, ders},
  confidential = {n},
  abstract = {The Resilience Articulation Point (RAP) model aims at provisioning researchers and developers with a probabilistic fault abstraction and error propagation framework covering all hardware/software layers of a System on Chip. RAP assumes that physically induced faults at the technology or CMOS device layer will eventually manifest themselves as a single or multiple bit flip(s). When probabilistic error functions for specific fault origins are known at the bit or signal level, knowledge about the unit of design and its environment allow the transformation of the bit-related error functions into characteristic higher layer representations, such as error functions for data words, Finite State Machine (FSM) state, macro-interfaces or software variables. Thus, design concerns at higher abstraction layers can be investigated without the necessity to further consider the full details of lower levels of design. This paper introduces the ideas of RAP based on examples of radiation induced soft errors in SRAM cells, voltage variations and sequential CMOS logic. It shows by example how probabilistic bit flips are systematically abstracted and propagated towards higher abstraction levels up to the application software layer, and how RAP can be used to parameterize architecture-level resilience methods.},
}

Andreas Heinig, Florian Schmoll, Peter Marwedel and Michael Engel.
Who's Using that Memory? A Subscriber Model for Mapping Errors to Tasks.
In Proceedings of the 10th Workshop on Silicon Errors in Logic - System Effects (SELSE)
Stanford, CA, USA, April 2014

@inproceedings { heinig:2014:SELSE,
  author = {Heinig, Andreas and Schmoll, Florian and Marwedel, Peter and Engel, Michael},
  title = {Who's Using that Memory? A Subscriber Model for Mapping Errors to Tasks},
  booktitle = {Proceedings of the 10th Workshop on Silicon Errors in Logic - System Effects (SELSE)},
  year = {2014},
  address = {Stanford, CA, USA},
  month = {April},
  keywords = {ders},
  file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2014-heinig-selse2014.pdf},
  confidential = {n},
  abstract = {In order to assess the robustness of software-based fault-tolerance
methods, extensive tests have to be performed that
inject faults, such as bit flips, into hardware components of a running
system. Fault injection commonly uses either system simulations, resulting
in execution times orders of magnitude longer than on real systems, or
exposes a real system to error sources like radiation. This can take place
in real time, but it enables only a very coarse-grained control over the
affected system component.

A solution combining the best characteristics from both approaches should
achieve precise fault injection in real hardware systems. The approach
presented in this paper uses the JTAG background debug facility of a CPU
to inject faults into main memory and registers of a running system. Compared
to similar earlier approaches, our solution is able to achieve rapid
fault injection using a low-cost microcontroller instead of a complex
FPGA. Consequently, our injection software is much more flexible. It
allows to restrict error injection to the execution of a set of predefined
components, resulting in a more precise control of the injection, and
also emulates error reporting, which enables the evaluation
of different error detection approaches in addition to robustness
evaluation.
},
}

2013

Ahmed A. Eltawil, Michael Engel, Bibiche Geuskens, Amin Khajeh Djahromi, Fadi J Kurdahi, Peter Marwedel, Smail Niar and Mazen A. R. Saghir.
A Survey of Cross-Layer Power-Performance-Reliability in Multi and Many Core Systems-on-Chip.
Embedded Hardware Design: Microprocessors and Microsystems
2013, Invited paper

@article { kurdahi:2013:EHD,
  author = {Eltawil, Ahmed A. and Engel, Michael and Geuskens, Bibiche and Djahromi, Amin Khajeh and Kurdahi, Fadi J and Marwedel, Peter and Niar, Smail and Saghir, Mazen A. R.},
  title = {A Survey of Cross-Layer Power-Performance-Reliability in Multi and Many Core Systems-on-Chip},
  journal = {Embedded Hardware Design: Microprocessors and Microsystems},
  year = {2013},
  note = {Invited paper},
  url = {http://www.sciencedirect.com/science/article/pii/S0141933113000987},
  keywords = {ders},
  confidential = {n},
  abstract = {As systems-on-chip increase in complexity, the underlying technology presents us with significant challenges due to increased power consumption as well as decreased reliability. Today, designers must consider building systems that achieve the requisite functionality and performance using components that may be unreliable. In order to do so, it is crucial to understand the close interplay between the different layers of a system: technology, platform, and application. This will enable the most general tradeoff exploration, reaping the most benefits in power, performance and reliability. This paper surveys various cross layer techniques and approaches for power, performance, and reliability tradeoffs at the technology, circuit, architecture and application layers.},
}

Andreas Heinig, Ingo Korb, Florian Schmoll, Peter Marwedel and Michael Engel.
Fast and Low-Cost Instruction-Aware Fault Injection.
In Proc. of SOBRES 2013
2013

@inproceedings { heinig:2013:sobres,
  author = {Heinig, Andreas and Korb, Ingo and Schmoll, Florian and Marwedel, Peter and Engel, Michael},
  title = {Fast and Low-Cost Instruction-Aware Fault Injection},
  booktitle = {Proc. of SOBRES 2013},
  year = {2013},
  url = {http://danceos.org/sobres/2013/papers/SOBRES-640-Heinig.pdf},
  keywords = {ders},
  confidential = {n},
  abstract = {In order to assess the robustness of software-based fault-tolerance methods, extensive tests have to be performed that inject faults, such as bit flips, into hardware components of a running system. Fault injection commonly uses either system simulations, resulting in execution times orders of magnitude longer than on real systems, or exposes a real system to error sources like radiation. This can take place in real time, but it enables only a very coarse-grained control over the affected system component. A solution combining the best characteristics from both approaches should achieve precise fault injection in real hardware systems. The approach presented in this paper uses the JTAG background debug facility of a CPU to inject faults into main memory and registers of a running system. Compared to similar earlier approaches, our solution is able to achieve rapid fault injection using a low-cost microcontroller instead of a complex FPGA. Consequently, our injection software is much more flexible. It allows to restrict error injection to the execution of a set of predefined components, resulting in a more precise control of the injection, and also emulates error reporting, which enables the evaluation of different error detection approaches in addition to robustness evaluation.},
}

Horst Schirmeier, Ingo Korb, Olaf Spinczyk and Michael Engel.
Efficient Online Memory Error Assessment and Circumvention for Linux with RAMpage.
International Journal of Critical Computer-Based Systems, Special Issue on PRDC 2011 Dependable Architecture and Analysis
2013

@article { schirmeier:2013:ijccbs,
  author = {Schirmeier, Horst and Korb, Ingo and Spinczyk, Olaf and Engel, Michael},
  title = {Efficient Online Memory Error Assessment and Circumvention for Linux with RAMpage},
  journal = {International Journal of Critical Computer-Based Systems},
  year = {2013},
  volume = {Special Issue on PRDC 2011 Dependable Architecture and Analysis},
  url = {http://www.inderscience.com/info/ingeneral/forthcoming.php?jcode=ijccbs},
  keywords = {ders},
  confidential = {n},
  abstract = {Memory errors are a major source of reliability problems in computer systems. Undetected errors may result in program termination or, even worse, silent data corruption. Recent studies have shown that the frequency of permanent memory errors is an order of magnitude higher than previously assumed and regularly affects everyday operation. To reduce the impact of memory errors, we designed RAMpage, a purely software-based infrastructure to assess and circumvent permanent memory errors in a running commodity x86-64 Linux-based system. We briefly describe the design and implementation of RAMpage and present new results from an extensive qualitative and quantitative evaluation. These results show the efficiency of our approach -- RAMpage is able to provide a smooth graceful degradation in case of permanent memory errors while requiring only a small overhead in terms of CPU time, energy, and memory space.},
}

Florian Schmoll, Andreas Heinig, Peter Marwedel and Michael Engel.
Improving the Fault Resilience of an H.264 Decoder using Static Analysis Methods.
ACM Transactions on Embedded Computing Systems (TECS) 13(1s), pages 31:1--31:27
December 2013

@article { schmoll:2013:tecs,
  author = {Schmoll, Florian and Heinig, Andreas and Marwedel, Peter and Engel, Michael},
  title = {Improving the Fault Resilience of an H.264 Decoder using Static Analysis Methods},
  journal = {ACM Transactions on Embedded Computing Systems (TECS)},
  year = {2013},
  volume = {13},
  number = {1s},
  pages = {31:1--31:27},
  month = {dec},
  url = {http://doi.acm.org/10.1145/2536747.2536753},
  keywords = {ders},
  confidential = {n},
  abstract = {Fault tolerance rapidly evolves into one of the most significant design objectives for embedded systems due to reduced semiconductor structures and supply voltages. However, resource-constrained systems cannot afford traditional error correction for overhead and cost reasons. New methods are required to sustain acceptable service quality in case of errors while avoiding crashes. We present a flexible fault-tolerance approach that is able to select correction actions depending on error semantics using application annotations and static analysis approaches. We verify the validity of our approach by analyzing the vulnerability and improving the reliability of an H.264 decoder using flexible error handling.},
}

A. Herkersdorf, M. Engel, M. Glaß, J. Henkel, V.B. Kleeberger, M.A. Kochte, J.M. Kühn, S.R. Nassif, H. Rauchfuss, W. Rosenstiel, U. Schlichtmann, M. Shafique, M.B. Tahoori, J. Teich, N. Wehn, C. Weis and H.-J. Wunderlich.
Cross-Layer Dependability Modeling and Abstraction in Systems on Chip.
In Proceedings of the Workshop on Silicon Errors in Logic System Effects (SELSE)
March 2013

@inproceedings { herkersdorf:2013:selse,
  author = {Herkersdorf, A. and Engel, M. and Gla{\ss}, M. and Henkel, J. and Kleeberger, V.B. and Kochte, M.A. and K\"uhn, J.M. and Nassif, S.R. and Rauchfuss, H. and Rosenstiel, W. and Schlichtmann, U. and Shafique, M. and Tahoori, M.B. and Teich, J. and Wehn, N. and Weis, C. and Wunderlich, H.-J.},
  title = {Cross-Layer Dependability Modeling and Abstraction in Systems on Chip},
  booktitle = {Proceedings of the Workshop on Silicon Errors in Logic System Effects (SELSE)},
  year = {2013},
  month = {March},
  keywords = {ders},
  file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2013-selse-herkersdorf.pdf},
  confidential = {n},
  abstract = {The Resilience Articulation Point (RAP) model aims at provisioning researchers and developers with a probabilistic fault abstraction and error propagation framework covering all hardware/software layers of a System on Chip. RAP assumes that physically induced faults at the technology or CMOS device layer will eventually manifest themselves as a single or multiple bit flip(s). When probabilistic error functions for specific fault origins are known at the bit or signal level, knowledge about the unit of design and its environment allow the transformation of the bit-related error functions into characteristic higher layer representations, such as error functions for data words, Finite State Machine (FSM) state, macro interfaces or software variables. Thus, design concerns at higher abstraction layers can be investigated without the necessity to further consider the full details of lower levels of design. This paper introduces the ideas of RAP based on examples of radiation induced soft errors in SRAM cells and sequential CMOS logic. It shows by example how probabilistic bit flips are systematically abstracted and propagated towards higher abstraction levels up to the application software layer, and how RAP can be used to parameterize architecture level resilience methods.},
}

Björn Döbel, Horst Schirmeier and Michael Engel.
Investigating the Limitations of PVF for Realistic Program Vulnerability Assessment.
In Proceedings of the 5th Workshop on Design for Reliability (DFR)
January 2013, - Best Poster Award -

@inproceedings { doebel:2013:dfr,
  author = {D\"obel, Bj\"orn and Schirmeier, Horst and Engel, Michael},
  title = {Investigating the Limitations of PVF for Realistic Program Vulnerability Assessment},
  booktitle = {Proceedings of the 5th Workshop on Design for Reliability (DFR)},
  year = {2013},
  month = {January},
  note = {- Best Poster Award -},
  keywords = {ders},
  file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2013-dfr-doebel.pdf},
  confidential = {n},
  abstract = {From a software developer's perspective, fault injection (FI) is the most complete way of evaluating the sensitivity of a program against hardware errors. Unfortunately, FI campaigns require a substantial investment of both, time and computing resources, making their application infeasible in many cases. Program Vulnerability Factor (PVF) analysis has been proposed as an alternative for estimating software vulnerability. In this paper we present PVF/x86, a tool for computing the PVF for x86 programs. We validate the use of PVF analysis by running PVF/x86 on an image decoder application and compare the results to those obtained with a state-of-the-art FI framework. We identify weak spots of PVF analysis and outline ideas for addressing those points.},
}

2012

Björn Döbel, Hermann Härtig and Michael Engel.
Operating System Support for Redundant Multithreading.
In Proceedings of EMSOFT
October 2012

@inproceedings { doebel:2012:EMSOFT,
  author = {D\"obel, Bj\"orn and H\"artig, Hermann and Engel, Michael},
  title = {Operating System Support for Redundant Multithreading},
  booktitle = {Proceedings of EMSOFT},
  year = {2012},
  month = {oct},
  keywords = {ders},
  file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2012-emsoft-doebel.pdf},
  confidential = {n},
  abstract = {In modern commodity operating systems, core functionality is usually designed assuming that the underlying processor hardware always functions correctly. Shrinking hardware feature sizes break this assumption. Existing approaches to cope with these issues either use hardware functionality that is not available in commercial-off-the-shelf (COTS) systems or poses additional requirements on the software development side, making reuse of existing software hard, if not impossible. In this paper we present Romain, a framework that provides transparent redundant multithreading as an operating system service for hardware error detection and recovery. When applied to a standard benchmark suite, Romain requires a maximum runtime overhead of 30% for triple modular redundancy (while in many cases remaining below 5%). Furthermore, our approach minimizes the complexity added to the operating system for the sake of replication.},
}

Michael Engel and Björn Döbel.
The Reliable Computing Base – A Paradigm for Software-based Reliability.
In Proceedings of SOBRES
September 2012

@inproceedings { engel:2012:sobres,
  author = {Engel, Michael and D\"obel, Bj\"orn},
  title = {The Reliable Computing Base – A Paradigm for Software-based Reliability},
  booktitle = {Proceedings of SOBRES},
  year = {2012},
  month = {sep},
  keywords = {ders},
  file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2012-sobres-engel.pdf},
  confidential = {n},
  abstract = {For embedded systems, the use of software-based error detection and correction approaches is an attractive means in order to reduce often inconvenient overheads in hardware. To ensure that such a software-based fault-tolerance approach is effective, it must be guaranteed that a certain amount of hardware and software components in a system can be trusted to provide correct service in the presence of errors. In analogy with the Trusted Computing Base (TCB) in security research, we call these components the Reliable Computing Base (RCB). Similar to the TCB, it is also desirable to reduce the size of the RCB, so the overhead in redundant hardware resources can be reduced. In this position paper, we describe approaches for informal as well as formal definitions of the RCB, the related metrics and approaches for RCB minimization.},
}

Andreas Heinig, Vincent J. Mooney, Florian Schmoll, Peter Marwedel, Krishna Palem and Michael Engel.
Classification-based Improvement of Application Robustness and Quality of Service in Probabilistic Computer Systems.
In Proceedings of ARCS 2012 - International Conference on Architecture of Computing Systems
Munich, Germany, March 2012, -- ARCS 2012 Best Paper Award Winner --

@inproceedings { heinig:2012:arcs,
  author = {Heinig, Andreas and Mooney, Vincent J. and Schmoll, Florian and Marwedel, Peter and Palem, Krishna and Engel, Michael},
  title = {Classification-based Improvement of Application Robustness and Quality of Service in Probabilistic Computer Systems},
  booktitle = {Proceedings of ARCS 2012 - International Conference on Architecture of Computing Systems},
  year = {2012},
  address = {Munich, Germany},
  month = {mar},
  note = {-- ARCS 2012 Best Paper Award Winner --},
  keywords = {ders},
  file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2012-arcs-heinig.pdf},
  confidential = {n},
  abstract = {Future semiconductors no longer guarantee permanent deterministic operation. They are expected to show probabilistic behavior due to lowered voltages and shrinking structures. Compared to radiation-induced errors, probabilistic systems face increased error frequencies leading to unexpected bit-flips. Approaches like probabilistic CMOS provide methods to control error distributions which reduce the error probability in more significant bits. However, instructions handling control flow or pointers still require determinism, requiring a classification to identify these instructions. We apply our transient error classification to probabilistic circuits using differing voltage distributions. Static analysis ensures that probabilistic effects only affect unreliable operations which accept a certain level of impreciseness, and that errors in probabilistic components will never propagate to critical operations. To evaluate, we analyze robustness and quality-of-service of an H.264 video decoder. Using classification results, we map unreliable arithmetic operations onto probabilistic components of an MPARM model, while remaining operations use deterministic components.},
}

Michael Engel and Peter Marwedel.
Semantic Gaps in Software-Based Reliability.
In Proceedings of the 4th Workshop on Design for Reliability (DFR'12)
Paris, France, January 2012

@inproceedings { engel:dfr:2012,
  author = {Engel, Michael and Marwedel, Peter},
  title = {Semantic Gaps in Software-Based Reliability},
  booktitle = {Proceedings of the 4th Workshop on Design for Reliability (DFR'12)},
  year = {2012},
  address = {Paris, France},
  month = {jan},
  organization = {HiPEAC},
  keywords = {ders},
  confidential = {n},
  abstract = {Future semiconductors will show a heterogeneous distribution of permanent faults as a result of fabrication variations and aging. To increase yields and lifetimes of these chips, a fault tolerance approach is required that handles resources on a small-scale basis with low overhead. In embedded systems, this overhead can be reduced by classifying data and instructions to determine the varying impact of errors on different instructions and data. Using this classification, only errors with significant impact on system behavior have to be corrected. In this position paper, we describe one problem with this analysis, the semantic gap between high-level language source code and the low-level data flow through architecture components. In addition, we discuss possible approaches to handle this gap. Of special interest are the implications on achieving reliable execution of dependability-critical code.},
}

2011

Horst Schirmeier, Jens Neuhalfen, Ingo Korb, Olaf Spinczyk and Michael Engel.
RAMpage: Graceful Degradation Management for Memory Errors in Commodity Linux Servers.
In Proceedings of the 11th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2011)
Pasadena, USA, 2011

@inproceedings { schirmeier:11:prdc,
  author = {Schirmeier, Horst and Neuhalfen, Jens and Korb, Ingo and Spinczyk, Olaf and Engel, Michael},
  title = {RAMpage: Graceful Degradation Management for Memory Errors in Commodity Linux Servers},
  booktitle = {Proceedings of the 11th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2011)},
  year = {2011},
  address = {Pasadena, USA},
  organization = {IEEE Computer Society Press},
  keywords = {ders},
  file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-prdc-schirmeier.pdf},
  confidential = {n},
  abstract = {Memory errors are a major source of reliability problems in current computers. Undetected errors may result in program termination, or, even worse, silent data corruption. Recent studies have shown that the frequency of permanent memory errors is an order of magnitude higher than previously assumed and regularly affects everyday operation. Often, neither additional circuitry to support hardware-based error detection nor downtime for performing hardware tests can be afforded. In the case of permanent memory errors, a system faces two challenges: detecting errors as early as possible and handling them while avoiding system downtime. To increase system reliability, we have developed RAMpage, an online memory testing infrastructure for commodity x86-64-based Linux servers, which is capable of efficiently detecting memory errors and which provides graceful degradation by withdrawing affected memory pages from further use. We describe the design and implementation of RAMpage and present results of an extensive qualitative as well as quantitative evaluation.},
}

Jörg Henkel, Lars Bauer, Joachim Becker, Oliver Bringmann, Uwe Brinkschulte, Samarjit Chakraborty, Michael Engel, Rolf Ernst, Hermann Härtig, Lars Hedrich, Andreas Herkersdorf, Rüdiger Kapitza, Daniel Lohmann, Peter Marwedel, Marco Platzner, Wolfgang Rosenstiel, Ulf Schlichtmann, Olaf Spinczyk, Mehdi Tahoori, Jürgen Teich, Norbert Wehn and Hans-Joachim Wunderlich.
Design and Architectures for Dependable Embedded Systems.
In Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS)
Taipei, Taiwan, October 2011
[BibTeX][PDF][Abstract]

@inproceedings { SPP1500:11,
  author = {Henkel, J{\"o}rg and Bauer, Lars and Becker, Joachim and Bringmann, Oliver and Brinkschulte, Uwe and Chakraborty, Samarjit and Engel, Michael and Ernst, Rolf and H{\"a}rtig, Hermann and Hedrich, Lars and Herkersdorf, Andreas and Kapitza, R{\"u}diger and Lohmann, Daniel and Marwedel, Peter and Platzner, Marco and Rosenstiel, Wolfgang and Schlichtmann, Ulf and Spinczyk, Olaf and Tahoori, Mehdi and Teich, J{\"u}rgen and Wehn, Norbert and Wunderlich, Hans-Joachim},
  title = {Design and Architectures for Dependable Embedded Systems},
  booktitle = {Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS)},
  year = {2011},
  address = {Taipei, Taiwan},
  month = {oct},
  keywords = {ders},
  file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-esweek-marwedel.pdf},
  confidential = {n},
  abstract = {The paper presents an overview of a major research project on dependable embedded systems that has started in Fall 2010 and is running for a projected duration of six years. Aim is a `dependability co-design' that spans various levels of abstraction in the design process of embedded systems starting from gate level through operating system, applications software to system architecture. In addition, we present a new classification on faults, errors, and failures.},
}

Michael Engel, Florian Schmoll, Andreas Heinig and Peter Marwedel.
Unreliable yet Useful -- Reliability Annotations for Data in Cyber-Physical Systems.
In Proceedings of the 2011 Workshop on Software Language Engineering for Cyber-physical Systems (WS4C)
Berlin / Germany, October 2011
[BibTeX][PDF][Abstract]

@inproceedings { engel:11:ws4c,
  author = {Engel, Michael and Schmoll, Florian and Heinig, Andreas and Marwedel, Peter},
  title = {Unreliable yet Useful -- Reliability Annotations for Data in Cyber-Physical Systems},
  booktitle = {Proceedings of the 2011 Workshop on Software Language Engineering for Cyber-physical Systems (WS4C)},
  year = {2011},
  address = {Berlin / Germany},
  month = {oct},
  keywords = {ders},
  file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-ws4c-engel.pdf},
  confidential = {n},
  abstract = {Today, cyber-physical systems face yet another challenge in addition to the traditional constraints in energy, computing power, or memory. Shrinking semiconductor structure sizes and supply voltages imply that the number of errors that manifest themselves in a system will rise significantly. Most CP systems have to survive errors, but many systems do not have sufficient resources to correct all errors that show up. Thus, it is important to spend the available resources on handling errors with the most critical effect. We propose an ``unreliability'' annotation for data types in C programs that indicates if an error showing up in a specific variable or data structure will possibly cause a severe problem like a program crash or might only show rather negligible effects, e.g., a discolored pixel in video decoding. This classification of data is supported by static analysis methods that verify if the value contained in a variable marked as unreliable does not end up as part of a critical operation, e.g., an array index or loop termination condition. This classification enables several approaches to flexible error handling. For example, a CP system designer might choose to selectively safeguard variables marked as non-unreliable or to employ memories with different reliability properties to store the respective values.},
}

Michael Engel, Florian Schmoll, Andreas Heinig and Peter Marwedel.
Temporal Properties of Error Handling for Multimedia Applications.
In Proceedings of the 14th ITG Conference on Electronic Media Technology
Dortmund / Germany, February 2011
[BibTeX][PDF][Abstract]

@inproceedings { engel:11:itg,
  author = {Engel, Michael and Schmoll, Florian and Heinig, Andreas and Marwedel, Peter},
  title = {Temporal Properties of Error Handling for Multimedia Applications},
  booktitle = {Proceedings of the 14th ITG Conference on Electronic Media Technology},
  year = {2011},
  address = {Dortmund / Germany},
  month = {feb},
  keywords = {ders},
  file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2011-itg-engel.pdf},
  confidential = {n},
  abstract = {In embedded consumer electronics devices, cost pressure is one of the driving design objectives. Devices that handle multimedia information, like DVD players or digital video cameras, require high computing performance and real-time capabilities while adhering to the cost restrictions. The cost pressure often results in system designs that barely exceed the minimum requirements for such a system. Thus, hardware-based fault tolerance methods frequently are ignored due to their cost overhead. However, the amount of transient faults showing up in semiconductor-based systems is expected to increase sharply in the near future. Thus, low-overhead methods to correct related errors in such systems are required. Considering restrictions in processing speed, the real-time properties of a system with added error handling are of special interest. In this paper, we present our approach to flexible error handling and discuss the challenges as well as the inherent timing dependencies to deploy it in a typical soft real-time multimedia system, an H.264 video decoder.},
}

2010

Andreas Heinig, Michael Engel, Florian Schmoll and Peter Marwedel.
Using Application Knowledge to Improve Embedded Systems Dependability.
In Proceedings of the Workshop on Hot Topics in System Dependability (HotDep 2010)
Vancouver, Canada, October 2010
[BibTeX][PDF]

@inproceedings { heinig:10:hotdep,
  author = {Heinig, Andreas and Engel, Michael and Schmoll, Florian and Marwedel, Peter},
  title = {Using Application Knowledge to Improve Embedded Systems Dependability},
  booktitle = {Proceedings of the Workshop on Hot Topics in System Dependability (HotDep 2010)},
  year = {2010},
  address = {Vancouver, Canada},
  month = {oct},
  publisher = {USENIX Association},
  keywords = {ders},
  file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2010-heinig-hotdep.pdf},
  confidential = {n},
}

Andreas Heinig, Michael Engel, Florian Schmoll and Peter Marwedel.
Improving Transient Memory Fault Resilience of an H.264 Decoder.
In Proceedings of the Workshop on Embedded Systems for Real-time Multimedia (ESTIMedia 2010)
Scottsdale, AZ, USA, October 2010
[BibTeX][PDF]

@inproceedings { heinig:10:estimedia,
  author = {Heinig, Andreas and Engel, Michael and Schmoll, Florian and Marwedel, Peter},
  title = {Improving Transient Memory Fault Resilience of an H.264 Decoder},
  booktitle = {Proceedings of the Workshop on Embedded Systems for Real-time Multimedia (ESTIMedia 2010)},
  year = {2010},
  address = {Scottsdale, AZ, USA},
  month = {oct},
  publisher = {IEEE Computer Society Press},
  keywords = {ders},
  file = {http://ls12-www.cs.tu-dortmund.de/daes/media/documents/publications/downloads/2010-heinig-estimedia.pdf},
  confidential = {n},
}
