For decades, software developers enjoyed the so-called »free lunch«, and we are not talking about company-provided catering here. Rather, they could comfortably assume that the hardware their software runs on is reliable, i.e., that a correctly specified operation produces deterministic results. However, the writing is on the wall: this period of reliance on correct hardware behavior may be over once and for all. Especially in embedded systems, technology advances such as reduced supply voltages and shrinking semiconductor feature sizes, together with ever-increasing cost pressure and shortened development cycles, will lead to a rise in transient and permanent errors in deployed systems. Today, the effects of hardware errors are often neglected by system designers; in the not-too-distant future, however, we expect them to affect the everyday operation of devices.
In contrast, computer users have never been used to error-free behavior of their systems. Increasing error rates will make it even harder for them to blame all mishaps on the software developers alone...
It may be surprising, but transient errors are not a new research topic at all. Resilience to transient memory and logic errors has shown up regularly in scientific and industry publications since at least the 1970s:
In a 2006 white paper, Tayden, Inc., analyzes the predominant causes of soft errors over the past decades:
»The radiation mechanisms that cause soft errors have been studied for the last few decades. In the late 1970s, alpha particles emitted by uranium and thorium impurities in packaging materials were the dominant cause of soft errors in DRAM devices. During the same era, studies showed that high-energy neutrons from cosmic radiation could induce soft errors via the secondary ions produced by the neutron reactions with silicon nuclei. In the mid 1990s, high-energy cosmic radiation became the dominant source of soft errors in DRAM devices. Studies have also identified a third mechanism induced by the interactions of low-energy cosmic neutrons with the boron-10 isotope, which can be found in the borophosphosilicate glass (BPSG) used in integrated circuits to form insulator layers.«
An overview of past, current and future soft error rates for different semiconductor devices can be found in Tezzaron's 2004 white paper. This publication also contains an exhaustive list of references.
If transient errors have plagued the industry for nearly four decades, one might ask why we are investigating the problem now. There are several reasons. First, soft error mitigation today is mostly performed in hardware, which may prove too costly for many embedded applications. Second, decreasing feature sizes and supply voltages will increase the likelihood of errors. Third, high-performance embedded controllers and the current trend toward multi-core embedded systems make it possible to pursue fault-tolerance methods that could not be implemented previously. Finally, an emerging trend is to see faulty systems not only as a problem, but also as an opportunity for resource-aware computing, e.g., by building more energy-efficient systems. In addition, in long-running systems it would be useful if the software could adapt to failing parts of a system and thus support graceful degradation of system functionality.
This final topic leads back to the beginning. The free lunch is over: software developers will have to accept that hardware errors are a fact of life affecting the behavior of their software. Especially in embedded real-time systems, the inherent timing, energy, memory and cost restrictions require novel approaches. Error handling has to become application-aware; based on error semantics, more efficient error handling can be implemented, and this handling will usually take place in software. Using semantic information about errors, the system software can determine how critical an error is to the system's functionality and how much time would be required to correct it. Based on this information, a decision can be made as to which errors should be corrected at which point in time. These are the fundamental ideas the FEHLER project is based on. For a first overview of our achievements, please take a look at our publications.