A fault-tolerant computer architecture has been conceived in an effort to reduce vulnerability to single-event upsets (spurious bit flips caused by impingement of energetic ionizing particles or photons). As in some prior fault-tolerant architectures, the redundancy needed for fault tolerance is obtained by use of multiple processors in one computer. Unlike prior architectures, the multiple processors are embedded in a single field-programmable gate array (FPGA). What makes this new approach practical is the recent commercial availability of FPGAs that are capable of having multiple embedded processors.

Faults Are Detected in this prototype system by comparison of the outputs of the two processors, which are embedded in a single FPGA. The legend “FI” denotes locations where faults are inserted for testing purposes.

A working prototype (see figure) consists of two embedded IBM PowerPC® 405 processor cores and a comparator built on a Xilinx Virtex-II Pro FPGA. This relatively simple instantiation of the architecture implements an error-detection scheme. A planned future version, incorporating four processors and two comparators, would correct some errors in addition to detecting them.
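The comparison scheme used in the prototype can be illustrated in software. The following Python sketch is a simplified model, not the FPGA logic itself. It assumes 32-bit output words, and the names `flip_bit`, `compare_outputs`, and `run_step` are hypothetical, introduced only for illustration. Both modeled processors run the same computation, an optional bit flip stands in for a fault inserted at an “FI” location, and the comparator flags any disagreement.

```python
from typing import Callable, Optional

WORD_MASK = 0xFFFFFFFF  # assumption: processor outputs are 32-bit words


def flip_bit(word: int, bit: int) -> int:
    """Model a single-event upset by inverting one bit of a 32-bit word."""
    return (word ^ (1 << bit)) & WORD_MASK


def compare_outputs(primary: int, secondary: int) -> bool:
    """Return True when the two outputs disagree, i.e. a fault is detected."""
    return primary != secondary


def run_step(compute: Callable[[int], int], operand: int,
             fault_bit: Optional[int] = None) -> bool:
    """Run one step on both modeled processors and compare the results.

    If fault_bit is given, one bit of the secondary output is flipped
    before the comparison (hypothetical stand-in for fault insertion).
    """
    primary = compute(operand) & WORD_MASK
    secondary = compute(operand) & WORD_MASK
    if fault_bit is not None:
        secondary = flip_bit(secondary, fault_bit)
    return compare_outputs(primary, secondary)


if __name__ == "__main__":
    work = lambda x: x * 3 + 1  # arbitrary example computation
    print("fault detected (no injection):", run_step(work, 7))
    print("fault detected (bit 5 flipped):", run_step(work, 7, fault_bit=5))
```

A mismatch shows that a fault occurred but not which processor produced the wrong value, which is why a two-processor arrangement detects errors without correcting them.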

This work was done by Gary Bolotin, Robert Watson, Sunant Katanyoutanant, Gary Burke, and Mandy Wang of Caltech for NASA’s Jet Propulsion Laboratory. For further information, access the Technical Support Package (TSP) free on-line at www.techbriefs.com/tsp under the Semiconductors & ICs category. NPO-40575



This Brief includes a Technical Support Package (TSP). “Multiple Embedded Processors for Fault-Tolerant Computing” (reference NPO-40575) is currently available for download from the TSP library.




NASA Tech Briefs Magazine

This article first appeared in the December, 2005 issue of NASA Tech Briefs Magazine (Vol. 29 No. 12).



Overview

The document titled "Technical Support Package for Multiple Embedded Processors for Fault-Tolerant Computing" is a pre-decisional draft prepared under NASA sponsorship. It describes the use of multiple embedded processors, specifically PowerPC (PPC) 405 cores, to improve computing reliability in aerospace applications.

The document includes sections showing simulation waveforms recorded both with and without fault injection. These waveforms are used to test how the system responds when faults occur, and the document stresses simulating a range of operating scenarios to confirm that faults are handled effectively.
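The idea of comparing runs with and without fault injection can be sketched as a small software simulation. This is only an illustrative model, not the simulation environment used in the document: the function `simulate`, the 32-bit word width, and the use of random values as processor outputs are all assumptions. Each run produces a per-cycle mismatch trace, loosely analogous to a comparator signal in a waveform.

```python
import random
from typing import List, Optional


def simulate(cycles: int, inject_at: Optional[int] = None,
             bit: int = 0, seed: int = 0) -> List[bool]:
    """Return a per-cycle list of comparator mismatch flags.

    Both modeled processors produce the same 32-bit value each cycle.
    If inject_at is set, one bit of the secondary output is flipped in
    that cycle (hypothetical stand-in for an injected fault).
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    trace = []
    for cycle in range(cycles):
        value = rng.getrandbits(32)
        primary = value
        secondary = value
        if cycle == inject_at:
            secondary ^= 1 << bit
        trace.append(primary != secondary)
    return trace


if __name__ == "__main__":
    clean = simulate(8)
    faulty = simulate(8, inject_at=3, bit=17)
    print("without injection:", clean)
    print("with injection:   ", faulty)
```

In this model the clean run never raises the mismatch flag, while the faulty run raises it only in the cycle where the bit was flipped.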

Future work outlined in the document covers several redundancy schemes intended to improve fault tolerance: a Pair-and-Spare Mechanism (distinguishing between hot and cold pairs), Triple Modular Redundancy (TMR), and Quadruple Modular Redundancy. The document also mentions developing software recovery mechanisms to support these schemes, so that the system can keep functioning after a hardware failure.
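Two of the schemes named above can be sketched in a few lines of Python. This is a generic textbook illustration under stated assumptions, not the design described in the document: `tmr_vote` and `pair_and_spare` are hypothetical names, processor outputs are modeled as plain integers, and the pair-and-spare logic assumes that both pairs compute the same value at the same time.

```python
from typing import Optional, Tuple


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant outputs.

    Each result bit takes the value held by at least two of the inputs,
    so a fault confined to one module is masked.
    """
    return (a & b) | (a & c) | (b & c)


def pair_and_spare(active: Tuple[int, int],
                   spare: Tuple[int, int]) -> Optional[int]:
    """Select an output from two self-checking pairs.

    Each pair is checked by its own comparison. The active pair's value
    is used when its two outputs agree; otherwise the spare pair is
    tried. None is returned if neither pair agrees internally.
    """
    if active[0] == active[1]:
        return active[0]
    if spare[0] == spare[1]:
        return spare[0]
    return None


if __name__ == "__main__":
    good = 0b1010
    upset = good ^ 0b1000  # one module's output with a single flipped bit
    print(bin(tmr_vote(good, good, upset)))
    print(pair_and_spare((good, upset), (good, good)))
```

Unlike a single compared pair, both schemes can continue to deliver a usable output after one module fails, at the cost of additional processors.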

The document also presents a simulation model based on the ML300 design and describes its architecture and components, including the PLB (Processor Local Bus), the OPB (On-chip Peripheral Bus), and controllers for memory and external devices. The architecture is intended to support efficient communication between processors and peripherals while protecting data integrity through error detection and correction mechanisms.

A two-processor comparison block diagram shows the configuration of the primary and secondary PPC 405 processors, including their cache units, memory management units (MMUs), and arbiter components. The diagram illustrates how the two processors work together so that a fault in one of them can be detected.

Overall, the document summarizes ongoing research and development in fault-tolerant computing for aerospace use. It is intended to make the results of this work available for wider technological, scientific, and commercial applications. Further information and assistance can be obtained through NASA's Scientific and Technical Information Program Office.