Space missions that require the highest immunity to radiation-induced upsets typically turn to radiation-hardened (rad-hard) system designs. Unfortunately, these systems lag commercial performance by many generations, and their upset immunity is becoming less effective as the number of storage elements in newer processors, such as processor cache memory, increases. An adaptive fault-tolerant computer has been designed, built, and radiation-tested using the latest high-performance processors. The design integrates the most aggressive processor and memory error correction methods available. Potential applications for the computer include mission-critical single-board computers for spacecraft control, and research missions that require high data throughput and high computational performance without sacrificing reliability.

Figure 1: Major Blocks in the design, with error-correction methods for single-event effects.
Figure 2: Comparison, performed by NASA's JPL, of the Maxwell processors with the RAD6000 processors in a typical orbit and in a solar flare environment.

The design uses latch-up immune components (through component screening), tolerates a total dose of >100 krad in a typical orbit (through component screening and/or shielding), and has an uncorrected upset rate of less than one per 300 years (for the entire board in a typical GEO orbit). Figure 1 shows the major blocks in the design, along with their error correction methods for single-event effects. The processors use TMR (Triple Modular Redundancy), Processor Resynchronization, and Processor Scrubbing. The SDRAMs use double-device-correcting Reed-Solomon coding, and the FPGAs are Actel RT-AXS devices with TMR-hardened registers.

The processors are fully synchronous and in lock-step with each other, running at 400 to 800 MHz internally and 50 MHz externally, giving processing performance of up to 1,800 MIPS. Although three processors are used (IBM PowerPC 750FXs with built-in 512-Kbyte L2 cache), they offer higher performance, lower power consumption, and (after mitigation) greater immunity to upsets than one rad-hard processor. The TMR logic and control are implemented in an Actel RT-AXS FPGA and optimized to eliminate voting delays. On every clock cycle, the processor outputs are voted on and checked for disagreement. When a processor gives an incorrect response, it is isolated and held in reset, and the upset is logged.
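The per-cycle voting step can be illustrated in software. The following is a minimal Python sketch, not the FPGA logic used on the board: it assumes each processor's output is available as an integer word, and the function names `majority_vote` and `vote_and_isolate` are hypothetical.

```python
from typing import List, Tuple


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each result bit is the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def vote_and_isolate(outputs: Tuple[int, int, int]) -> Tuple[int, List[int]]:
    """Return the voted word and the indices of processors that disagree with it.

    Illustrative only: on the real board a disagreeing processor would be
    held in reset and the upset logged by the TMR controller.
    """
    a, b, c = outputs
    voted = majority_vote(a, b, c)
    faulty = [i for i, word in enumerate(outputs) if word != voted]
    return voted, faulty
```

With one upset processor, the two agreeing words determine the result and the returned index list names the processor to isolate. If the list holds more than one index, the single-upset assumption has been violated.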

At periodic intervals, the contents of the processors are flushed through the TMR logic (Processor Scrubbing), and the majority vote is stored in the Reed-Solomon-protected SDRAM as one validated processor image. All three processors are then reset and restored from that image, including the upset processor (Processor Resynchronization). No roll-back is needed, and no special user software is needed for recovery. The entire system is self-correcting, with a delay of less than 1 millisecond for each scrub. The scrubbing interval is software-programmable and can be configured to adapt to environmental changes, for example by increasing the scrubbing rate as processor upsets increase.
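The scrub-and-resynchronize sequence can be sketched in a few lines of Python. This is a simplified model under stated assumptions: each processor's state is represented as a list of integer words, the validated image is simply returned rather than written to Reed-Solomon-protected SDRAM, and all function names are hypothetical.

```python
from typing import List


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three words."""
    return (a & b) | (a & c) | (b & c)


def scrub(states: List[List[int]]) -> List[int]:
    """Flush the three processor states through the voter, word by word,
    and return one validated image."""
    s0, s1, s2 = states
    return [majority_vote(x, y, z) for x, y, z in zip(s0, s1, s2)]


def resynchronize(states: List[List[int]], image: List[int]) -> None:
    """Restore every processor, including any upset one, from the validated image."""
    for state in states:
        state[:] = image


def scrub_cycle(states: List[List[int]]) -> List[int]:
    """One scrub period: vote, then restore all three processors."""
    image = scrub(states)
    resynchronize(states, image)
    return image
```

In this model, upsets in different processors are corrected as long as no two processors are corrupted in the same bit of the same word. The board itself uses the stricter criterion that at most one processor is upset per scrubbing window.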

Radiation testing and analysis of the PEM (Prototype Engineering Module), done with the assistance of NASA's Jet Propulsion Laboratory (JPL) in Pasadena, CA, validated the upset rate predictions, which assume that the failure mode is two processors being upset within the same scrubbing window. All three processors were irradiated with heavy ions simultaneously at Texas A&M University's Cyclotron. All single processor errors were corrected. All double processor errors followed the mathematical model used in the predictions, and were detected and isolated. In those cases, the board rebooted itself and notified the user of the double processor upset.
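The double-upset failure mode lends itself to a simple probability estimate. The sketch below is a generic textbook model and not necessarily the one used in the JPL analysis: it assumes upsets arrive independently in each processor as a Poisson process with the same rate, and the function names and example numbers are hypothetical.

```python
import math


def p_single_upset(rate_per_s: float, window_s: float) -> float:
    """Probability that one processor is upset at least once in a scrubbing window
    (assumes Poisson arrivals at rate_per_s)."""
    return -math.expm1(-rate_per_s * window_s)


def p_double_upset(rate_per_s: float, window_s: float) -> float:
    """Probability that at least two of three processors are upset in the same window
    (assumes the three processors are independent and identical)."""
    p = p_single_upset(rate_per_s, window_s)
    return 3.0 * p * p * (1.0 - p) + p ** 3


def mean_time_to_double_upset(rate_per_s: float, window_s: float) -> float:
    """Approximate mean time, in seconds, between uncorrectable double upsets."""
    return window_s / p_double_upset(rate_per_s, window_s)
```

For small per-window probabilities the result is close to 3(λT)², where λ is the per-processor upset rate and T is the scrubbing window. Under these assumptions, halving the window cuts the per-window failure probability by about a factor of four and roughly doubles the mean time between double upsets.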

Because solar flare environments are so severe, JPL compared the computer with one of the most reliable and well-known rad-hard processors (the RAD6000) in a typical orbit as well as in a worst-case solar flare environment. Figure 2 shows that under most typical orbits, the computer's processor architecture demonstrated a 16,000X improvement in upset rate, and even under a worst-case solar flare condition, it showed a 9X improvement. The upset rates assume a 0.1-second scrubbing interval. They can be improved further with an adaptive scrubbing method, in which the scrub rate is increased as the board detects higher rates of single processor upsets (such as during a solar flare).
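One way an adaptive scrubbing policy could look is sketched below in Python. The article does not describe the actual policy, so everything here is an assumption for illustration: the halve-or-double rule, the thresholds, the 0.01-second lower bound, and the function name `next_scrub_interval` are all hypothetical. Only the 0.1-second default interval comes from the text.

```python
def next_scrub_interval(
    current_s: float,
    upsets_seen: int,
    *,
    high_threshold: int = 2,   # hypothetical: upsets per interval that trigger faster scrubbing
    min_s: float = 0.01,       # hypothetical floor, kept well above the <1 ms scrub delay
    max_s: float = 0.1,        # default interval quoted in the article
) -> float:
    """Pick the next scrubbing interval from the single-processor upsets logged
    during the interval that just ended."""
    if upsets_seen >= high_threshold:
        # Upset rate is rising (for example, a solar flare): scrub twice as often.
        return max(min_s, current_s / 2.0)
    if upsets_seen == 0:
        # Quiet interval: relax back toward the default to reduce scrub overhead.
        return min(max_s, current_s * 2.0)
    return current_s
```

Shortening the interval narrows the window in which a second processor can be upset before the first is restored, at the cost of spending a larger fraction of time in the scrub itself.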

This article was written by Robert Hillman and Larry Longden of Maxwell Technologies, with processor upset rate analysis provided by Gary M. Swift and Farokh Irom of NASA's Jet Propulsion Laboratory, California Institute of Technology. For more information, contact Maxwell Technologies at 858-503-3300 or visit www.maxwell.com.