Computations on parallel processors can continue, even if one processor fails.
When an algorithm is distributed across multiple threads executing on many distinct processors, a loss of one of those threads or processors can potentially result in the total loss of all the incremental results up to that point. When implementation is massively hardware distributed, then the probability of a hardware failure during the course of a long execution is potentially high. Traditionally, this problem has been addressed by establishing checkpoints where the current state of some or part of the execution is saved. Then in the event of a failure, this state information can be used to recompute that point in the execution and resume the computation from that point.
A serious problem arises when one distributes a problem across multiple threads and physical processors is that one increases the likelihood of the algorithm failing due to no fault of the scientist but as a result of hardware faults coupled with operating system problems. With good reason, scientists expect their computing tools to serve them and not the other way around.
What is novel here is a unique combination of hardware and software that reformulates an application into monolithic structure that can be monitored in real-time and dynamically reconfigured in the event of a failure.
This unique reformulation of hardware and software will provide advanced aeronautical technologies to meet the challenges of next-generation systems in aviation, for civilian and scientific purposes, in our atmosphere and in atmospheres of other worlds. In particular, with respect to NASA’s manned flight to Mars, this technology addresses the critical requirements for improving safety and increasing reliability of manned spacecraft
This work was done by Mark James of Caltech for NASA’s Jet Propulsion Laboratory. For further information, access the Technical Support Package (TSP) free on-line at www.techbriefs.com/tsp under the Information Sciences category.
In accordance with Public Law 96-517, the contractor has elected to retain title to this invention. Inquiries concerning rights for its commercial use should be addressed to:
Innovative Technology Assets Management
Mail Stop 202-233
4800 Oak Grove Drive
Pasadena, CA 91109-8099
Refer to NPO-42554, volume and number of this NASA Tech Briefs issue, and the page number.
This Brief includes a Technical Support Package (TSP).
Integrated Hardware and Software for No-Loss Computing (reference NPO-42554) is currently available for download from the TSP library.
Please Login at the top of the page to download.