When an algorithm is distributed across multiple threads executing on many distinct processors, the loss of one of those threads or processors can result in the total loss of all the incremental results up to that point. When the implementation is massively distributed across hardware, the probability of a hardware failure during the course of a long execution is potentially high. Traditionally, this problem has been addressed by establishing checkpoints at which the current state of all or part of the execution is saved. In the event of a failure, this state information can be used to reconstruct that point in the execution and resume the computation from there.
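The traditional checkpoint-and-resume approach described above can be illustrated with a minimal Python sketch. It is not the method of this Brief, only a generic example under stated assumptions: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON state layout, and the `fail_at` parameter that simulates a hardware fault are all hypothetical and chosen for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a crash mid-write cannot corrupt the checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=10, fail_at=None):
    """Sum the integers 0..n_steps-1, saving a checkpoint periodically.

    fail_at (hypothetical) simulates a hardware fault at the given step.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated failure at step {step}")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If a run fails partway through, calling `run` again with the same path resumes from the last saved step rather than from the beginning. Only the work done since the most recent checkpoint is repeated, which is the trade-off behind choosing `checkpoint_every`.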
A serious problem that arises when one distributes a problem across multiple threads and physical processors is that the likelihood of the algorithm failing increases, through no fault of the scientist, as a result of hardware faults coupled with operating system problems. With good reason, scientists expect their computing tools to serve them and not the other way around.
What is novel here is a combination of hardware and software that reformulates an application into a monolithic structure that can be monitored in real time and dynamically reconfigured in the event of a failure.
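The Brief does not describe how monitoring and reconfiguration are implemented. As a rough software-only illustration of the general idea, the Python sketch below tracks worker heartbeats and moves the tasks of a silent worker to the remaining live workers. Everything in it is an assumption made for illustration: the `Supervisor` class, its method names, the timeout rule, and the least-loaded assignment policy are hypothetical and are not taken from the JPL design.

```python
class Supervisor:
    """Tracks worker heartbeats and reassigns the tasks of silent workers."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}    # worker id -> time of last heartbeat
        self.assignments = {}  # worker id -> list of task ids

    def register(self, worker, now):
        """Add a worker to the pool and record it as alive at time now."""
        self.last_seen[worker] = now
        self.assignments.setdefault(worker, [])

    def heartbeat(self, worker, now):
        """Record a sign of life; ignore workers that are not in the pool."""
        if worker in self.last_seen:
            self.last_seen[worker] = now

    def assign(self, task):
        """Give the task to the live worker with the fewest tasks."""
        worker = min(self.assignments, key=lambda w: len(self.assignments[w]))
        self.assignments[worker].append(task)
        return worker

    def reconfigure(self, now):
        """Drop workers whose heartbeat is overdue and move their tasks."""
        failed = [w for w, t in self.last_seen.items()
                  if now - t > self.timeout]
        orphaned = []
        for w in failed:
            orphaned.extend(self.assignments.pop(w))
            del self.last_seen[w]
        if orphaned and not self.assignments:
            raise RuntimeError("no live workers left")
        for task in orphaned:
            self.assign(task)
        return failed
```

Time is passed in explicitly as `now` so the behavior is deterministic and easy to test. A real system would also need to restore each moved task's state, for example from a checkpoint, before resuming it.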
This reformulation of hardware and software will provide advanced aeronautical technologies to meet the challenges of next-generation systems in aviation, for civilian and scientific purposes, in our atmosphere and in the atmospheres of other worlds. In particular, with respect to NASA's manned flight to Mars, this technology addresses the critical requirements for improving the safety and increasing the reliability of manned spacecraft.
This work was done by Mark James of Caltech for NASA's Jet Propulsion Laboratory. For further information, access the Technical Support Package (TSP) free on-line at www.techbriefs.com/tsp under the Information Sciences category.
In accordance with Public Law 96-517, the contractor has elected to retain title to this invention. Inquiries concerning rights for its commercial use should be addressed to:
Innovative Technology Assets Management
JPL
Mail Stop 202-233
4800 Oak Grove Drive
Pasadena, CA 91109-8099
(818) 354-2240
Refer to NPO-42554, volume and number of this NASA Tech Briefs issue, and the page number.
This Brief includes a Technical Support Package (TSP).

Integrated Hardware and Software for No-Loss Computing (reference NPO-42554) is currently available for download from the TSP library.
Overview
The document discusses an innovative approach to "No-Loss Computing" developed by NASA's Jet Propulsion Laboratory (JPL). It addresses the challenges associated with massively distributed computing systems, particularly in the context of scientific applications that require high levels of detail and accuracy. As computational demands grow, the likelihood of hardware failures during long executions increases, which can lead to significant data loss if not properly managed.
To mitigate this risk, the document highlights the traditional method of establishing checkpoints. These checkpoints save the current state of the computation, allowing for recovery and resumption in the event of a failure. However, when algorithms are distributed across multiple threads and processors, the loss of a single thread or processor can result in the total loss of all incremental results up to that point. This presents a critical challenge for scientists who rely on their computing tools to function reliably.
The proposed solution combines hardware and software to reformulate applications into a monolithic structure that can be monitored in real time and dynamically reconfigured in the event of a failure. This capability allows the system to respond to failures effectively, thereby enhancing the reliability and safety of manned spacecraft and other advanced aeronautical technologies. The document emphasizes the importance of this technology for future missions, including NASA's manned flight to Mars, where safety and reliability are paramount.
In summary, the document outlines a comprehensive strategy for improving the resilience of distributed computing systems through integrated hardware and software solutions. By addressing the inherent risks of hardware failures and providing mechanisms for real-time monitoring and recovery, this approach aims to support the growing complexity of scientific computations and ensure the success of critical aerospace missions. The insights provided in this Technical Support Package are part of NASA's broader efforts to make aerospace-related developments available for wider technological, scientific, and commercial applications.

