The Health Manager can detect “bad health” before a failure occurs by periodically monitoring the application software: it looks for code corruption and sanity-checks each critical data value before use. A processor’s memory can fail and corrupt the software, or the software can accidentally write to the wrong address and overwrite the executing code. This innovation continuously calculates a checksum of the software load to detect corrupted code, which allows a system to detect the fault before it causes a failure.
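The two checks described above can be illustrated with a short host-side sketch in Python. It is not the flight implementation. It assumes CRC-32 as the checksum (the text says only “checksum”), and the names `load_checksum`, `code_is_intact`, and `value_is_sane` are hypothetical. The checksum is accumulated one chunk at a time, which is one way a periodic task could spread the work over many cycles.

```python
import zlib


def load_checksum(image: bytes, chunk_size: int = 64) -> int:
    """Return the CRC-32 of a software image, accumulated chunk by chunk.

    CRC-32 is an assumption made for this sketch; the source does not
    say which checksum algorithm is used.
    """
    crc = 0
    for start in range(0, len(image), chunk_size):
        # zlib.crc32 accepts a running value as its second argument.
        crc = zlib.crc32(image[start:start + chunk_size], crc)
    return crc & 0xFFFFFFFF


def code_is_intact(image: bytes, reference: int) -> bool:
    """Compare the current checksum against the value stored at build time."""
    return load_checksum(image) == reference


def value_is_sane(value: float, low: float, high: float) -> bool:
    """Range-check a critical value before use (NaN fails the comparison)."""
    return low <= value <= high


# Stand-in for the software load; a real system would read its own code memory.
image = bytes(range(256)) * 4
reference = load_checksum(image)

corrupted = bytearray(image)
corrupted[100] ^= 0x01  # flip a single bit, as a memory fault might
```

Here `code_is_intact(image, reference)` holds for the original image and fails for `corrupted`, because a CRC detects any single-bit error.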
This innovation monitors each software task (thread), so that if any task reports “bad health” or fails to report to the Health Manager, the system is declared bad. The Health Manager reports overall system health to the outside world by outputting a square-wave signal. If the square wave stops, either the system health is bad or the system is hung and cannot report. Either way, “bad health” is detected, whether it is caused by an error, corrupted data, or a hung processor.
A separate Health Monitor Task runs periodically in a loop that pends on a semaphore. Each monitored task registers with the Health Manager, which maintains a count for the task. The registering task must indicate whether it will run more or less often than the Health Manager. If the task runs more often than the Health Manager, the monitored task calls a health function that increments the count and verifies that it did not exceed the max-count; when the periodic Health Manager runs, it verifies that the count did not exceed the max-count and zeroes it. If the task runs less often than the Health Manager, the roles are reversed: the periodic Health Manager increments the count and the monitored task zeroes it, and both verify that the count did not exceed the max-count.
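The counting scheme for the two kinds of task can be sketched as follows. This is a single-threaded Python simulation under stated assumptions, not the original code: the class and method names (`HealthManager`, `register`, `report`, `periodic_check`) are hypothetical, and a real implementation would protect the counts against concurrent access. The sketch implements only the checks the text describes.

```python
from dataclasses import dataclass


@dataclass
class Monitored:
    max_count: int
    faster_than_manager: bool  # declared by the task when it registers
    count: int = 0


class HealthManager:
    def __init__(self):
        self.tasks = {}
        self.healthy = True  # only ever changes from True to False

    def register(self, name, max_count, faster_than_manager):
        self.tasks[name] = Monitored(max_count, faster_than_manager)

    def _verify(self, task):
        if task.count > task.max_count:
            self.healthy = False

    def report(self, name):
        """Health function called by a monitored task each time it runs."""
        task = self.tasks[name]
        if task.faster_than_manager:
            task.count += 1      # fast task: the task increments
            self._verify(task)
        else:
            self._verify(task)   # slow task: the task verifies, then zeroes
            task.count = 0

    def periodic_check(self):
        """One pass of the periodic Health Manager over all registered tasks."""
        for task in self.tasks.values():
            if task.faster_than_manager:
                self._verify(task)   # fast task: the manager verifies, then zeroes
                task.count = 0
            else:
                task.count += 1      # slow task: the manager increments
                self._verify(task)
        return self.healthy


manager = HealthManager()
manager.register("fast", max_count=5, faster_than_manager=True)
manager.register("slow", max_count=3, faster_than_manager=False)
```

With these settings, a “slow” task that stops calling `report` drives its count past 3 after four manager passes, at which point `periodic_check` returns `False`.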
The Health Manager reports the system health status to the outside world by toggling an output pin, which creates a square-wave signal. If the system hangs completely before reporting its health status, the square wave is no longer generated. The absence of the square wave, whether intentional or because the Health Manager is hung, indicates bad health, analogous to a deadman switch. This is done by creating a Health Manager Reporting Task, which loops and pends on a semaphore. A timer Interrupt Service Routine gives the semaphore, which allows the Health Manager Reporting Task to run. When the task receives the semaphore, it reads the system health status. If the status is good, it toggles the output pin. If the status is bad, it latches the system’s bad-health variable so that the status can never switch back to good, and it stops the square wave.
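The reporting loop and its latch can be sketched in Python as below. This is an illustrative simulation, not the original implementation: `HealthReporter`, `timer_isr`, and `run_once` are hypothetical names, a `threading.Semaphore` stands in for the RTOS semaphore, and the output pin is modeled as an integer whose level is recorded on every tick.

```python
import threading


class HealthReporter:
    def __init__(self, read_status):
        self.read_status = read_status      # callable: True when health is good
        self.tick = threading.Semaphore(0)  # given by the (simulated) timer ISR
        self.latched_bad = False
        self.pin = 0                        # simulated output pin level
        self.trace = []                     # pin level recorded on each tick

    def timer_isr(self):
        """Simulated timer interrupt: give the semaphore."""
        self.tick.release()

    def run_once(self, timeout=1.0):
        """One iteration of the reporting task loop; False if no tick arrived."""
        if not self.tick.acquire(timeout=timeout):
            return False
        if not self.read_status():
            self.latched_bad = True  # latch: never cleared once set
        if not self.latched_bad:
            self.pin ^= 1            # good health: toggle the pin
        self.trace.append(self.pin)
        return True


status = {"ok": True}
reporter = HealthReporter(lambda: status["ok"])
for _ in range(4):
    reporter.timer_isr()
    reporter.run_once()
```

While the status is good, `reporter.trace` alternates between 1 and 0. Once `read_status` returns `False` on any tick, the pin level stops changing, and it stays frozen even if the status later reads good again.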
This work was done by Roger Zoerner of Kennedy Space Center. KSC-12809