Distributed systems have become an integral part of safety-critical computing applications, necessitating system designs that incorporate complex fault-tolerant resource management functions to provide globally coordinated operations with ultra-reliability. As a result, robust clock synchronization has become a required fundamental component of fault-tolerant, safety-critical distributed systems. Since physical oscillators are inherently imperfect, local clocks of nodes of a distributed system, driven by these oscillators, do not keep perfect time and can drift with respect to real time and one another. Thus, the local clocks of the nodes must periodically be re-synchronized. As a result, a fault-tolerant system needs a clock synchronization algorithm that tolerates imprecise local clocks and faulty behavior by some processes.
NASA has a strategy and an algorithm for solving the Byzantine general problem for self-stabilizing a fully connected network from an arbitrary state and in the presence of any number of faults with various severities including any number of arbitrary (Byzantine) faulty nodes. The solution applies to realizable systems, while allowing for differences in the network elements, provided that the number of arbitrary faults is not more than a third of the network size.
The only constraint on the behavior of a node is that the interactions with other nodes are restricted to defined links and interfaces. The solution does not rely on assumptions about the initial state of the system and no central clock or centrally generated signal, pulse, or message is used. Nodes are anonymous, i.e., they do not have unique identities. There is also a mechanical verification of a proposed protocol. A bounded model of the protocol is verified using the Symbolic Model Verifier (SMV). The model checking effort is focused on verifying correctness of the bounded model of the protocol as well as confirming claims of determinism and linear convergence with respect to the self-stabilization period.