Deterministic Annealing Clustering

NASA’s Jet Propulsion Laboratory, Pasadena, California

This software partitions a dataset into clusters of data using the Deterministic Annealing algorithm, a more sophisticated clustering technique than usually available in scientific and commercial software. The results provided are deterministic, i.e. unique (within the machine precision tolerance), whereas results provided by commercially available scientific software usually are not. For instance, results provided by the popular K-means depend on random seeds and therefore necessitate multiple runs to achieve a reasonable convergence. The uniqueness or robustness of the solution is an essential feature for scientific use.

What sets this algorithm apart is its cost function and the method used to minimize the cost function. The algorithm minimizes the (cost) function F=D−TH, where D is the distortion, or distance, between the data points and the cluster candidates, and H is the entropy that measures the information content or randomness level of the data. T is a Lagrange multiplier. The methodology (and notation) is inspired from statistical physics where the function F is known as the Helmholtz free energy. The distribution minimizing F is obtained by minimizing D subject to the constraint of a given randomness level H.

The algorithm is qualified as deterministic because the distribution that minimizes F is obtained directly, i.e. analytically; this distribution is the Gibbs distribution. This feature guarantees that the solution with the deterministic algorithm will be unique, while the same on all platforms, for different users.

This software is available for license through the Jet Propulsion Laboratory, and you may request a license here . NPO-49750