Traditionally, nodes in a sensor network simply collect data and then pass it on to a centralized node that archives, distributes, and possibly analyzes the data. However, analysis at the individual nodes could enable faster detection of anomalies or other interesting events, as well as faster responses such as sending out alerts or increasing the data collection rate. There is an additional opportunity for increased performance if individual nodes can communicate directly with their neighbors.
Previously, a method was developed by which machine learning classification algorithms could collaborate to achieve high performance autonomously (without requiring human intervention). This method worked for supervised learning algorithms, in which labeled data is used to train models. The learners collaborated by exchanging labels describing the data. The new advance enables clustering algorithms, which do not use labeled data, to collaborate as well. This is achieved by defining a new language for collaboration that uses pairwise constraints to encode information useful to other learners. Each constraint specifies that two items must, or cannot, be placed into the same cluster. Previous work has shown that incorporating such constraints improves clustering performance even for a single learner working in isolation.
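To make the constraint language concrete, the following minimal sketch shows one possible encoding of pairwise constraints and a check of a cluster assignment against them. The tuple layout, the "must" and "cannot" strings, and the function name count_violations are illustrative assumptions, not the representation used in the original work.

```python
from typing import List, Tuple

# Assumed encoding: (item index, item index, kind), kind is "must" or "cannot".
Constraint = Tuple[int, int, str]


def count_violations(labels: List[int], constraints: List[Constraint]) -> int:
    """Count the constraints that a cluster assignment fails to satisfy."""
    violations = 0
    for i, j, kind in constraints:
        same_cluster = labels[i] == labels[j]
        if kind == "must" and not same_cluster:
            violations += 1
        elif kind == "cannot" and same_cluster:
            violations += 1
    return violations


# Toy assignment of four items to two clusters (made-up values).
labels = [0, 0, 1, 1]
constraints = [(0, 1, "must"), (1, 2, "cannot"), (2, 3, "cannot")]
print(count_violations(labels, constraints))  # 1: items 2 and 3 share a cluster
```

A constrained clustering algorithm would try to drive this count toward zero while still grouping similar items together.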
In the problem formulation, each learner resides at a different node in the sensor network and makes observations (collects data) independently of the other learners. Each learner clusters its data, selects a pair of items whose relationship it is uncertain about, and sends that pair to its neighbors as a query. Each neighbor responds with a “must” or “cannot” constraint. The learner combines these responses into a single consensus constraint and then re-clusters its data while incorporating the new constraint. Because the accumulated constraint sets may contain conflicting constraints, a strategy was also proposed for “cleaning” them; this improves performance significantly. This approach has been applied to collaborative clustering of seismic and infrasonic data collected by the Mount Erebus Volcano Observatory in Antarctica.
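The consensus and cleaning steps described above can be sketched as follows. The text does not state how responses are combined or how conflicts are resolved, so this sketch assumes a simple majority vote (with ties yielding no constraint) and a cleaning rule that discards any pair carrying both kinds of constraint. The function names consensus and clean are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple

Pair = Tuple[int, int]
Constraint = Tuple[int, int, str]


def consensus(votes: List[str]) -> Optional[str]:
    """Majority vote over neighbor responses (assumed rule); None on a tie."""
    counts = Counter(votes)
    must, cannot = counts["must"], counts["cannot"]
    if must == cannot:
        return None
    return "must" if must > cannot else "cannot"


def clean(constraints: List[Constraint]) -> List[Constraint]:
    """Drop every pair that has both a 'must' and a 'cannot' (assumed rule)."""
    kinds: Dict[Pair, Set[str]] = {}
    for i, j, kind in constraints:
        # Normalize the pair so that (i, j) and (j, i) are treated as the same.
        kinds.setdefault((min(i, j), max(i, j)), set()).add(kind)
    return [(i, j, next(iter(k))) for (i, j), k in kinds.items() if len(k) == 1]


print(consensus(["must", "cannot", "must"]))
print(clean([(0, 1, "must"), (1, 0, "cannot"), (2, 3, "cannot")]))
```

Other choices are plausible, such as weighting each neighbor's response by its confidence or keeping the more frequently asserted constraint for a conflicting pair.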
Previous approaches to distributed clustering cannot readily be applied in a sensor network setting, because they assume that each node has the same “view” of the data set. A view is the set of features used to represent each object. Distributed clustering works when a single data set is partitioned across several computational nodes, because all objects then have the same view. But when the data is collected from different locations, using different sensors, a more flexible approach is needed. The new approach instead handles situations in which each node has a different view of the data (e.g., seismic vs. infrasonic sensors) but all nodes observe the same events. Because the events are shared, the nodes can exchange information about the likely cluster membership relations between objects, even if they do not use the same features to represent the objects.
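The following sketch illustrates why shared events are sufficient for collaboration across views. Two nodes hold feature vectors of different lengths for the same event identifiers, yet each can answer a pair query using only its own cluster assignments. The Node class, the event IDs, and all feature values and labels are made up for illustration and are not taken from the Mount Erebus data.

```python
from typing import Dict, Sequence


class Node:
    """A learner with its own feature view of shared events (hypothetical)."""

    def __init__(self, features: Dict[str, Sequence[float]], labels: Dict[str, int]):
        self.features = features  # event ID -> this node's feature vector
        self.labels = labels      # event ID -> cluster assigned by this node

    def answer(self, event_a: str, event_b: str) -> str:
        """Answer a pair query from this node's own cluster assignments."""
        return "must" if self.labels[event_a] == self.labels[event_b] else "cannot"


# Made-up values: three features per event in one view, two in the other.
seismic = Node(
    features={"e1": [0.9, 0.1, 0.3], "e2": [0.8, 0.2, 0.4], "e3": [0.1, 0.7, 0.9]},
    labels={"e1": 0, "e2": 0, "e3": 1},
)
infrasonic = Node(
    features={"e1": [12.0, 3.5], "e2": [2.0, 9.1], "e3": [1.8, 8.7]},
    labels={"e1": 0, "e2": 1, "e3": 1},
)

responses = [node.answer("e1", "e2") for node in (seismic, infrasonic)]
print(responses)  # ['must', 'cannot']
```

Only event identifiers and constraint kinds cross the network in this sketch; the feature vectors never leave their nodes, which is what allows the views to differ.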