When presented with a new data set, a common initial goal is to explore its contents in a discovery mode to find items of interest. However, each user who views the data set may have a different scientific goal in mind, and therefore a different desired prioritization of the items for examination. Further, as the users explore more of the data set, they accumulate concrete examples of what is or is not of interest. The goal of this work was to formalize this iterative approach to understanding large data sets, and instantiate it with methods capable of the necessary adaptation as the system iteratively acquires user feedback.
An iterative discovery solution was developed that allows a domain expert to quickly and easily locate phenomena of interest. The scientist achieves this by providing feedback on items deemed uninteresting. This “negative space” approach avoids prematurely specializing the notion of what is interesting to one particular type of observation. It therefore addresses one of the primary obstacles to wider adoption of machine learning and other automated data analysis systems: the fear that an important discovery will be overlooked because it was not anticipated by the system designers. In seeking to understand new observations, it is the ones that cannot be predicted that may yield the most valuable new knowledge.
The resulting machine learning solution is called DEMUD (Discovery through Eigenbasis Modeling of Uninteresting Data). It works by building a model of the uninteresting class and then identifying items that are maximally anomalous (and therefore likely to be interesting) with respect to that model.
For scalability to large data sets, a linear model was selected that can be easily updated as new feedback is acquired. A low-dimensional eigenbasis representation of the uninteresting items is computed via singular value decomposition. The unreviewed items are then ranked by their reconstruction error, which indicates how different they are from the current model of “uninterestingness.” At each iteration, the top-scoring observation is presented to the user to obtain feedback. Items deemed uninteresting are used to update the model. Interesting items are retained by the user.
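The loop described above can be illustrated with a minimal sketch. It is not the DEMUD implementation itself: the function names (fit_eigenbasis, reconstruction_error, discovery_loop), the is_uninteresting callback standing in for the user, and the parameters k and n_iter are all hypothetical. The sketch also assumes mean-centered data, a squared Euclidean reconstruction error, and several seed items already labeled uninteresting, and for simplicity it recomputes the full SVD at each iteration rather than updating the model incrementally.

```python
import numpy as np

def fit_eigenbasis(X_unint, k):
    """Model the uninteresting items: return their mean and a basis U
    whose columns are (up to) the top-k right singular vectors."""
    mu = X_unint.mean(axis=0)
    # Rows are items, so the right singular vectors live in feature space.
    _, _, Vt = np.linalg.svd(X_unint - mu, full_matrices=False)
    return mu, Vt[:k].T

def reconstruction_error(X, mu, U):
    """Squared error left after projecting each row of X onto the basis U."""
    centered = X - mu
    residual = centered - centered @ U @ U.T
    return np.sum(residual ** 2, axis=1)

def discovery_loop(X, seed_idx, is_uninteresting, k=2, n_iter=5):
    """Repeatedly show the most anomalous unreviewed item to the 'user'.

    is_uninteresting(i) -> bool is a hypothetical stand-in for user feedback.
    Returns (interesting, uninteresting) lists of item indices.
    """
    seed = set(seed_idx)
    uninteresting = list(seed_idx)
    interesting = []
    unreviewed = [i for i in range(len(X)) if i not in seed]
    for _ in range(n_iter):
        if not unreviewed:
            break
        # Simplification: refit from scratch instead of an incremental update.
        mu, U = fit_eigenbasis(X[uninteresting], k)
        scores = reconstruction_error(X[unreviewed], mu, U)
        top = unreviewed.pop(int(np.argmax(scores)))
        if is_uninteresting(top):
            uninteresting.append(top)   # feedback refines the model
        else:
            interesting.append(top)     # retained by the user, never modeled
    return interesting, uninteresting
```

Note that interesting items are set aside and never enter the model, so only the uninteresting class shapes what is ranked next.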
DEMUD is innovative in its focus on modeling the uninteresting data only. By doing so, it avoids “premature specialization,” which is what happens when the system over-trains on a few examples of interesting items. Such a system can fall into the trap of looking for more items like the known ones, while remaining ignorant of other, very different items that are also interesting. Since DEMUD only learns what to ignore, it stays open to discovery throughout the learning process.