Ordering data from most to least useful replaces quality flags, improves climate science results, prioritizes images for analysis, and guides analysts for optimal data filtration.
Observations in modern datasets have a continuum of quality that can be hard to quantify. For example, satellite observations are subject to often-subtle mixtures of confounding forces that degrade an observation’s utility to varying extents. For the Orbiting Carbon Observatory-2 (OCO-2), cloud cover, atmospheric aerosols, and surface roughness are three major confounding forces that can mildly, heavily, or totally confound an observation’s utility. These complicating factors are not present in a binary fashion: clouds can cover a percentage of the scene, and can have variable opacity and differing topology. Traditionally, arbitrary thresholds are placed on the presence of such forces to yield a binary good/bad data flag for each observation. A data ordering instead guides users towards the most reliable data first, followed by increasingly challenging observations. No harsh on/off threshold is applied to the data; such a threshold could hide useful data from one user while leaving confounded observations in for another. Allowing users to create custom filters based on DOGO’s data ordering leaves hard cutoff decisions in the hands of users, guided but not restricted by the project’s expert knowledge.
Traditionally, quality flags provided a binary yes/no estimation of a datapoint’s utility. Normally, scientists would first discard all “bad data” so indicated, and work only with the “good data” as defined by the project. However, modern instrumentation provides significant auxiliary information for each datapoint, which enables prediction of the likely utility of the observation with finer resolution than 0 or 1. To do this, many different filters are developed, each more stringent than the last in terms of a goal metric of data quality. With this sorted list of filters, each datapoint can be assigned a single integer ranging from 0 to 19, indicating how many of the filters would reject it. Ordering the data by these integers communicates to the user the order in which observations should be preferred, without actually filtering any of them away. A user is then free to define their own filter based on the integer range they accept, and to rapidly communicate the resulting dataset to a collaborator.
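The counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the mission's implementation: the field name `cloud_fraction`, the helper `make_threshold_filter`, and the 19 evenly spaced thresholds are all hypothetical stand-ins for the real, optimized filters.

```python
from typing import Callable, Dict, List, Sequence

Observation = Dict[str, float]
# A filter returns True when it REJECTS the observation.
Filter = Callable[[Observation], bool]


def make_threshold_filter(field: str, threshold: float) -> Filter:
    """Hypothetical filter: reject when a metadata field exceeds a threshold."""
    def rejects(obs: Observation) -> bool:
        return obs[field] > threshold
    return rejects


def warn_level(obs: Observation, filters: Sequence[Filter]) -> int:
    """Count how many of the filters would reject this observation."""
    return sum(1 for rejects in filters if rejects(obs))


def order_by_warn_level(observations: Sequence[Observation],
                        filters: Sequence[Filter]) -> List[Observation]:
    """Sort from most to least preferred; nothing is discarded."""
    return sorted(observations, key=lambda obs: warn_level(obs, filters))


# Assumption: 19 filters of increasing stringency (0.95 down to 0.05),
# so the resulting integer ranges from 0 to 19.
filters = [make_threshold_filter("cloud_fraction", i / 20) for i in range(19, 0, -1)]

observations = [
    {"id": 1, "cloud_fraction": 0.98},
    {"id": 2, "cloud_fraction": 0.02},
    {"id": 3, "cloud_fraction": 0.52},
]

levels = [warn_level(obs, filters) for obs in observations]  # [19, 0, 10]
ordered_ids = [obs["id"] for obs in order_by_warn_level(observations, filters)]  # [2, 3, 1]
```

Because the filters are nested in this toy setup, the integer behaves like a rank of how confounded the observation is. Real filters may combine several metadata fields.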
These ordering integers are called Warn Levels, and they can be developed for any metadata-rich data source to help guide researchers in proper data filtration. One application, which requires spatial uniformity, a minimized likelihood of convergence failure, and minimum scatter, is the preferential selection of only “the best” data to process in real time from the OCO-2 mission. This supports the mission’s level 1 requirement that at least 6% of the streaming data be successfully processed as quickly as possible. A second, related application produces Warn Levels that help users know the order in which to ingest the mission’s output data for their analysis, in effect forming a “tunable filter” that lets users decide how much data to accept.
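A “tunable filter” of this kind might look like the following Python sketch. The function names and the sample Warn Levels are hypothetical; the 6% figure is used only because it is the fraction quoted above, and the real-time selection logic of the mission is not described in this article.

```python
from collections import Counter
from typing import List, Sequence


def accept(warn_levels: Sequence[int], max_level: int) -> List[int]:
    """Return the indices of observations whose Warn Level is <= max_level."""
    return [i for i, level in enumerate(warn_levels) if level <= max_level]


def smallest_cutoff_for_fraction(warn_levels: Sequence[int],
                                 target_fraction: float) -> int:
    """Find the lowest Warn Level cutoff that keeps at least target_fraction of the data."""
    if not warn_levels:
        raise ValueError("no observations supplied")
    counts = Counter(warn_levels)
    total = len(warn_levels)
    kept = 0
    for level in sorted(counts):
        kept += counts[level]
        if kept / total >= target_fraction:
            return level
    return max(counts)  # only reached if target_fraction > 1


# Hypothetical Warn Levels for ten observations.
sample_levels = [0, 0, 1, 3, 3, 5, 9, 12, 19, 19]

user_subset = accept(sample_levels, max_level=3)                 # [0, 1, 2, 3, 4]
realtime_cutoff = smallest_cutoff_for_fraction(sample_levels, 0.06)  # 0
half_cutoff = smallest_cutoff_for_fraction(sample_levels, 0.5)       # 3
```

A collaborator can reproduce `user_subset` from the single number `max_level=3`, which is what makes the chosen filter easy to communicate.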
The algorithm described here is not simply a Warn Level generator for OCO-2, but rather an entire method for constructing new Warn Levels for any metadata-rich data source. It is a genetic algorithm, coupled with a voting scheme and feature selection and designed to run on a supercomputer, that explores the high-dimensional space of all possible filters and combinations of filters to yield the best-performing single filters, pairs, triples, and so on. These are then folded into a Warn Level estimate. By exploring all possible filters and then folding this information into a single data ordering, one is able to achieve far more than even an optimum quality flag could provide.
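To make the idea of a genetic search over filters concrete, here is a deliberately tiny Python sketch. It is not the algorithm described above: it evolves a single threshold on one hypothetical metadata field (`cloud`) over synthetic data, and it omits the voting scheme, feature selection, and multi-filter combinations entirely. The fitness function (scatter of the retained errors, subject to keeping a minimum fraction of the data) is likewise an assumption made for illustration.

```python
import random
import statistics


def make_synthetic(n, rng):
    """Hypothetical data: retrieval error scatter grows with a 'cloud' field."""
    data = []
    for _ in range(n):
        cloud = rng.random()
        error = rng.gauss(0.0, 0.2 + 2.0 * cloud)
        data.append((cloud, error))
    return data


def fitness(threshold, data, keep_fraction):
    """Lower is better: scatter of the data that the filter retains.

    Filters that keep too little data are treated as infeasible.
    """
    kept = [err for cloud, err in data if cloud <= threshold]
    if len(kept) < max(2, keep_fraction * len(data)):
        return float("inf")
    return statistics.pstdev(kept)


def evolve_threshold(data, keep_fraction, rng, pop_size=20, generations=30):
    """Toy genetic search: selection, blend crossover, Gaussian mutation."""
    population = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda t: fitness(t, data, keep_fraction))
        parents = ranked[: pop_size // 2]      # selection; parents survive unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = (a + b) / 2                # crossover: average two parents
            child += rng.gauss(0.0, 0.05)      # mutation: small random shift
            children.append(min(1.0, max(0.0, child)))
        population = parents + children
    return min(population, key=lambda t: fitness(t, data, keep_fraction))


rng = random.Random(0)                          # fixed seed for repeatability
data = make_synthetic(500, rng)
best = evolve_threshold(data, keep_fraction=0.5, rng=rng)
```

Repeating such a search for a sequence of `keep_fraction` values would give a family of increasingly stringent filters, which is the kind of sorted list that a Warn Level count needs as input.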
Moreover, creating the final Warn Level ordering requires exploring precisely which metadata strongly predict retrieval confounding, and this exploration yields great project insight into sources of error. These insights can be, and were, used to guide algorithmic improvements, a-priori tuning, and atmospheric science interpretation of the retrieval algorithm’s behavior, while also yielding quick detection of serious yet subtle code abnormalities. In fact, this early “feature selection” phase may yield even more useful information and guidance than the final Warn Levels themselves.
This algorithm has been significantly sped up and further adapted to take advantage of OCO-2 data. Its footprint on the cluster computer hosting it has been minimized, and its output is now processed into a more immediately interpretable form.
This work was done by Lukas Mandrake of Caltech for NASA’s Jet Propulsion Laboratory.