The intermediate Palomar Transient Factory (iPTF) is a wide-field survey of the optical transient sky (e.g., supernovae, variable stars) that uses image subtraction to discover astronomical transients. Astronomical transients, such as supernovae, are observable only on a timescale of weeks to months. To study supernova physics thoroughly, it is important to detect a supernova as early as possible, so that follow-up assets can begin observing the event at multiple wavelengths before its decline.

The process of image subtraction requires alignment and photometric matching between an input image and its associated reference image. This is a difficult, complex process that remains an open area of research. As a result, artifacts of the subtraction process vastly outnumber the candidates associated with true astronomical sources. The iPTF pipeline produces on the order of 10⁴ to 10⁶ candidates, depending on the observed fields and conditions, a data rate that far exceeds what humans can review. This mandates automated vetting to filter out false positives and to prioritize the candidates that are worthy of follow-up resources.

Automated machine learning systems are already in use at iPTF, including one provided by NASA’s Jet Propulsion Laboratory (JPL), deployed in the iPTF pipeline at the National Energy Research Scientific Computing Center (NERSC). These triaging systems take a traditional machine learning approach, training a classifier (a random forest) to discriminate candidates extracted from subtracted images as “real” or “bogus.” iPTF has since developed a second pipeline, at the Infrared Processing and Analysis Center (IPAC), that is more robust to the crowded fields of the galactic plane, and a new Real-Bogus system was developed for that pipeline.

The Real-Bogus system at IPAC is an automated decision model for iPTF with an empirical missed detection rate (false negative rate) of 3.5% at a 1% false positive rate. It is a set of Python libraries that runs machine learning software to classify candidate detections of transient astronomical phenomena as either real or bogus.
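
As a concrete illustration of that operating point, the sketch below measures the missed detection rate at a fixed false positive rate from held-out classifier scores. The function, the NumPy usage, and the score convention (scores near 1.0 mean “real”) are illustrative assumptions, not the deployed code:

```python
import numpy as np

def fnr_at_fpr(real_scores, bogus_scores, target_fpr=0.01):
    """False negative (missed detection) rate at the score threshold
    that yields `target_fpr` on the bogus examples. Candidates scoring
    at or above the threshold pass the cut as "real"."""
    real_scores = np.asarray(real_scores)
    bogus_scores = np.asarray(bogus_scores)
    # The threshold is the (1 - target_fpr) quantile of the bogus
    # scores: only a target_fpr fraction of bogus examples exceed it.
    threshold = np.quantile(bogus_scores, 1.0 - target_fpr)
    # Reals that fall below the threshold are missed detections.
    return np.mean(real_scores < threshold), threshold
```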

The system was built through the following process:

  1. Reprocess NERSC pipeline images containing spectroscopically confirmed astronomical transients and their light curve observations through the IPAC subtracted image pipeline (PTF Image Differencing and Extraction, or PTFIDE).
  2. Query PTFIDE for sources within three arcseconds of the NERSC candidates and declare this set to be true astronomical sources (a positional cross-match sketch follows this list).
  3. Randomly sample the database for bogus candidates, excluding all confirmed reals found in step 2.
  4. Create a training set of real and bogus examples from the labeled data, holding out approximately 1K real examples and 10K random bogus examples as independent test sets. The bogus examples are selected in a novel way compared to prior art: the database is sampled randomly so that the sample is consistent with the true distributions of two parameters, the number of sources per image and the limiting magnitude of the image. This ensures a broad sample over a range of observing conditions (for which limiting magnitude serves as a proxy) and subtraction quality (for which the number of sources per image serves as a proxy); see the sampling sketch after this list.
  5. Use active learning to reduce training set contamination. Active learning drives a Web interface that presents a science user with examples that are likely mislabeled; the reviewer identifies false negatives in the bogus set and false positives in the real set, and the labels are updated (see the review-queue sketch after this list).
  6. Using all available database features, train a random forest classifier with 300 trees. The forest outputs a score based on probabilities calibrated from the raw random forest output. Measure false positive and false negative rate performance via cross validation (a training sketch follows this list).
  7. Check histograms of scores on the independent test sets. On the 1K set of reals, scores should be skewed toward 1.0. On the 10K random set of bogus examples, scores should be skewed toward 0.0.
  8. Find the thresholds that yield a 5% false negative rate on the independent test set of real examples and a 1% false positive rate on the independent test set of bogus examples. Use the 5% false-negative threshold only if that same threshold results in a <1% false positive rate; otherwise, use the 1% false-positive threshold (see the threshold sketch after this list).
  9. Wrap the call to the random forest in a Python wrapper that takes a list of candidate IDs as input and queries the database to build a structured feature set compatible with the trained classifier (a minimal wrapper sketch closes the examples below).
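
The sketches below flesh out several of the steps above. Each is a minimal illustration under stated assumptions rather than the deployed software. First, the positional cross-match of step 2: assuming candidate coordinates in degrees from both pipelines, Astropy’s catalog matching associates PTFIDE sources with confirmed NERSC transients inside the 3-arcsecond radius:

```python
import astropy.units as u
from astropy.coordinates import SkyCoord

def match_reals(nersc_ra, nersc_dec, ptfide_ra, ptfide_dec,
                radius_arcsec=3.0):
    """Indices of PTFIDE candidates lying within `radius_arcsec` of a
    spectroscopically confirmed NERSC transient; these candidates are
    declared "real". All coordinates are in degrees."""
    confirmed = SkyCoord(ra=nersc_ra * u.deg, dec=nersc_dec * u.deg)
    candidates = SkyCoord(ra=ptfide_ra * u.deg, dec=ptfide_dec * u.deg)
    # Nearest PTFIDE candidate to each confirmed transient.
    idx, sep2d, _ = confirmed.match_to_catalog_sky(candidates)
    # Keep only matches inside the association radius.
    return idx[sep2d < radius_arcsec * u.arcsec]
```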
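
For the sampling of steps 3 and 4, one plausible scheme (the description above does not spell out the exact procedure) draws a uniform random sample, which matches the true distributions in expectation, and then verifies with a two-sample Kolmogorov-Smirnov test that the sample is consistent with the database in sources-per-image and limiting magnitude. The column arrays and the acceptance level are assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

def sample_bogus(n_sources, lim_mag, n_samples, seed=0):
    """Uniform random sample of candidate indices, checked for
    consistency with the full database's distributions of
    sources-per-image and limiting magnitude."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(n_sources), size=n_samples, replace=False)
    for name, col in (("n_sources", np.asarray(n_sources)),
                      ("lim_mag", np.asarray(lim_mag))):
        _, p = ks_2samp(col[idx], col)
        # Reject the rare draw that deviates from the database by chance.
        assert p > 0.01, f"sample inconsistent with database in {name}"
    return idx
```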
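
Step 5’s Web interface needs a ranked queue of likely mislabeled examples. One standard way to build such a queue, sketched here with scikit-learn (the criterion actually used is not specified above), is to score every example out-of-fold and surface those whose scores most strongly contradict their current labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def likely_mislabeled(X, y, n_review=100):
    """Rank examples for human review by how strongly an out-of-fold
    classifier disagrees with their label: high-scoring "bogus"
    examples are probable false negatives, low-scoring "real"
    examples probable false positives. y uses 1 = real, 0 = bogus."""
    y = np.asarray(y)
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    # Out-of-fold probabilities: no example is scored by a model that
    # saw its (possibly wrong) label during training.
    proba = cross_val_predict(clf, X, y, cv=5,
                              method="predict_proba")[:, 1]
    disagreement = np.where(y == 1, 1.0 - proba, proba)
    return np.argsort(disagreement)[::-1][:n_review]
```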
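
Step 6, sketched with scikit-learn: a 300-tree random forest wrapped in probability calibration, with the false positive and false negative rates estimated from out-of-fold predictions. The sigmoid calibration method and the 0.5 reporting threshold are assumptions:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

def train_real_bogus(X, y, threshold=0.5):
    """300-tree random forest whose vote fractions are mapped to
    calibrated probabilities; FPR/FNR measured by cross-validation."""
    forest = RandomForestClassifier(n_estimators=300, random_state=0)
    clf = CalibratedClassifierCV(forest, method="sigmoid", cv=5)
    # Out-of-fold scores give unbiased estimates of the error rates.
    proba = cross_val_predict(clf, X, y, cv=5,
                              method="predict_proba")[:, 1]
    pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    print(f"FPR = {fp / (fp + tn):.3f}, FNR = {fn / (fn + tp):.3f}")
    return clf.fit(X, y)
```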
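
Steps 7 and 8 read operating thresholds off the score distributions of the independent test sets; the quantiles below encode the same information that the step 7 histograms display. Scores are assumed to lie in [0, 1], with 1.0 meaning “real”:

```python
import numpy as np

def choose_threshold(real_scores, bogus_scores):
    """Step 8's rule on the independent test sets: prefer the
    threshold giving a 5% false negative rate on the reals, but only
    if it also keeps the false positive rate on the bogus set below
    1%; otherwise fall back to the 1% false positive threshold."""
    real_scores = np.asarray(real_scores)
    bogus_scores = np.asarray(bogus_scores)
    t_fnr = np.quantile(real_scores, 0.05)    # 5% of reals score lower
    t_fpr = np.quantile(bogus_scores, 0.99)   # 1% of bogus score higher
    fpr_at_t_fnr = np.mean(bogus_scores >= t_fnr)
    return t_fnr if fpr_at_t_fnr < 0.01 else t_fpr
```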
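
Finally, one minimal shape for step 9’s wrapper, assuming a DB-API connection with qmark placeholders (as in sqlite3) and placeholder table and column names:

```python
import numpy as np

class RealBogusScorer:
    """Thin wrapper: candidate IDs in, calibrated scores out. The
    "candidates" table and its columns are placeholders; the wrapper
    only assumes a DB-API connection and a trained classifier."""

    def __init__(self, clf, db_conn, feature_names):
        self.clf = clf
        self.db = db_conn
        # Feature order must match the order used at training time.
        self.features = list(feature_names)

    def score(self, candidate_ids):
        placeholders = ", ".join("?" * len(candidate_ids))
        query = (f"SELECT {', '.join(self.features)} "
                 f"FROM candidates WHERE id IN ({placeholders})")
        rows = self.db.execute(query, list(candidate_ids)).fetchall()
        X = np.asarray(rows, dtype=float)
        return self.clf.predict_proba(X)[:, 1]
```

In deployment, the wrapper would also need to preserve the correspondence between input IDs and returned rows, since SQL does not guarantee result order.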

This work was done by Umaa Rebbapragada, Gary B. Doran, and Brian D. Bue of Caltech for NASA’s Jet Propulsion Laboratory. This software is available for license through the Jet Propulsion Laboratory, and you may request a license at: https://download.jpl.nasa.gov/ops/request/request_introduction.cfm. NPO-50020