A computational method, SimLearn, has been devised to facilitate efficient knowledge discovery from simulators. Simulators are complex computer programs used in science and engineering to model diverse phenomena such as fluid flow, gravitational interactions, coupled mechanical systems, and nuclear, chemical, and biological processes. SimLearn uses active-learning techniques to efficiently address the "landscape characterization problem." In particular, SimLearn tries to determine which regions in "input space" lead to a given output from the simulator, where "input space" refers to an abstraction of all the variables going into the simulator, e.g., initial conditions, parameters, and interaction equations. Landscape characterization can be viewed as an attempt to invert the forward mapping of the simulator and recover the inputs that produce a particular output.

Given that a single simulation run can take days or weeks to complete even on a large computing cluster, SimLearn attempts to reduce costs by reducing the number of simulations needed to effect discoveries. Unlike conventional data-mining methods that are applied to static predefined datasets, SimLearn involves an iterative process in which a "most informative" dataset is constructed dynamically by using the simulator as an oracle. On each iteration, the algorithm models the knowledge it has gained through previous simulation trials and then chooses which simulation trials to run next. Running these trials through the simulator produces new data in the form of input-output pairs.

The overall process is embodied in an algorithm that combines support vector machines (SVMs) with active learning. SVMs use learning from examples (the examples are the input-output pairs generated by running the simulator) and a principle called maximum margin to derive predictors that generalize well to new inputs. In SimLearn, the SVM plays the role of modeling the knowledge that has been gained through previous simulation trials. Active learning is used to determine which new input points would be most informative if their output were known. The selected input points are run through the simulator to generate new information that can be used to refine the SVM. The process is then repeated. SimLearn carefully balances exploration (semi-randomly searching around the input space) versus exploitation (using the current state of knowledge to conduct a tightly focused search).

During each iteration, SimLearn uses not one, but an ensemble of SVMs. Each SVM in the ensemble is characterized by different hyperparameters that control various aspects of the learned predictor — for example, whether the predictor is constrained to be very smooth (nearby points in input space lead to similar output predictions) or whether the predictor is allowed to be "bumpy." The various SVMs will have different preferences about which input points they would like to run through the simulator next. SimLearn includes a formal mechanism for balancing the ensemble SVM preferences so that a single choice can be made for the next set of trials.

Initial tests with two real-world scientific simulators have shown that SimLearn is effective in reducing the number of trials needed to accurately identify the regions of input space leading to particular output behaviors. In the first application involving simulations of collisions between asteroids and the gravitational interactions between the resulting fragments, parameters of the two colliding asteroids that lead to binary pairs (gravitationally bound fragments in orbit around a common center of mass) were identified using only half the simulation trials needed to obtain equivalent knowledge from a grid-based sampling approach. In the second application involving simulations of the Earth's magnetosphere, there was a corresponding reduction by a factor of six in the number of simulation trials required.

This work was performed by Michael Burl, Dennis DeCoste, Dominic Mazzoni, and Lucas Scharenbroich of Caltech and Brian Enke and William Merline of the Southwest Research Institute for NASA's Jet Propulsion Laboratory. For more information, download the Technical Support Package (free white paper) at www.techbriefs.com/tsp under the Information Sciences category.

The software used in this innovation is available for commercial licensing. Please contact Karina Edmonds of the California Institute of Technology at (626) 395-2322. Refer to NPO-43399.