Block GP is a Gaussian Process regression framework for multimodal data that can be an order of magnitude more scalable than existing state-of-the-art nonlinear regression algorithms. The framework builds local Gaussian Processes on semantically meaningful partitions of the data and, with very high confidence, achieves higher prediction accuracy than a single global model. The method approximates the covariance matrix of the entire input space by smaller covariance matrices that can be modeled independently, so the local models can be fitted in parallel for faster execution.
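The covariance approximation described above can be illustrated with a minimal sketch. It is not the Block GP implementation: the squared-exponential kernel, the synthetic two-mode data, and the names `rbf_kernel`, `K_full`, and `K_block` are all assumptions chosen for illustration. It shows that when the modes are well separated, dropping the cross-mode covariance terms changes the matrix very little.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

rng = np.random.default_rng(0)

# Two well-separated modes of a 1-D input space (illustrative data only).
x1 = rng.normal(loc=0.0, scale=0.5, size=50)
x2 = rng.normal(loc=20.0, scale=0.5, size=50)
x = np.concatenate([x1, x2])

# Covariance matrix over the entire input space.
K_full = rbf_kernel(x, x)

# Block approximation: keep within-mode covariances, drop cross-mode terms.
K_block = np.zeros_like(K_full)
K_block[:50, :50] = rbf_kernel(x1, x1)
K_block[50:, 50:] = rbf_kernel(x2, x2)

# Largest covariance entry discarded by the approximation.
max_dropped = np.abs(K_full - K_block).max()
print("largest dropped entry:", max_dropped)
```

Because a dense solve scales roughly cubically with matrix size, replacing one n-by-n matrix with k independent blocks of about n/k points each reduces the work to roughly n^3/k^2, and the blocks can be processed separately.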
Regression problems on massive data sets are ubiquitous in many application domains, including the Internet and the Earth and space sciences. In many cases, regression algorithms such as linear regression or neural networks attempt to fit the target variable as a function of the input variables without regard to the underlying joint distribution of the variables. As a result, these global models are not sensitive to variations in the local structure of the input space. Several algorithms, including the mixture-of-experts model, classification and regression trees (CART), and others, have been developed, motivated by the fact that variability in the local distribution of inputs may reflect a significant change in the target variable.
While these methods can handle the non-stationarity in the relationships to varying degrees, they are often not scalable and, therefore, not used in large-scale data mining applications. The goal of this software is to identify such non-stationarity within a data set and build non-linear regression models for each of the unique modes found in very large data sets.
The algorithm can run GP in a distributed (parallel) architecture based on the modes/clusters identified in the data. It uses an entropy-based heuristic to model the inter-mode interactions in the data. For multimodal data, this method gives higher prediction accuracy than a single approximate model for the entire data set.
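The per-mode workflow described above can be sketched as follows, under several assumptions. The source does not say how modes are identified, so a plain 1-D k-means (`kmeans_1d`) stands in for that step. The entropy-based heuristic for inter-mode interactions is not specified in the source and is omitted here; each query is simply routed to the nearest mode. All function names, the kernel, the noise level, and the synthetic data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def rbf(a, b, length_scale=1.0):
    """Squared-exponential covariance between 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def kmeans_1d(x, k, iters=20):
    """Plain Lloyd's algorithm on 1-D inputs (stand-in mode finder)."""
    centers = np.quantile(x, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[labels == j].mean() if np.any(labels == j)
                            else centers[j] for j in range(k)])
    labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    return centers, labels

def fit_local_gp(x, y, noise=1e-3):
    """Fit one zero-mean GP to the centered targets of a single mode."""
    mean = y.mean()
    K = rbf(x, x) + noise * np.eye(len(x))
    alpha = np.linalg.solve(K, y - mean)
    return x, alpha, mean

def predict(models, centers, x_new):
    """Route each query to the GP of its nearest mode center."""
    owner = np.argmin(np.abs(x_new[:, None] - centers[None, :]), axis=1)
    out = np.empty(len(x_new))
    for j, (x_tr, alpha, mean) in enumerate(models):
        mask = owner == j
        out[mask] = mean + rbf(x_new[mask], x_tr) @ alpha
    return out

# Synthetic two-mode data: each mode follows a different relationship.
rng = np.random.default_rng(0)
xa = rng.uniform(-2.0, 2.0, 60)
xb = rng.uniform(18.0, 22.0, 60)
x = np.concatenate([xa, xb])
y = np.concatenate([np.sin(xa), 10.0 + np.cos(xb)])

k = 2
centers, labels = kmeans_1d(x, k)

# Each local model depends only on its own mode, so the fits can run concurrently.
with ThreadPoolExecutor() as pool:
    models = list(pool.map(
        lambda j: fit_local_gp(x[labels == j], y[labels == j]), range(k)))

x_test = np.array([-1.0, 0.5, 19.0, 21.0])
y_true = np.array([np.sin(-1.0), np.sin(0.5),
                   10.0 + np.cos(19.0), 10.0 + np.cos(21.0)])
max_err = np.abs(predict(models, centers, x_test) - y_true).max()
print("max abs error on held-out points:", max_err)
```

A thread pool is used only to keep the sketch short; on a real distributed architecture each mode's fit would more plausibly run as a separate process or on a separate node.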
Prior to the development of this software, existing software could only run centrally on a large data set, and its predictions for non-stationary data were less accurate than those of this method. The software has been tested on synthetic benchmark data sets and on real satellite data for the state of California. The algorithm can predict missing (test) sensor values with very high accuracy.
This work was done by Ashok Srivastava of Ames Research Center, and Kamalika Das and Bryan Matthews of SGT, Inc. This software is available for use. To request a copy, please visit https://software.nasa.gov/software/ARC-16864-1 .