Generic, Extensible, Configurable Push-Pull Framework for Large-Scale Science Missions
- Created: Friday, 01 July 2011
This framework has also been evaluated for data dissemination in support of the National Cancer Institute’s Early Detection Research Network.
The push-pull framework was developed to provide an infrastructure that can connect to any given remote site and, given a set of restrictions, download the files from that site that satisfy those restrictions.
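The idea of restriction-driven downloading can be illustrated with a minimal Java sketch. It is not the framework's actual API: the names RemoteFile, DownloadRestrictions, and select are hypothetical, and the two restrictions shown (a path pattern and a size ceiling) are assumptions chosen only to make the example concrete.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical description of one file listed on a remote site. */
final class RemoteFile {
    final String path;
    final long sizeBytes;

    RemoteFile(String path, long sizeBytes) {
        this.path = path;
        this.sizeBytes = sizeBytes;
    }
}

/** Hypothetical restriction set: a path pattern plus a size ceiling (both assumed). */
final class DownloadRestrictions {
    private final Pattern pathPattern;
    private final long maxSizeBytes;

    DownloadRestrictions(String pathRegex, long maxSizeBytes) {
        this.pathPattern = Pattern.compile(pathRegex);
        this.maxSizeBytes = maxSizeBytes;
    }

    /** A file is accepted only if its whole path matches and it is small enough. */
    boolean accepts(RemoteFile file) {
        return pathPattern.matcher(file.path).matches()
                && file.sizeBytes <= maxSizeBytes;
    }

    /** Returns the files from a remote listing that satisfy the restrictions, in listing order. */
    List<RemoteFile> select(List<RemoteFile> listing) {
        List<RemoteFile> selected = new ArrayList<>();
        for (RemoteFile file : listing) {
            if (accepts(file)) {
                selected.add(file);
            }
        }
        return selected;
    }
}

class RestrictionDemo {
    public static void main(String[] args) {
        // A made-up remote directory listing; no network access is performed.
        List<RemoteFile> listing = Arrays.asList(
                new RemoteFile("/data/2011/granule_001.h5", 1_000L),
                new RemoteFile("/data/2011/README.txt", 10L),
                new RemoteFile("/data/2011/granule_002.h5", 9_000_000L));

        DownloadRestrictions restrictions =
                new DownloadRestrictions(".*\\.h5", 1_000_000L);

        for (RemoteFile file : restrictions.select(listing)) {
            System.out.println(file.path);
        }
    }
}
```

In this sketch only granule_001.h5 is selected: the README fails the path pattern and granule_002.h5 exceeds the size ceiling. Keeping the restrictions in a separate object from the listing is one way to let the same selection logic be reused across remote sites.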
The canonical services of the Cataloging and Archiving Service (CAS), including file management, workflow management, and resource management, have recently been re-architected and refactored. Additionally, a generic CAS Crawling Framework was built, inspired by Nutch, Apache’s open-source search engine project. Nutch provides search engine services (akin to Google), including crawling, parsing, content analysis, and indexing. It has produced several stable software releases and is currently used in production services at companies such as Yahoo and at NASA’s Planetary Data System.
The CAS Crawling Framework supports many of the Nutch crawler’s generic services, including metadata extraction, crawling, and ingestion. However, one service that was not ported over from Nutch is its generic protocol layer, which lets the Nutch crawler obtain content through protocol plug-ins that implement remote protocols such as HTTP, HTTPS, FTP, and the WinNT file system. Such a layer would greatly aid the CAS Crawling Framework, because it would allow the framework to obtain content (i.e., data products) from remote sites generically, using FTP and other protocols. Augmented with this capability, the Orbiting Carbon Observatory (OCO) and the NPP (NPOESS Preparatory Project) Sounder PEATE (Product Evaluation and Analysis Tools Elements) would have an infrastructure that supports generic FTP-based pull access to remote data products, obviating the need for any specialized software outside their existing process control systems.
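A protocol plug-in layer of the kind described above can be sketched in Java as a registry that selects an implementation by URI scheme. This is an illustration under stated assumptions, not the Nutch or CAS interface: RetrievalProtocol, ProtocolRegistry, and StubProtocol are hypothetical names, and the stub returns canned bytes rather than opening a real HTTP or FTP connection.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical plug-in contract: one implementation per remote protocol. */
interface RetrievalProtocol {
    byte[] fetch(URI location) throws IOException;
}

/** Hypothetical registry that picks a plug-in from the URI scheme. */
final class ProtocolRegistry {
    private final Map<String, RetrievalProtocol> plugins = new HashMap<>();

    /** Schemes are stored in lower case so lookups are case-insensitive. */
    void register(String scheme, RetrievalProtocol plugin) {
        plugins.put(scheme.toLowerCase(Locale.ROOT), plugin);
    }

    /** Delegates to the plug-in registered for the URI's scheme. */
    byte[] fetch(URI location) throws IOException {
        String scheme = location.getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("URI has no scheme: " + location);
        }
        RetrievalProtocol plugin = plugins.get(scheme.toLowerCase(Locale.ROOT));
        if (plugin == null) {
            throw new IllegalArgumentException(
                    "No plug-in registered for scheme: " + scheme);
        }
        return plugin.fetch(location);
    }
}

/** Stand-in plug-in (assumption): returns canned bytes instead of connecting. */
final class StubProtocol implements RetrievalProtocol {
    private final String label;

    StubProtocol(String label) {
        this.label = label;
    }

    @Override
    public byte[] fetch(URI location) {
        return (label + ":" + location.getPath()).getBytes(StandardCharsets.UTF_8);
    }
}

class ProtocolDemo {
    public static void main(String[] args) throws IOException {
        ProtocolRegistry registry = new ProtocolRegistry();
        registry.register("ftp", new StubProtocol("ftp"));
        registry.register("http", new StubProtocol("http"));

        // example.org is a placeholder host; the stub never contacts it.
        byte[] data = registry.fetch(URI.create("ftp://example.org/pub/product.dat"));
        System.out.println(new String(data, StandardCharsets.UTF_8));
    }
}
```

With this arrangement the crawler code calls only the registry, so supporting another protocol means registering one more plug-in rather than changing the crawler itself.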
This extensible, configurable framework was written in Java and allows the use of different underlying communication middleware (at present, both XML-RPC and RMI). In addition, the framework is well suited to a multi-mission environment and currently supports both the NPP Sounder PEATE and the OCO mission. Both systems involve high-throughput job processing, terabyte-scale data management, and science computing facilities. The NPP Sounder PEATE is already using the push-pull framework to accept hundreds of gigabytes of IASI (Infrared Atmospheric Sounding Interferometer) data, and is preparing to accept CrIMSS (Cross-track Infrared and Microwave Sounding Suite) data. OCO will leverage the framework to download MODIS, CloudSat, and other ancillary data products for use in the high-performance Level 2 Science Algorithm.
The National Cancer Institute is also evaluating the framework for use in sharing and disseminating cancer research data through its Early Detection Research Network (EDRN).