Data science is emerging as a critical area of research and technology to advance scientific discovery, knowledge, and decision-making through systematic computational approaches to analyzing massive data sets. The sheer growth in data volume, coupled with the highly distributed and heterogeneous nature of scientific data sets, demands new approaches to how data are ultimately managed and analyzed, which in turn requires evaluating the scalability and distribution of complex software architectures. DAWN (Distributed Analytics, Workflows and Numeric) is a model for simulating the execution of data processing workflows on arbitrary data system architectures. DAWN was developed to provide NASA and the scientific community at large with an additional tool to prepare for the upcoming Big Data deluge in science.
The effective and timely processing of these enormous data streams will require the design of new system architectures, new data reduction algorithms, and evaluation of the tradeoffs between data reduction and the uncertainty of results. Yet, at present, no software tool allows simulation of complex data processing workflows to determine the computational, network, and storage resources needed to prepare data for scientific analysis. DAWN was developed specifically to fill this gap.
Possible applications of the software include selecting the best system architecture given a fixed set of resources (i.e., a collection of servers and the network connecting them), identifying the resources (servers and network) needed to process a given data volume or stream within a target time, and evaluating tradeoffs between processing reduced data volumes and the consequent increase in the uncertainty of results.
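As a rough illustration of the kind of estimate such an analysis rests on (a minimal Python sketch under simplified assumptions, not DAWN's actual model; all names and figures below are hypothetical), the time to move and process a data volume on a single server can be approximated as transfer time plus compute time, and the same relation can be inverted to size the server throughput needed to meet a target time:

def processing_time(volume_gb, network_gb_per_s, server_gb_per_s):
    """Transfer time plus compute time (seconds) for one data volume on one server."""
    transfer = volume_gb / network_gb_per_s   # time spent moving data over the network
    compute = volume_gb / server_gb_per_s     # time spent processing data on the server
    return transfer + compute

def required_server_throughput(volume_gb, network_gb_per_s, target_s):
    """Server throughput (GB/s) needed to finish the volume within target_s seconds."""
    compute_budget = target_s - volume_gb / network_gb_per_s
    if compute_budget <= 0:
        raise ValueError("data transfer alone already exceeds the target time")
    return volume_gb / compute_budget

# Example: 10 TB over a 1 GB/s link, processed at 0.5 GB/s -> about 8.3 hours.
print(processing_time(10_000, 1.0, 0.5) / 3600)
# Server throughput (GB/s) needed to finish the same 10 TB within 4 hours.
print(required_server_throughput(10_000, 1.0, 4 * 3600))

In a real architecture the same volume may traverse several network links and be split across many servers; DAWN's purpose is to simulate those more complex topologies and workflows rather than a single server.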
The DAWN modeling software is discipline-agnostic: it can be applied to any data processing use case and, in particular, to any scientific data processing workflow. Execution of the software is fast, since DAWN simulates data processing rather than executing it. The software is designed so that users can extend the base framework to provide custom implementations for data processing, data transfer, resource management, and so on.
DAWN execution is driven by use cases specified in XML format, conforming to a schema developed to describe system architectures, including topologies and workflows. A use case can be tuned with custom values for several system parameters, such as the processing power of each server or the speed of the network connecting two servers, and can be run multiple times by varying those parameters or by randomly sampling their values to simulate the variability of physical system components (such as concurrent network usage).
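Building on the single-run estimate sketched above, repeated runs with randomly sampled parameters amount to a simple Monte Carlo loop (again a hypothetical Python sketch, not DAWN's XML schema or internal code); here the usable network speed is sampled on each run to mimic contention from concurrent usage:

import random
import statistics

def simulated_workflow_time(volume_gb, network_gb_per_s, server_gb_per_s):
    # Simulated, not executed: data transfer followed by processing.
    return volume_gb / network_gb_per_s + volume_gb / server_gb_per_s

def monte_carlo(volume_gb, nominal_network_gb_per_s, server_gb_per_s, runs=1000):
    times = []
    for _ in range(runs):
        # Concurrent usage leaves only part of the nominal bandwidth available.
        usable = nominal_network_gb_per_s * random.uniform(0.4, 1.0)
        times.append(simulated_workflow_time(volume_gb, usable, server_gb_per_s))
    return statistics.mean(times), statistics.stdev(times)

mean_s, stdev_s = monte_carlo(10_000, nominal_network_gb_per_s=1.0, server_gb_per_s=0.5)
print(f"mean {mean_s / 3600:.1f} h, spread {stdev_s / 3600:.1f} h")

The output of such repeated runs is a distribution of predicted workflow times rather than a single number, which is what supports the tradeoff analyses described above.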
Because DAWN does not make any assumption on the specific algorithms and workflows that are run as part of the simulation, it can be applied to the analysis of system architecture in any scientific discipline, including processing of data collected by past, present, and future NASA remote observing instruments.
This work was done by Luca Cinquini, Daniel J. Crichton, Richard J. Doyle, Amy J. Braverman, Michael J. Turmon, Thomas Fuchs, Kyo Lee, and Ashish Mahabal of Caltech for NASA’s Jet Propulsion Laboratory. This software is available for commercial licensing. Please contact Dan Broderick at
This Brief includes a Technical Support Package (TSP).

DAWN: a Simulation Model for Evaluating Costs and Tradeoffs of Big Data Science Architectures
(reference NPO49791) is currently available for download from the TSP library.
Overview
The document presents insights from the ExArch Meeting held in October 2012, focusing on the development of DAWN, a simulation model designed to evaluate the costs and tradeoffs associated with big data science architectures. As scientific research faces an impending surge in data volume—expected to increase by 10 to 100 times—DAWN aims to address the challenges of processing and analyzing vast datasets efficiently.
Key applications of DAWN include selecting optimal system architectures based on fixed resources, identifying necessary resources for processing specific data volumes within target times, and evaluating tradeoffs between data reduction and uncertainty in results. The model is discipline-agnostic, meaning it can be applied across various scientific fields, and it simulates data processing without executing it, allowing for rapid assessments of different scenarios.
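One simple way to picture the data-reduction/uncertainty tradeoff (a generic statistical illustration, not DAWN's cost model) is that subsampling a fraction of the records shrinks the volume to be processed, while the standard error of an estimate computed from the retained records grows roughly as one over the square root of that fraction:

import math

def standard_error_growth(fraction_kept):
    # Factor by which the standard error of an estimated mean grows when only
    # this fraction of (independent) records is processed.
    return 1.0 / math.sqrt(fraction_kept)

for fraction in (1.0, 0.5, 0.1, 0.01):
    print(f"keep {fraction:5.0%} of the data -> standard error grows x{standard_error_growth(fraction):.1f}")

Halving the data volume, for example, increases that uncertainty by roughly 40 percent, while keeping only one percent increases it tenfold; a simulation tool allows these statistical costs to be weighed against the computational, storage, and network resources saved.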
The document highlights the need for new system architectures and data reduction algorithms to manage the anticipated data deluge from projects like the Coupled Model Intercomparison Project, Phase 6 (CMIP6) and the Square Kilometre Array (SKA). Currently, no software tool exists that can simulate complex data processing workflows to determine the computational, network, and storage resources required for scientific analysis.
In the context of medical science, the document discusses the identification of cancer biomarkers through data-intensive processing pipelines as part of the Early Detection Research Network (EDRN) project. The Clinical Proteomics Tumor Analysis Consortium (CPTAC) data processing pipeline involves a sequence of 13 tasks executed on both Linux and Windows servers and takes over 50 days when run sequentially. Running the pipeline on a system of servers with tasks executing in parallel yields significant performance gains, with a reported factor-of-18 improvement.
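At the reported factor of 18, the 50-plus-day sequential run drops to roughly three days. The scheduling comparison behind such a figure can be sketched as follows (a hypothetical Python example with made-up task durations and dependencies, not the actual CPTAC pipeline): the sequential makespan is the sum of all task durations, while the parallel makespan with enough servers is bounded below by the longest dependency chain.

# Hypothetical example workflow: duration in days and prerequisite tasks.
tasks = {
    "A": (4.0, []),
    "B": (6.0, ["A"]),
    "C": (6.0, ["A"]),
    "D": (2.0, ["B", "C"]),
}

def sequential_makespan(tasks):
    # One server, one task at a time.
    return sum(duration for duration, _ in tasks.values())

def parallel_makespan(tasks):
    # Enough servers that every task starts as soon as its prerequisites finish;
    # the result is the length of the critical path through the workflow.
    finish = {}
    def finish_time(name):
        if name not in finish:
            duration, deps = tasks[name]
            finish[name] = duration + max((finish_time(d) for d in deps), default=0.0)
        return finish[name]
    return max(finish_time(name) for name in tasks)

seq = sequential_makespan(tasks)   # 18 days
par = parallel_makespan(tasks)     # 12 days (critical path A -> B -> D)
print(f"sequential {seq} days, parallel {par} days, speedup {seq / par:.1f}x")

DAWN's simulations generalize this kind of comparison to a finite set of servers with specified processing power and the network links between them.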
The conclusions drawn from the DAWN analysis led to the establishment of an internal cloud infrastructure, which includes multiple servers optimized for specific tasks. This setup resulted in a substantial reduction in processing time, demonstrating the effectiveness of leveraging powerful hardware and parallel processing.
Future work aims to enhance DAWN's capabilities by enabling the simulation of sub-workflows, dynamic resource allocation, and the development of user-friendly interfaces. The goal is to make DAWN an open-source tool available on platforms like GitHub, facilitating its use in the big data era of scientific research.

