Data science is emerging as a critical area of research and technology to advance scientific discovery, knowledge, and decision-making through systematic computational approaches to analyzing massive data sets. The sheer increase in data volume, coupled with the highly distributed and heterogeneous nature of scientific data sets, demands new approaches to how data are ultimately managed and analyzed. This in turn requires evaluating the scalability and distribution of complex software architectures. DAWN (Distributed Analytics, Workflows and Numeric) is a model for simulating the execution of data processing workflows on arbitrary data system architectures. DAWN was developed to provide NASA and the scientific community at large with an additional tool to prepare for the upcoming Big Data deluge in science.
The effective and timely processing of these enormous data streams will require the design of new system architectures and data reduction algorithms, and the evaluation of tradeoffs between data reduction and the uncertainty of results. Yet, at present, no software tool allows simulation of complex data processing workflows to determine the computational, network, and storage resources needed to prepare data for scientific analysis. DAWN was developed expressly to fill this gap.
Possible applications of the software include selecting the best system architecture given a fixed set of resources (i.e., a collection of servers and the network connecting them), identifying the resources (servers and network) needed to process a given data volume or stream within a target time, and evaluating tradeoffs between processing reduced data volumes and the consequent increase in the uncertainty of results.
The DAWN modeling software is discipline-agnostic: it can be applied to any data processing use case and, in particular, to any scientific data processing workflow. Execution of the software is fast, since DAWN simulates data processing rather than executing it. The software is designed so that users can extend the base framework with custom implementations for data processing, data transfer, resource management, and so on.
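To illustrate the idea of an extensible simulation framework of this kind, the sketch below models end-to-end time as transfer time plus processing time, with base classes a user could subclass. All class names, method names, and cost parameters here are hypothetical illustrations, not DAWN's actual API.

```python
class Server:
    """A compute node with a nominal processing power (hypothetical unit: GFLOP/s)."""
    def __init__(self, name, gflops):
        self.name = name
        self.gflops = gflops

class ProcessingModel:
    """Base class: estimate wall-clock time to process data_gb on a server.
    Users would override time() to model their own algorithm's cost."""
    def time(self, server, data_gb):
        gflop_per_gb = 50.0  # assumed fixed computational cost per GB of input
        return data_gb * gflop_per_gb / server.gflops

class NetworkModel:
    """Base class: estimate transfer time over a link of a given speed."""
    def __init__(self, gbps):
        self.gbps = gbps
    def time(self, data_gb):
        return data_gb * 8 / self.gbps  # convert GB to Gb, divide by link speed

def simulate(data_gb, dst, net, proc):
    """Simulate moving data to dst over net, then processing it there."""
    return net.time(data_gb) + proc.time(dst, data_gb)

compute = Server("compute", gflops=500.0)
t = simulate(1000.0, compute, NetworkModel(gbps=10.0), ProcessingModel())
print(f"simulated end-to-end time: {t:.1f} s")  # → simulated end-to-end time: 900.0 s
```

Because the simulation only evaluates these analytic cost models rather than running real code, sweeping many architectures or data volumes completes in seconds.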
DAWN execution is driven by use cases specified in XML format, conforming to a schema developed to describe system architectures, including their topologies and workflows. A simulation can be tuned with custom values for several system parameters, such as the processing power of each server or the speed of the network connecting two servers. It can be run multiple times by varying the system parameters, or by randomly sampling their values to simulate the variability of physical system components (such as concurrent network usage).
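The random-sampling mode described above can be sketched as a simple Monte Carlo loop: re-run the same simulation many times while drawing a system parameter (here, effective link speed under contention) from a distribution, then summarize the spread of outcomes. The transfer-time formula, sampling range, and variable names are illustrative assumptions, not values taken from DAWN.

```python
import random
import statistics

def transfer_time(data_gb, link_gbps):
    """Time to move data_gb over a link of the sampled effective speed."""
    return data_gb * 8 / link_gbps  # GB -> Gb, divided by Gb/s

random.seed(42)  # fixed seed so repeated runs are reproducible
samples = []
for _ in range(1000):
    # assumed model: a nominal 10 Gb/s link delivers 4-10 Gb/s under
    # concurrent usage; each run samples one effective bandwidth
    link_gbps = random.uniform(4.0, 10.0)
    samples.append(transfer_time(1000.0, link_gbps))

mean_t = statistics.mean(samples)
spread = statistics.pstdev(samples)
print(f"mean transfer time {mean_t:.0f} s, stdev {spread:.0f} s")
```

The resulting distribution, rather than a single number, is what lets a user reason about how much headroom an architecture needs to meet a target processing time despite variable network conditions.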
Because DAWN makes no assumptions about the specific algorithms and workflows run as part of a simulation, it can be applied to the analysis of system architectures in any scientific discipline, including the processing of data collected by past, present, and future NASA remote observing instruments.