The objective of this invention is to transform applications so that they process big data in arbitrary formats using the MapReduce parallel programming model, executing on the open-source Hadoop platform. Many legacy data-intensive applications do not scale beyond a single server. Some can load-balance across multiple cores, but their scalability is limited by intensive I/O. Hadoop/MapReduce is a massively scalable programming paradigm modeled after Google's implementation. Many use cases show that Hadoop/MapReduce can scale linearly to hundreds or even thousands of servers for big data analysis. The problem is how to transparently transform legacy serial data-processing code to take full advantage of the Hadoop/MapReduce parallel processing framework.
Hadoop/MapReduce was conceived for Web data analysis, where most data are in text format. Therefore, Hadoop comes with strong (compressed) text data processing capabilities. However, most scientific applications handle data in binary formats. Previous efforts at porting an application to Hadoop have taken two basic approaches: dump binary data to text format and use Hadoop's built-in TextInputFormat to process the data, or develop a customized InputFormat, one per binary data type. The first approach is neither time- nor space-efficient, because a text dump is much larger than the corresponding binary dataset, and it breaks the structured information encoded in a binary dataset, requiring the entire processing code to be rewritten to parse text data. The second approach is efficient for one given data format, but not generic, considering that there are hundreds (if not thousands) of domain-specific binary data formats. It requires development of a customized Hadoop InputFormat (in Java), one for each format, which can be a daunting task for non-Hadoop developers.
This innovation includes a suite of data-encoding and special-handling techniques that allows users to port applications processing data in arbitrary (binary) formats to Hadoop/MapReduce, while best utilizing the underlying scalable distributed computing mechanism with minimal code change. The key is to use a Hadoop SequenceFile to encode a binary dataset, where the filename is the key and the file content (byte array) is the value, so that data processing fits into the MapReduce programming model.
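The encoding step can be sketched as follows with the standard Hadoop SequenceFile API. The class name, command-line arguments, and directory layout are illustrative, not part of the disclosure; the essential point is the one-record-per-file pairing of filename key and raw-byte value.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/**
 * Sketch: pack a directory of binary files into one Hadoop SequenceFile,
 * keyed by filename, with the raw file bytes as the value.
 * Usage: BinaryToSequenceFile <local-input-dir> <output-sequencefile>
 */
public class BinaryToSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path out = new Path(args[1]);   // e.g. an HDFS or local path
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class));
            // One record per input file: key = filename, value = content.
            try (DirectoryStream<java.nio.file.Path> dir =
                         Files.newDirectoryStream(Paths.get(args[0]))) {
                for (java.nio.file.Path f : dir) {
                    byte[] bytes = Files.readAllBytes(f);
                    writer.append(new Text(f.getFileName().toString()),
                                  new BytesWritable(bytes));
                }
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```

A map task then receives each (filename, byte array) record whole, so the legacy per-file processing routine can run inside the mapper unchanged.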
If the data files are much smaller than the HDFS block size (e.g., 64 MB), the small files should be tarred into one file first and then converted to a Hadoop SequenceFile. This reduces memory consumption at the NameNode for managing a large number of small files. The number of files in a tar file should be controlled such that the converted SequenceFile is no larger than the default HDFS block size. This maximizes performance by avoiding remote data fetches and by batch-loading as much data as possible from the local node.
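The batching rule above can be sketched as a simple greedy grouping of file sizes, closing each batch (one tar file) before it would exceed the block-size limit. The class name, method name, and 64 MB default are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch: group small files into consecutive batches (one tar each) so
 * that no batch exceeds the HDFS block size. Greedy first-fit in input
 * order; a production version would also account for tar/SequenceFile
 * header overhead.
 */
public class SmallFileBatcher {
    // Default HDFS block size used as the batch limit (illustrative).
    static final long BLOCK_SIZE = 64L * 1024 * 1024;

    /** Split fileSizes into consecutive batches whose totals stay <= limit. */
    public static List<List<Long>> batch(List<Long> fileSizes, long limit) {
        List<List<Long>> batches = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long total = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && total + size > limit) {
                batches.add(current);          // close the full batch
                current = new ArrayList<>();
                total = 0;
            }
            current.add(size);
            total += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);              // flush the last batch
        }
        return batches;
    }
}
```

Each resulting batch is tarred and converted to one SequenceFile, so every SequenceFile fits in a single HDFS block and is processed by a map task reading from the local node.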