Customers who use Amazon EMR often process data stored in Amazon S3. We sometimes need to move large quantities of data between buckets or across regions, and in such cases the datasets are too large for a simple copy operation.
Hadoop is optimized for reading a small number of large files rather than many small files, whether from S3 or HDFS. In the Hadoop ecosystem, DistCp is often used to move data; it provides a distributed copy capability built on top of the MapReduce framework.
Amazon EMR offers a utility named S3DistCp, which helps move data from S3 to other S3 locations or to on-cluster HDFS. S3DistCp can also aggregate small files into fewer large files of a size we choose, which optimizes analysis for both performance and cost.
S3DistCp is faster than DistCp
S3DistCp is an extension of DistCp with optimizations to work with AWS, particularly Amazon S3. Like DistCp, S3DistCp copies data using distributed map-reduce jobs. S3DistCp runs mappers to compile a list of files to copy to the destination; once the mappers finish compiling that list, the reducers perform the actual data copy. The main optimization S3DistCp provides over DistCp is that each reducer runs multiple HTTP upload threads to upload files in parallel.
Problem Statement
In our use case, we needed to copy a large number of small files ranging from 200 to 600 KB, and processing these roughly 900,000 (9 lakh) files took more than two hours. Our EMR setup was a 6-node cluster with 1 master and 5 core nodes to parallelize the processing of the data. We tried splitting the data into multiple lists of input paths and processing them in parallel, but the small-file issue remained and the job still took a long time.
Aggregation with file-based pattern
To overcome the problem of a large number of small files, we found a solution: aggregate these small files into fewer larger files based on their timestamp using S3DistCp.
We used the same cluster for this process, submitting the S3DistCp job through the EMR step feature.
Below is a sample of the many small files for a single day present in an S3 bucket.
So, we launched an EMR cluster and used Add Step to leverage S3DistCp; the JAR location is the built-in command-runner.jar. The S3DistCp command takes a source path, a destination path, the target size of the aggregated files in the destination path, and the aggregation regex in the groupBy option. The recommended approach for the target file size is to keep files larger than the default block size, which is 128 MB on EMR.
Below is a sample of the command format:
s3-dist-cp --src <source_path> --dest <destination_path> --targetSize <target_file_size> --groupBy <REGEX>
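As an illustration, here is a minimal sketch of how the command might look for a daily, timestamp-based aggregation. The bucket names, prefixes, and the groupBy regex below are hypothetical placeholders, not the exact values from our environment; files whose names match the same capture group (the date in this example) are concatenated into output files of roughly the target size, which is specified in MiB.

s3-dist-cp --src s3://source-bucket/logs/ --dest s3://dest-bucket/aggregated/ --targetSize 256 --groupBy '.*/(2020-05-\d{2})-.*\.log'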
Below is a snapshot of an example run in which the EMR step completed successfully.
Upon completion of the S3DistCp run, the snapshot below shows that for each day the destination path contained a single large file aggregated from the input.
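A quick way to verify the aggregation yourself is to list the destination prefix and check the resulting file sizes; the path below is a hypothetical placeholder.

aws s3 ls s3://dest-bucket/aggregated/ --recursive --human-readable --summarize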
The same approach can also be achieved using the AWS CLI EMR add-steps command.
aws emr add-steps --cluster-id <id> --steps 'Name=<name>,Jar=command-runner.jar,ActionOnFailure=<action>,Type=CUSTOM_JAR,Args=[s3-dist-cp,--src,<source_path>,--dest,<destination_path>,--targetSize,<target_file_size>,--groupBy,<REGEX>]'
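For reference, a filled-in sketch of that command might look like the following; the cluster ID, step name, bucket names, and regex are all hypothetical placeholders and should be replaced with your own values.

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps 'Name=S3DistCpAggregation,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Type=CUSTOM_JAR,Args=[s3-dist-cp,--src,s3://source-bucket/logs/,--dest,s3://dest-bucket/aggregated/,--targetSize,256,--groupBy,.*/(2020-05-\d{2})-.*\.log]'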
S3DistCp is a handy tool for optimizing raw files of different sizes and selectively copying files between locations.
If you have any questions or suggestions, please reach out to us at contactus@1cloudhub.com
Written by: Dhivakar Sathya Sreekar Ippili & Umashankar N