Case Study on Storage Characteristics of Hadoop

Daisy F. Sang, Richard Yang, and Chang-Shyh Peng

Keywords

Parallel I/O, parallel computing, Hadoop, MapReduce

Abstract

To process astronomical amounts of data, Google developed MapReduce as its in-house solution. Outside of Google, Hadoop, an open-source implementation of MapReduce, is the best-known implementation available to the public. This paper studies a Hadoop cluster under varying parameter values. Experiments focus on how execution time, throughput, and framework administration time are affected by input data size, HDFS block size, and the number of concurrent Map and Reduce tasks. Our studies show that, in a single-node cluster, (a) Hadoop executes Map-only tasks efficiently regardless of input data size, (b) the HDFS default block size of 64MB yields decent performance, while block sizes of 128MB or 256MB could be optimal given higher storage I/O bandwidth such as in RAID, and (c) the highest throughput occurs when the framework runs two concurrent Map tasks and two concurrent Reduce tasks.
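For reference, the parameters studied above correspond to standard Hadoop 1.x configuration properties. The fragments below are a minimal sketch, assuming a Hadoop 1.x-era cluster (the generation where the 64MB default block size applied); the specific values shown (a 128MB block size and two Map plus two Reduce slots per node) simply mirror the settings the abstract discusses, not values prescribed by the paper.

    <!-- hdfs-site.xml: HDFS block size (Hadoop 1.x property name) -->
    <property>
      <name>dfs.block.size</name>
      <value>134217728</value>  <!-- 128MB, vs. the 64MB default of 67108864 -->
    </property>

    <!-- mapred-site.xml: maximum concurrent Map and Reduce tasks per TaskTracker -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>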
