This section describes the various components of Hadoop: the stages of the MapReduce job process, the handling of data, and the architecture of the file system.
An entire Hadoop execution of a client request is called a job. Users submit job requests to the Hadoop framework, and the framework processes them. Before the framework can process a job, the user must specify the job's configuration, its inputs, and its map and reduce functions, as described below.
Four entities are involved in the processing of a Hadoop job: the client that submits the job, the JobTracker that coordinates the job run, the TaskTrackers that execute the individual map and reduce tasks, and the distributed file system (usually HDFS) that stores the job's files.
The user specifies the job configuration by setting parameters specific to the job, including the number of reduce tasks, the map and reduce functions, the input format, and the input locations. The Hadoop framework uses this information to split the input into several pieces. Each input piece is fed to a user-defined map function; the map tasks process the input data and emit intermediate key-value pairs. The output of the map phase is sorted, and a default or custom partitioning may be applied to the intermediate data. The reduce function then processes the data in each partition, merging the intermediate values or applying a user-specified function. The user must also specify the types of the output key and output value of the map and reduce functions. The Hadoop framework collects the output of the reduce function into output files on disk.
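As a rough illustration of this flow, the map, partition-and-sort, and reduce phases of a word-count job can be simulated in plain Python. This is a conceptual sketch, not the actual Hadoop API (which is Java-based); the function names `map_fn`, `reduce_fn`, and `run_job` are illustrative.

```python
from collections import defaultdict

# User-defined map function: emits an intermediate (word, 1) pair
# for every word in an input split.
def map_fn(split):
    for word in split.split():
        yield (word, 1)

# User-defined reduce function: merges all intermediate values for one key.
def reduce_fn(key, values):
    return (key, sum(values))

# Default-style partitioner: hash the key into one of num_reducers partitions.
def partition(key, num_reducers):
    return hash(key) % num_reducers

def run_job(input_splits, num_reducers=2):
    # Map phase: each input split is fed to the map function, and each
    # emitted pair is routed to a partition and grouped by key.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in input_splits:
        for key, value in map_fn(split):
            partitions[partition(key, num_reducers)][key].append(value)

    # Reduce phase: each partition is processed with keys in sorted order,
    # mirroring the sort that precedes the reduce step.
    output = []
    for part in partitions:
        for key in sorted(part):
            output.append(reduce_fn(key, part[key]))
    return output

print(run_job(["the quick brown fox", "the lazy dog"]))
```

Note that in real Hadoop each partition would be handled by a separate reduce task, typically on a different machine; here all phases run in one process purely to show the data flow.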
Hadoop can work directly with any mountable distributed file system, but the most common file system used by Hadoop is the Hadoop Distributed File System (HDFS). HDFS is a fault-tolerant distributed file system designed to run on commodity hardware. Its high-throughput access to application data makes it well suited for large data sets.
HDFS has the following features: it detects and recovers from hardware failure, provides streaming access to file data, supports very large files, follows a simple write-once, read-many coherency model, moves computation to the data rather than data to the computation, and is portable across heterogeneous hardware and operating systems. Hadoop itself is designed to run on clusters of machines.
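The layout that underlies these features can be sketched conceptually: HDFS splits each file into fixed-size blocks and replicates every block on several nodes, so the loss of one machine does not lose data. The sketch below is not the HDFS API; the tiny block size, the node names, and the round-robin placement are illustrative assumptions (real HDFS uses large blocks, 128 MB by default in recent versions, a default replication factor of 3, and rack-aware placement).

```python
REPLICATION = 3   # HDFS defaults to 3 replicas per block
BLOCK_SIZE = 4    # bytes; tiny for illustration (real HDFS uses e.g. 128 MB)
NODES = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster

def store_file(data):
    """Return a NameNode-style map: block index -> (block bytes, replica nodes)."""
    layout = {}
    # Split the file into fixed-size blocks.
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for i, block in enumerate(blocks):
        # Place replicas round-robin on distinct nodes; real HDFS placement
        # is rack-aware rather than round-robin.
        replicas = [NODES[(i + r) % len(NODES)] for r in range(REPLICATION)]
        layout[i] = (block, replicas)
    return layout

layout = store_file(b"hello hadoop!")
for idx, (block, replicas) in layout.items():
    print(idx, block, replicas)
```

With this layout, a read of any block can be served by whichever replica node is closest, and a failed node only requires re-replicating its blocks from the surviving copies.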