Friday, 2 January 2015

What is Hadoop?


Hadoop is a framework designed specifically for handling large volumes of data. The intention behind the development of Hadoop was to build a scalable, low-cost framework that can process large datasets. Hadoop has a distributed file system and a distributed processing layer, both of which sit on top of several commodity machines. The teamwork of these commodity machines is the strength of Hadoop.

The distributed storage layer of Hadoop is called the Hadoop Distributed File System (HDFS), and the distributed processing layer is called MapReduce. The ideas behind HDFS and MapReduce came from Google frameworks, namely the Google File System (GFS) and Google MapReduce.
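
To give a feel for what a MapReduce job looks like, here is a minimal word-count sketch using the Hadoop MapReduce Java API. It is the classic introductory example, not something from this post; the class name and the input/output paths passed on the command line are illustrative assumptions.

// Minimal word-count sketch for Hadoop MapReduce (Java API).
// The map phase emits (word, 1) pairs; the reduce phase sums them per word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for every word in the input split, emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The input files live in HDFS, the map and reduce tasks run on the commodity machines that store the data, and the results are written back to HDFS.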

Hadoop is designed to run on commodity hardware, which reduces cost. In many other data processing setups, the hardware itself is expected to handle faults, but in Hadoop the framework handles hardware failure, for example by replicating data across machines. Hadoop therefore does not require any RAID arrangement of disks; it just needs the disks in a JBOD configuration, where JBOD stands for "just a bunch of disks".
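
As a rough illustration of these two points, here is a small hdfs-site.xml fragment. The property names (dfs.replication and dfs.datanode.data.dir) are standard HDFS settings, but the values and directory paths below are assumptions for the sake of the example.

<configuration>
  <!-- Each HDFS block is replicated to 3 DataNodes, so the framework
       tolerates the loss of individual disks or machines. -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- JBOD style: each physical disk is listed as its own data directory,
       with no RAID controller in between (paths are illustrative). -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/disk1/dfs,/data/disk2/dfs,/data/disk3/dfs</value>
  </property>
</configuration>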
