Friday 2 January 2015

What is Hive?


Hive is one of the members of the Hadoop ecosystem. Hadoop is written in Java, so initially it was accessible only to engineers who knew Java. Later, some smart people at Facebook designed a layer on top of Hadoop that acts as a mediator between SQL experts and Hadoop. They built an application that accepts SQL-like queries (HiveQL), parses them, and talks to Hadoop on the user's behalf. This application is called Hive. Internally, it reads data from HDFS and processes it with MapReduce jobs. The main advantage is that the user doesn't need to worry about the complexity of writing lengthy MapReduce programs. Through Hive, Hadoop became popular among SQL experts. Another advantage of Hive is that some solutions can be developed much faster than by writing Java programs. Sometimes a few lines of queries can do the work of several hundred lines of code.
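
To give a feel for what a query saves you from writing, here is a minimal sketch in Python of the map, shuffle and reduce steps behind a simple aggregation. It is only an illustration, not how Hive or Hadoop is implemented: the `records` data, the function names, and the `employees` table in the comment are all made up for this example.

```python
from collections import defaultdict

# Roughly what a query like this asks for (hypothetical table and columns):
#   SELECT dept, SUM(salary) FROM employees GROUP BY dept;

# Hypothetical input: each record is (department, salary).
records = [("sales", 100), ("eng", 200), ("sales", 50), ("eng", 25)]

def map_phase(rows):
    """Emit one (key, value) pair per input row."""
    for dept, salary in rows:
        yield dept, salary

def shuffle(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Collapse each key's values into a single total."""
    return {key: sum(values) for key, values in groups.items()}

totals = reduce_phase(shuffle(map_phase(records)))
print(totals)
```

Even in this toy form the three steps take more lines than the one-line query in the comment, and a real MapReduce job in Java also needs job setup, input and output formats, and so on.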

In Hive, data is represented as tables. A table is a representation of data with a schema. In Hadoop, the data itself is stored in HDFS, so when we look at that data through a schema, we can view it in tabular format. Hive stores the schema in the metastore, a lightweight database that holds the metadata of the tables. By default Hive uses an embedded Derby database, which is not suitable for production or multi-user environments, so people usually use MySQL, PostgreSQL, or a similar database as the metastore.
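
The idea of looking at raw data through a schema can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Hive's actual code: `raw_data` stands in for a file in HDFS, the `metastore` dictionary stands in for the metastore database, and the `users` table and `read_table` function are hypothetical names.

```python
import csv
import io

# Stand-in for a file in HDFS: raw comma-separated text with no header.
raw_data = "1,alice,30\n2,bob,25\n"

# Stand-in for a metastore entry: table name -> ordered (column, type) pairs.
metastore = {"users": [("id", int), ("name", str), ("age", int)]}

def read_table(name, data):
    """Apply the stored schema to the raw text at read time."""
    schema = metastore[name]
    rows = []
    for fields in csv.reader(io.StringIO(data)):
        # Pair each raw field with its column name and convert it to the column type.
        rows.append({col: cast(value) for (col, cast), value in zip(schema, fields)})
    return rows

users = read_table("users", raw_data)
for row in users:
    print(row)
```

The raw text never changes; only the schema kept on the side decides how it is read as a table.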

