Friday 2 January 2015

Managed and External Tables in Hive

In Hive, we can create two types of tables:
  • Managed table
  • External table
By default, Hive stores table data in the Hive warehouse directory. When we create a table in Hive without specifying a location, a directory corresponding to the table is created inside the warehouse directory, which is a location in HDFS where Hive keeps the data of all such tables. The default location of the warehouse directory is /user/hive/warehouse. We can change this location globally by setting the hive.metastore.warehouse.dir property to a different value in hive-site.xml.
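For example, a minimal hive-site.xml entry to relocate the warehouse would look like the following (the path /data/hive/warehouse is just an illustrative value):

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/data/hive/warehouse</value>
</property>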

We can point a Hive table to any other location in HDFS instead of the default storage location. The main difference between managed and external tables is what happens on DROP TABLE: dropping a managed table deletes both the table metadata and the data, whereas dropping an external table deletes only the metadata; the data stays in HDFS.
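To check whether an existing table is managed or external, we can inspect its metadata (details here is just the table name used later in this post):

describe formatted details;
-- the output contains a line like: Table Type: MANAGED_TABLE (or EXTERNAL_TABLE)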
External tables are very useful in scenarios where the input data needs to be shared between multiple jobs or users.

Suppose a workflow in which dataset A is the input to processes B, C and D, where B is a Hive job, C is a MapReduce job and D is a Pig job. If we use a managed table for B, loading the data moves it from A's actual location into the warehouse directory, so when processes C and D try to access the data, it is no longer at the original location. Worse, if the user drops the table at the end of process B, the input data is deleted as well, which is not acceptable in this situation. An external table pointing at A's location avoids both problems.
Sample DDL for creating a managed Hive table is given below.

create table details (id int, name string)
row format delimited
fields terminated by ','
lines terminated by '\n';

Sample DDL for creating an external table

create external table details_ext (id int, name string)
row format delimited
fields terminated by ','
lines terminated by '\n'
location '/user/hadoop/external_table';

The location specified for the external table can be any location in HDFS. You can omit the 'lines terminated by' clause in the DDL because its default value is '\n'.
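To see the difference in behaviour, here is a short sketch using the two tables above (the input path /user/hadoop/input/details.csv is hypothetical):

-- moves the file from its original HDFS location into the warehouse directory
load data inpath '/user/hadoop/input/details.csv' into table details;

-- deletes both the table metadata and the data in the warehouse directory
drop table details;

-- deletes only the table metadata; the files under
-- /user/hadoop/external_table remain in HDFS
drop table details_ext;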
