In Hive, we can create two types of tables:
- Managed table
- External table
By default, Hive stores table data in the Hive warehouse
directory. When we create a table in Hive, a directory corresponding to the
table is created in the warehouse directory. The warehouse directory
is a location in HDFS where Hive stores the data of all tables that we
create without specifying a location. By default, the location of the
warehouse directory is /user/hive/warehouse.
We can change this location globally by setting the
hive.metastore.warehouse.dir property to a different value in hive-site.xml.
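To see which warehouse location a session is actually using, the property can be printed from the Hive prompt. This is a minimal sketch; it assumes the property name hive.metastore.warehouse.dir and an interactive Hive CLI or Beeline session.

```sql
-- Print the current value of the warehouse directory property.
-- With an unmodified configuration this is expected to be /user/hive/warehouse.
set hive.metastore.warehouse.dir;
```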
We can also point an individual Hive
table to any other location in HDFS instead of the default storage location.
The main difference between external and managed tables is what happens on drop:
if we drop a managed table, both the table definition and the data are deleted,
but if we drop an external table, only the table definition is deleted and the
data stays where it is.
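The drop behaviour can be checked with a small experiment. The sketch below is only illustrative: the table names demo_managed and demo_ext and the path /tmp/demo_ext are hypothetical, and the dfs command assumes the Hive CLI.

```sql
-- Hypothetical tables, used only to illustrate the drop behaviour.
create table demo_managed (id int, name string)
row format delimited fields terminated by ',';

create external table demo_ext (id int, name string)
row format delimited fields terminated by ','
location '/tmp/demo_ext';

-- Dropping the managed table removes its metadata and its directory
-- under the warehouse location (or moves the data to trash, if trash is enabled).
drop table demo_managed;

-- Dropping the external table removes only the metadata.
drop table demo_ext;

-- The directory of the external table should still be present in HDFS.
dfs -ls /tmp/demo_ext;
```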
External tables are very useful in scenarios where we
need to share the input data between multiple jobs or users.
Suppose a workflow with A as the input of processes B, C and D.
B is a Hive job, C is a MapReduce job and D is a Pig job. If we use a
managed Hive table for B, loading the data moves it
from A's actual location to the warehouse directory. So when the
other processes C and D try to access the data, it is no longer present in its
original location. Also, if the user drops the table at the end of process B,
the input data is deleted, which may not be acceptable in this situation.
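For the workflow above, B could read A's data in place through an external table. This is a sketch under assumptions: the path /data/workflow/a_output, the table name a_output and the column layout are hypothetical and not taken from a real workflow.

```sql
-- Hypothetical path: assume A's comma-separated files live here.
create external table a_output (id int, name string)
row format delimited fields terminated by ','
location '/data/workflow/a_output';

-- B queries the files where they are; nothing is moved to the warehouse directory,
-- so C (MapReduce) and D (Pig) can still read the same path.
select count(*) from a_output;

-- By contrast, loading into a managed table moves the files out of A's location:
-- load data inpath '/data/workflow/a_output' into table some_managed_table;
```

Dropping a_output at the end of B removes only the table definition, so the input remains available to the other jobs.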
Sample DDL for creating a managed hive table is given below.
create table details (id int, name string) row format delimited fields terminated by ',' lines terminated by '\n';
Sample DDL for creating an external table is given below.
create external table details_ext (id int, name string) row format
delimited fields terminated by ',' lines terminated by '\n'
location '/user/hadoop/external_table';
The location specified for the external table can be any
location in HDFS. You can omit the 'lines
terminated by' part of the DDL because the default value is '\n'.
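To confirm how each table was created, its metadata can be inspected. The statements below are a minimal sketch that assumes the two sample tables above already exist.

```sql
-- The "Table Type" field should read MANAGED_TABLE for details
-- and EXTERNAL_TABLE for details_ext.
describe formatted details;

-- The "Location" field shows where the data is stored; for details_ext
-- it should point to /user/hadoop/external_table.
describe formatted details_ext;
```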