Sunday 4 January 2015

How to import data from an RDBMS to Hadoop and vice versa?

Hadoop became very popular within a few years because of its robust design, its open-source nature, and its ability to handle large volumes of data. Nowadays a lot of RDBMS-to-Hadoop migration projects are happening. Hadoop is not a replacement for the RDBMS, but for certain use cases Hadoop can perform better than an RDBMS. Some projects may require data from an RDBMS along with data from multiple other sources for finding insights. In these scenarios, we need to transfer data from the RDBMS to the Hadoop environment. This task sounds simple, but it can be difficult in practice because it involves a lot of risk. The possible approaches for importing data from an RDBMS into Hadoop are explained below.

1) Using Sqoop
Sqoop is a Hadoop ecosystem component developed for importing data from an RDBMS into Hadoop and for exporting data from Hadoop back to an RDBMS. A Sqoop job runs as a MapReduce job, and Sqoop uses Hadoop's parallelism to perform the import and export in parallel. Internally, Sqoop runs as a map-only job (mappers only, no reducers) that transfers the data over JDBC. To use Sqoop, we need good network connectivity between the RDBMS environment and the Hadoop environment.
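
For example, a minimal import and export might look like the following sketch. The connection string, credentials, table names, and directories here are placeholder values for illustration, not from a real setup:

    # Import a table from MySQL into HDFS using 4 parallel mappers
    sqoop import \
      --connect jdbc:mysql://dbserver:3306/salesdb \
      --username dbuser -P \
      --table customers \
      --target-dir /data/customers \
      --num-mappers 4

    # Export (possibly processed) HDFS data back into a MySQL table
    sqoop export \
      --connect jdbc:mysql://dbserver:3306/salesdb \
      --username dbuser -P \
      --table customers_processed \
      --export-dir /data/customers_out

Here -P makes Sqoop prompt for the password instead of putting it on the command line, and --num-mappers controls the degree of parallelism, i.e. the number of mapper tasks (and hence parallel JDBC connections) used for the transfer.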

2) By dumping the data from the database and transferring it via portable secondary storage devices
Many companies do not allow direct network connectivity to the RDBMS environment from Hadoop. Another reason for not allowing it is that when a Sqoop job is triggered, the volume of data flowing through the network will be very high, which can affect the performance of other systems connected to the same network. In such cases, the data is transferred to the Hadoop environment by dumping it from the database, copying it to a portable secondary storage device or to cloud storage (if allowed), and then loading it into the Hadoop environment.
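
As a rough sketch of that workflow, assuming a MySQL source (the database, table, file, and HDFS paths below are placeholders for illustration):

    # On the database host: dump the table as tab-separated values
    mysql --batch -u dbuser -p -e "SELECT * FROM salesdb.customers" > customers.tsv

    # Copy customers.tsv to the portable storage device or cloud storage,
    # carry it to the Hadoop environment, then load it into HDFS:
    hdfs dfs -mkdir -p /data/customers
    hdfs dfs -put customers.tsv /data/customers/

Once the file is in HDFS, it can be processed like any other dataset, for example by defining a Hive table over the directory.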
