Big Data Hadoop & Spark

Incremental Import in Sqoop to Load Data from MySQL into HDFS

This post covers advanced topics in Sqoop, beginning with ways to import recently updated data from a MySQL table into HDFS. If you are new to Sqoop, you can browse through Installing MySQL and Sqoop and the Beginner's Guide to Sqoop for basic Sqoop commands.
Note: Make sure your Hadoop daemons are up and running. This walkthrough was done on a Cloudera system.
Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.
You should specify append mode when importing a table where new rows are continually added with increasing row id values. You must specify the column containing the row's id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.
The alternate table-update strategy supported by Sqoop is called lastmodified mode. Use it when rows of the source table are updated, and each such update sets the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the one specified with --last-value are imported.
At the end of an incremental import, the value that should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically if you create the incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import.
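For recurring runs, the import can be saved as a Sqoop job, so that Sqoop records and updates the last value itself. A minimal sketch, assuming the table acad and a check column named id (the job name incjob is made up for illustration):

```shell
# create a saved incremental-import job; Sqoop stores --last-value after each run
sqoop job --create incjob -- import \
  --connect jdbc:mysql://localhost/db1 \
  --username root --password cloudera \
  --table acad --target-dir /sqoopout -m 1 \
  --incremental append --check-column id --last-value 0

# execute the saved job; subsequent runs import only rows with id above the stored value
sqoop job --exec incjob
```

Note the bare `--` separating the job options from the import arguments.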


Let's walk through the step-by-step procedure to perform an incremental import from a MySQL table.
Step 1:
Start the MySQL service with the command below:
sudo service mysqld start
Then enter the MySQL shell (note there is no space between -p and the password):
mysql -u root -pcloudera

Step 2:
List the existing databases:
show databases;
Create a new database:
create database db1;
Switch to the new database:
use db1;

Step 3:
Create a table and insert values into it using the following syntax:
create table <table name>(<column1 name> <type>, <column2 name> <type>);
insert into <table name> values(<column1 value1>, <column2 value1>);
insert into <table name> values(<column1 value2>, <column2 value2>);
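As a concrete sketch, the table used in the import command below could be created like this (the column names id and name are assumptions for illustration; an increasing id column is what append mode will later check):

```sql
-- create a sample table with an increasing id column
create table acad(id int primary key, name varchar(50));

-- insert a few rows
insert into acad values(1, 'alpha');
insert into acad values(2, 'beta');
insert into acad values(3, 'gamma');
```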

 
Step 4:
With the data present in the MySQL table and Sqoop up and running, fetch the data using the following command:
sqoop import --connect jdbc:mysql://localhost/db1 --username root --password cloudera --table acad -m 1 --target-dir /sqoopout

As confirmation of the result, you can see in the image the message Retrieved 3 records.
Step 5:
Let's check whether any data has been stored in HDFS by running the following command in the terminal:
hadoop fs -ls /sqoopout/
This shows that a part file has been created in our target directory. We can view the contents of the part file with the following command:
hadoop fs -cat /sqoopout/part-m-00000

This confirms that the data from MySQL has arrived in HDFS. But what if the data in MySQL keeps growing, and the table now has more rows than before? The following steps address exactly that.
Step 1:
Let's manually insert a few extra rows into the acad table in MySQL.

Now, the same import command with a little extra syntax will let you feed only the new values in the table acad into HDFS.
Step 2:
The following options enable the incremental mode of the Sqoop import command:
--incremental <mode>
--check-column <column name>
--last-value <last check column value>
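Putting it together, the full command looks like this. This is a sketch, assuming the check column is named id and the previous import ended at id 3:

```shell
# import only rows whose id is greater than 3, appending to the existing target directory
sqoop import --connect jdbc:mysql://localhost/db1 \
  --username root --password cloudera \
  --table acad --target-dir /sqoopout -m 1 \
  --incremental append --check-column id --last-value 3
```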

As you can see in the above image, 3 more records have been retrieved and the incremental import is now complete. The output also prints the value to pass as --last-value for the next incremental import; here it is 10.
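The next run would then pass that printed value, so only rows added after this import are fetched (again assuming the check column id):

```shell
# subsequent run: import only rows with id greater than 10
sqoop import --connect jdbc:mysql://localhost/db1 \
  --username root --password cloudera \
  --table acad --target-dir /sqoopout -m 1 \
  --incremental append --check-column id --last-value 10
```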
Step 3:
Now let’s check and confirm the new data inside HDFS.

This is how an incremental import is performed each time, for any number of new rows.

Keep visiting our website www.acadgild.com for more blogs on Big Data, Python, and other technologies.


8 Comments

  1. Hello Prateek,
    Nice article. Thank you for the details. Can you please clarify how to handle the scenarios below?
    Scenario: 1
    I have a table with 10 records which are first imported into HDFS from MySQL. Later, I have two rows updated and 3 rows newly inserted into MySQL.
    Scenario: 2
    I have a table with a primary key but without increasing/incrementing values. How do we handle such cases, since last-value cannot help here?
    Scenario: 3
    I have a table with just data and no date/timestamp column available. How do we handle such cases, since lastmodified cannot help here?
    I am just an amateur in Hadoop and Sqoop. I would be much obliged if you could explain these with examples, the way you have explained above.
    Regards,
    Vignesh

    1. Hi Vignesh,
      I have just seen this, and I know it's too late a response for you, but it might be helpful for others 🙂
      Scenario 1: this can be handled through the lastmodified mode, which is covered above.
      Scenario 2: when no increment happens, why would one use an incremental operation at all... not a valid scenario.
      Scenario 3: yes, lastmodified cannot be used if the column is not a date/timestamp.
      Regards,
      Aziz.

  2. Hello,
    Really a very nice article. But my question is how to automate the above jobs. I cannot go and put in the last value every time; I have to schedule the jobs daily on the basis of date. Is there any way we can automate the above jobs, as we do in other ETL tools such as Informatica/SAP BODS?

  3. Hello,
    For incremental loading I am getting an error asking for a primary key, but the screenshots above don't have one either.
    My command:
    sqoop import --connect jdbc:mysql://localhost/sqoopoff --username root -P --table employee --target-dir /sqoopimport/ --incremental append --check-column id --last-value 3;
    17/10/06 01:23:58 ERROR tool.ImportTool: Error during import: No primary key could be found for table employee. Please specify one with --split-by or perform a sequential import with '-m 1'.
