
A Quick Guide to Advanced Apache Sqoop Operations


Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. Sqoop can import data from structured sources like MySQL into Hadoop storage such as HDFS, and export it back again.

In our previous blogs on Apache Sqoop, we have discussed:

  • Importing a particular table in MySQL into HDFS

  • Importing all the tables in a database into HDFS

  • Importing table data into a specific directory in HDFS

  • Importing specific table data into HDFS

  • Importing table data as a Sequence file into HDFS

  • Importing table data as an Avro file into HDFS

  • Incremental imports in Sqoop

  • Exporting data from HDFS to a MySQL table

For more details, we recommend going through our previous blogs on Sqoop:

https://acadgild.com/blog/beginners-guide-for-sqoop/

https://acadgild.com/blog/sqoop-tutorial-incremental-imports/

In this tutorial, we will be discussing some advanced concepts in Apache Sqoop:

  • Password protection techniques

  • Compressing data after importing to HDFS

  • Controlling parallelism

  • Speeding up transfers

  • Handling null values

Password Protection Techniques

Giving your password directly on the command line while executing an Apache Sqoop command is highly insecure. For that reason, Sqoop provides two options to protect it. Let us explore how to secure your password while transferring data.

Reading the Password from Standard Input

We have the following data in our MySQL server: a database named Acadgild containing an employee table.

Now we will import this data from MySQL into HDFS using Sqoop.

sqoop import --connect jdbc:mysql://localhost/Acadgild --username root --table employee -P

Users can use the above command to transfer data from MySQL to HDFS; the -P option at the end tells Sqoop to prompt for the password on standard input, so it never appears in the command line or shell history.

When you run the command, Sqoop prompts for the password and does not echo what the user types, so no one can see it. This is one way of securing the password.

Reading the Password from a File:

Here is how to write your password to a file:

echo -n "kiran0541" >> mysql_sqoop_password

You need to use the -n option. Otherwise, echo appends a trailing newline that silently becomes part of the stored password, and when Sqoop reads the file it fails with an "Access denied for user" error.
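The effect of the trailing newline is easy to verify with wc. In this sketch, the file names are just illustrative; the password matches the one used elsewhere in this tutorial:

```shell
# Without -n, echo appends a newline, which becomes part of the password
echo "kiran0541" > with_newline
echo -n "kiran0541" > without_newline

# "kiran0541" is 9 characters; the newline adds a 10th byte
wc -c < with_newline      # 10
wc -c < without_newline   # 9
```

That extra tenth byte is exactly what makes the database reject the password.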

File in the Local System:

In this method, the password is stored in a file on the local filesystem, and we pass the file's path so that Sqoop can read the password from it. You should also set the file's permissions to 400 so that no other user can read it.

sqoop import --connect jdbc:mysql://localhost/Acadgild --username root --table employee --password-file file:///home/kiran/mysql_sqoop_password --driver com.mysql.jdbc.Driver -m 1
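Setting the 400 permission mentioned above is a one-liner. A minimal sketch, using /tmp as a stand-in for wherever you keep the file:

```shell
# Write the password (no trailing newline) and lock the file to owner read-only
rm -f /tmp/mysql_sqoop_password
echo -n "kiran0541" > /tmp/mysql_sqoop_password
chmod 400 /tmp/mysql_sqoop_password

# Confirm the mode: 400 means only the owner can read, and no one can write
stat -c '%a' /tmp/mysql_sqoop_password   # 400
```

With mode 400, other users on the machine get "Permission denied" if they try to read the password file.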

File in HDFS:

In this method, Sqoop reads your password from a file stored in the HDFS cluster. Again, set the file's permissions to 400 so that no other user can read it.

sqoop import --connect jdbc:mysql://localhost/Acadgild --username root --table employee --password-file mysql_sqoop_password --driver com.mysql.jdbc.Driver -m 1

Sqoop Errors:


If your Sqoop import command throws an error like the following:

ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@... is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.

then you need to explicitly specify the MySQL driver, as shown: --driver com.mysql.jdbc.Driver

If no primary key is set on your MySQL table, Sqoop will throw an error like this:

ERROR tool.ImportTool: Error during import: No primary key could be found for table employee. Please specify one with --split-by or perform a sequential import with '-m 1'.

In such a case, you need to specify the -m 1 option in your Sqoop command (or name a split column with --split-by). With -m 1, Sqoop limits the number of mappers to one, so the table is transferred sequentially in a single task instead of being split across parallel mappers.

Compress Imported Data

To decrease the size of the data after importing to HDFS, we can use the --compress option (or its shorthand -z) while executing the command.

sqoop import --connect jdbc:mysql://localhost/Acadgild --username root --table employee -P --driver com.mysql.jdbc.Driver -m 1 --compress
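By default, --compress uses the gzip codec, so the import produces part files like part-m-00000.gz instead of plain text (a different codec can be chosen with --compression-codec). The space saving depends on the data; as a rough local illustration with ordinary gzip on a generated sample file:

```shell
# Generate a sample text file (stand-in for an uncompressed part file)
seq 1 10000 > sample.txt

# Compress it with gzip, keeping the original for comparison
gzip -k sample.txt

# The .gz file is substantially smaller than the original
wc -c sample.txt sample.txt.gz
```

Compressed part files are transparently decompressed by Hadoop tools such as MapReduce and Hive when they read the data back.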

Controlling Parallelism

By default, Sqoop uses 4 mappers running in parallel to transfer data to HDFS. We can control the parallelism with the option --num-mappers mappers_count (-m is its shorthand). Note that using more than one mapper requires a primary key or an explicit --split-by column.

sqoop import --connect jdbc:mysql://localhost/Acadgild --username root --table employee -P --driver com.mysql.jdbc.Driver --num-mappers 5
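To divide the work, Sqoop queries the minimum and maximum of the split column and slices that range evenly, one slice per mapper. The column name and range below are made up for illustration; this is only a sketch of the arithmetic, not Sqoop's actual implementation:

```shell
# Sketch: split an id column spanning 1..100 across 4 mappers
min=1; max=100; mappers=4
size=$(( (max - min + 1) / mappers ))

for i in $(seq 0 $(( mappers - 1 ))); do
  lo=$(( min + i * size ))
  # Last mapper takes any remainder up to max
  if [ "$i" -eq $(( mappers - 1 )) ]; then hi=$max; else hi=$(( lo + size - 1 )); fi
  echo "mapper $i: WHERE id >= $lo AND id <= $hi"
done
```

Each mapper then runs its own SELECT with a WHERE clause like the ones printed, which is why a heavily skewed split column can leave some mappers with far more rows than others.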

Speeding Up Transfers

Sqoop can handle bulk transfers very well, and you can speed them up further with the --direct parameter. Normally, Sqoop uses JDBC as the interface for transferring data; in direct mode, it instead uses the native utilities of the database vendor. For MySQL, Sqoop uses mysqldump and mysqlimport for the transfers; for PostgreSQL, it uses the pg_dump utility to import the data.

sqoop import --connect jdbc:mysql://localhost/Acadgild --username root --table employee -P --driver com.mysql.jdbc.Driver -m 1 --direct

Handling Null Values

We have added a few null values to our MySQL table. Let's see how Sqoop handles them.

sqoop import --connect jdbc:mysql://localhost/Acadgild --username root --table employee -P --driver com.mysql.jdbc.Driver -m 1 --null-string '\\N' --null-non-string '\\N'

To replace null values in a table with \N on import, we need to use the --null-string and --null-non-string parameters.

We need to use --null-string to override the null values of columns with string data types (CHAR, VARCHAR, NCHAR, TEXT, and a few others). To override the null values of all other column types, we need to use the --null-non-string parameter.
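After the import, each SQL NULL appears in the HDFS part files as the literal text \N, which is the marker Hive recognizes as NULL. The file below is a made-up sample of what such a part file might contain:

```shell
# Illustrative part file: row 2 had a NULL name, row 3 a NULL city in MySQL
printf '%s\n' '1,kiran,bangalore' '2,john,\N' '3,\N,chennai' > part-m-00000

# Count the NULL markers (-F matches the backslash literally)
grep -o -F '\N' part-m-00000 | wc -l    # 2
```

If you later load this data into a Hive table, those \N fields come back as NULL, which is the main reason for choosing \N over an empty string here.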

With the null values in place in our MySQL table, we run the Sqoop command above to transfer the data.

Inspecting the imported files in HDFS, you can see that every null value has been replaced by \N.

Hope this blog helped you understand how to import data into HDFS from MySQL. Keep visiting our site www.acadgild.com for more updates on big data and other technologies.
