Let’s have a look at a technique for copying files from your local file system into HDFS, and vice versa, using HDFS-Slurper.
HDFS-Slurper is a utility developed by Alex Holmes, the author of the book Hadoop in Practice. It automates file copy from your local file system (LFS) to HDFS and vice versa.
You can download hdfs-file-slurper from the below link
Download the tar file from the above link, and untar it using the command
tar -xzf hdfs-slurper-0.1.8-package.tar.gz
Now you can see that a folder named hdfs-slurper-0.1.8 has been created in the location where you extracted the archive.
Open that folder and you will find conf directory. In this directory, you will find a file called slurper-env.sh in which you need to provide the path to $HADOOP_HOME/bin directory which is as follows.
Note: Before following this procedure, please ensure that you have Hadoop installed on your system and that the HDFS daemons are up and running.
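One way to confirm that the daemons are up is to look for them in the output of jps, which ships with the JDK. The sketch below assumes a single-node setup where NameNode and DataNode run on the same machine; the function name check_hdfs_daemons is ours, not part of hdfs-slurper.

```shell
# Report whether the HDFS daemons appear in the jps process list.
check_hdfs_daemons() {
  status=0
  for daemon in NameNode DataNode; do
    # jps prints "<pid> <main class>"; compare the second column exactly
    # so that SecondaryNameNode is not mistaken for NameNode.
    if jps | awk '{print $2}' | grep -x "$daemon" >/dev/null; then
      echo "$daemon is running"
    else
      echo "$daemon is NOT running"
      status=1
    fi
  done
  return "$status"
}

# Only run the check when jps is actually on the PATH.
if command -v jps >/dev/null 2>&1; then
  check_hdfs_daemons || echo "Start HDFS before launching the slurper"
fi
```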
File copy from LFS to HDFS
Now, in the same conf directory, you will find another file called slurper.conf. Here you need to provide settings such as the LFS location where the files will be present and the HDFS location to which your files should be copied.
The following variables store your configuration. Files are copied according to these settings.
DATASOURCE_NAME is a name for this data source. It is used in the names of the PID file and the log file.
SRC_DIR is the directory location where your data will be present in your local file system. The files which are present in this directory will be copied to HDFS.
WORK_DIR is the directory where your files will get copied from SRC_DIR, before they get copied into HDFS.
COMPLETE_DIR is the directory which stores the files that are successfully copied into HDFS. Files from the WORK_DIR will be moved here after successful completion.
REMOVE_AFTER_COPY can be set to true or false. If it is set to true, files that were successfully copied into HDFS are removed and will not be present in the COMPLETE_DIR. If it is set to false, the source files are moved into the COMPLETE_DIR after a successful copy.
ERROR_DIR is the directory which stores the files for which an error occurred while copying from LFS to HDFS.
DEST_STAGING_DIR is a directory in HDFS. Files are first copied into this directory before they are moved into the destination directory.
DEST_DIR is the destination directory in HDFS. This is where the files finally end up.
Now please set the directories according to your environment.
Note: Please make sure that you give complete paths, including the scheme, for the directories in both LFS and HDFS. For example, an LFS path looks like file:///home/kiran/Desktop/slurper/src and an HDFS path looks like DEST_DIR = hdfs://localhost:9000/slurper/dest
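As an illustration, the snippet below writes a sample configuration covering only the variables described above. Only the SRC_DIR and DEST_DIR values come from this post; the data source name and the work, complete, error and staging paths are assumptions, and the slurper.conf shipped with the package may contain further keys. The file is written to a scratch location so that your real conf directory is not touched.

```shell
# Write a sample configuration to a scratch file (nothing in conf/ is overwritten).
# Except for SRC_DIR and DEST_DIR, every value here is an assumed placeholder.
conf_file="$(mktemp)"
cat > "$conf_file" <<'EOF'
DATASOURCE_NAME = test
SRC_DIR = file:///home/kiran/Desktop/slurper/src
WORK_DIR = file:///home/kiran/Desktop/slurper/work
COMPLETE_DIR = file:///home/kiran/Desktop/slurper/complete
REMOVE_AFTER_COPY = false
ERROR_DIR = file:///home/kiran/Desktop/slurper/error
DEST_STAGING_DIR = hdfs://localhost:9000/slurper/staging
DEST_DIR = hdfs://localhost:9000/slurper/dest
EOF

# Flag any directory setting that is not a complete file:// or hdfs:// path.
grep '_DIR = ' "$conf_file" | grep -Ev ' = (file|hdfs)://' \
  || echo "All directory values are complete paths"
```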
Before copying, please make sure that the destination directories exist in your HDFS. You can check this with the command hadoop fs -ls /path_to_dir
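If the directories are missing, they can be created with hadoop fs -mkdir -p. The sketch below wraps the check and the creation in a helper; the function name ensure_hdfs_dir and the HADOOP_CMD variable are ours, and the staging path is an assumed example.

```shell
# HADOOP_CMD lets you point at a specific hadoop binary; it defaults to "hadoop".
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

ensure_hdfs_dir() {
  # "fs -test -d" succeeds only if the path exists and is a directory.
  if "$HADOOP_CMD" fs -test -d "$1"; then
    echo "exists: $1"
  else
    "$HADOOP_CMD" fs -mkdir -p "$1" && echo "created: $1"
  fi
}

# Run only when the hadoop command is available on this machine.
if command -v "$HADOOP_CMD" >/dev/null 2>&1; then
  ensure_hdfs_dir hdfs://localhost:9000/slurper/staging   # assumed staging path
  ensure_hdfs_dir hdfs://localhost:9000/slurper/dest
  "$HADOOP_CMD" fs -ls hdfs://localhost:9000/slurper
fi
```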
Now we are ready to start the process. For that follow the below mentioned procedure. First create a tmp directory for slurper as shown below.
sudo mkdir -p /tmp/slurper/in
Now create an empty file in that "in" directory.
sudo touch /tmp/slurper/in/sample_file.txt
Now let us navigate to the slurper installed directory
Now use the below command to start hdfs-slurper
bin/slurper.sh --config-file conf/slurper.conf
Now if you copy anything into the SRC_DIR then the file will get automatically copied into the DEST_DIR which is present in your HDFS.
Now we have copied one sample file into the SRC_DIR and that will be copied into HDFS within seconds.
In the above screen shot you can see a file has been kept in the src directory. This file will be moved into COMPLETE_DIR soon after the file is copied into HDFS. All these things will happen within seconds.
Now let us check for the file in HDFS DEST_DIR.
In the above screen shot you can see that the file has been successfully copied into the DEST_DIR. Now we will check the COMPLETE_DIR for the same.
We have given false as the parameter to REMOVE_AFTER_COPY. So, file should be present in the COMPLETE_DIR after a successful copy. You can check the same in the below screen shot.
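The same checks can be done from the command line. The sketch below assumes the file is named sample_file.txt and that COMPLETE_DIR is /home/kiran/Desktop/slurper/complete; both are placeholders, as are the helper name verify_copy and the HADOOP_CMD variable.

```shell
HADOOP_CMD="${HADOOP_CMD:-hadoop}"
dest_dir="hdfs://localhost:9000/slurper/dest"
complete_dir="/home/kiran/Desktop/slurper/complete"   # assumed local COMPLETE_DIR

verify_copy() {
  # $1 = file name to look for
  # "fs -test -e" succeeds if the path exists in HDFS.
  "$HADOOP_CMD" fs -test -e "$dest_dir/$1" \
    || { echo "$1 is not in DEST_DIR yet"; return 1; }
  # With REMOVE_AFTER_COPY = false the source file should now sit in COMPLETE_DIR.
  [ -f "$complete_dir/$1" ] \
    || { echo "$1 is not in COMPLETE_DIR yet"; return 1; }
  echo "$1 is in DEST_DIR and COMPLETE_DIR"
}

# Run only when the hadoop command is available on this machine.
if command -v "$HADOOP_CMD" >/dev/null 2>&1; then
  verify_copy sample_file.txt || true
fi
```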
So we have successfully automated the file copy from LFS to HDFS. If you now configure a data source such as Kafka or Flume to write its output into the SRC_DIR, that data source will fetch the data and store it in the SRC_DIR, and hdfs-slurper will automatically move those files into HDFS.
File copy from HDFS to LFS
Now let us see how to automate a file copy from HDFS to LFS. The procedure is the same as that of LFS to HDFS. The only change is that you need to change the below directory paths.
We have created one more file called slurper1.conf file in the same $hdfs_slurper_HOME/conf directory. We have changed the configurations as shown below.
Let us create the directories in HDFS.
So we now have all the directories ready to copy the files from HDFS to LFS. Let us start this new configuration file of slurper.
Note: Make sure that you have created the destination and staging directories in your local file system and you have provided the correct paths in the conf file.
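A minimal sketch of preparing the local side is shown below. By default it works in a scratch directory; set LOCAL_BASE (our own variable name) to your real base directory, for example /home/kiran/Desktop/slurper as seen in the log output later in this post.

```shell
# LOCAL_BASE defaults to a scratch directory so the example is safe to run as is.
LOCAL_BASE="${LOCAL_BASE:-$(mktemp -d)}"

for d in staging dest; do
  mkdir -p "$LOCAL_BASE/$d"
  # The user running the slurper must be able to write into these directories.
  if [ -w "$LOCAL_BASE/$d" ]; then
    echo "ready: file://$LOCAL_BASE/$d"
  else
    echo "not writable: $LOCAL_BASE/$d"
  fi
done
```

The printed file:// values are the ones to use for the staging and destination settings in the conf file.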
Now let us copy one sample file into the SRC_DIR which is present in HDFS.
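The first line of the log output further down mentions bulk_data.tsv._COPYING_, which is the temporary name hadoop fs -put uses while an upload is in progress. This suggests the slurper can pick up a file before it is fully written. One way to avoid that, sketched below, is to upload into a separate HDFS directory and then rename the file into SRC_DIR. Both HDFS paths, the helper name feed_slurper and the HADOOP_CMD variable are assumptions for illustration, and the upload directory must already exist.

```shell
HADOOP_CMD="${HADOOP_CMD:-hadoop}"
hdfs_upload_dir="hdfs://localhost:9000/slurper/upload"   # assumed scratch directory in HDFS
hdfs_src_dir="hdfs://localhost:9000/slurper/src"         # assumed SRC_DIR of slurper1.conf

feed_slurper() {
  # $1 = local file to hand over to the slurper
  [ -f "$1" ] || { echo "no such local file: $1" >&2; return 1; }
  name="$(basename "$1")"
  # Upload outside SRC_DIR first, then rename into it once the upload is complete.
  "$HADOOP_CMD" fs -put "$1" "$hdfs_upload_dir/$name" \
    && "$HADOOP_CMD" fs -mv "$hdfs_upload_dir/$name" "$hdfs_src_dir/$name"
}

# Run only when hadoop is available and the sample file exists in the current directory.
if command -v "$HADOOP_CMD" >/dev/null 2>&1 && [ -f bulk_data.tsv ]; then
  feed_slurper bulk_data.tsv
fi
```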
While the copy is in progress, you can watch the console. It will print log lines like the following.
[WorkerThread-1] INFO com.alexholmes.hdfsslurper.FileSystemManager WorkerThread-1 – File copy successful, moving source hdfs://localhost:9000/slurper/work/bulk_data.tsv._COPYING_ to completed file hdfs://localhost:9000/slurper/complte/bulk_data.tsv._COPYING_
[WorkerThread-1] INFO com.alexholmes.hdfsslurper.WorkerThread WorkerThread-1 – Copying source file ‘hdfs://localhost:9000/slurper/work/bulk_data.tsv’ to staging destination ‘file:/home/kiran/Desktop/slurper/staging/352928865’
[WorkerThread-1] INFO com.alexholmes.hdfsslurper.WorkerThread WorkerThread-1 – Local file size = 38, HDFS file size = 38
[WorkerThread-1] INFO com.alexholmes.hdfsslurper.WorkerThread WorkerThread-1 – Moving staging file ‘file:/home/kiran/Desktop/slurper/staging/352928865’ to destination ‘file:/home/kiran/Desktop/slurper/dest/bulk_data.tsv’
[WorkerThread-1] INFO com.alexholmes.hdfsslurper.FileSystemManager WorkerThread-1 – File copy successful, moving source hdfs://localhost:9000/slurper/work/bulk_data.tsv to completed file hdfs://localhost:9000/slurper/complte/bulk_data.tsv
Now after the successful copy, you can check the dest directory for the data.
You can see that the file was successfully copied into your local file system. So we have now automated the file copy from HDFS to LFS as well. If you use the HDFS SRC_DIR as the output directory of your MapReduce programs, their output will be copied into your local file system automatically. You don’t need to retrieve the output using the hadoop fs -get command.
We hope you have understood the procedure to copy files from LFS to HDFS and vice versa using hdfs-slurper. Keep visiting our site www.acadgild.com for more updates on Big data and other technologies.