Process Data Stored in MongoDB Using Pig
MongoDB and Hadoop are a powerful combination and can be used together to deliver complex analytics and data processing for data stored in MongoDB. In this post we demonstrate how to use the MongoDB Connector for Hadoop to read MongoDB data into Hadoop and analyze it there. With this connector we can pull MongoDB data into Hadoop MapReduce jobs, process it, and write the results back to a MongoDB collection.
Now we will create a database in MongoDB, insert some data into it, and then dump that data using Pig. First, let’s create the database. In the mongo shell, the use command (use database_name) switches to the named database, and MongoDB creates the database when the first document is stored in it.
To list the databases present in MongoDB, run show dbs, as in the screenshot below. We have created a database called AcadGild.
To insert data into a collection in this database, the command is:
db.collection_name.insert(Json_data or variable_name)
You can store the data in a variable first and then insert that variable into the collection, as shown below.
db.user_details.insert(document)
Here our collection is named user_details.
Next, we insert JSON data directly, written on a single line without line breaks, into the same collection, as shown in the screenshot below.
Let us now view these two records in MongoDB using the command db.collection_name.find()
In the above screenshot, you can see the two results. We have two records in MongoDB. Let us now load this data into Pig.
To load this data into Pig, you need to download a few jar files and register them in the Pig shell.
You can download the jar files from the below link.
Download the three jar files from the above link and register them in the Pig shell as shown below.
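As a rough sketch of this step, jars are added to a Pig session with the REGISTER statement. The source does not name the three jar files, so the file names and paths below are assumptions based on a typical MongoDB Connector for Hadoop setup (the connector core, its Pig module, and the MongoDB Java driver). Substitute the files you actually downloaded.

```pig
-- Assumed paths and file names; replace them with the jars you downloaded.
REGISTER /path/to/mongo-hadoop-core.jar;
REGISTER /path/to/mongo-hadoop-pig.jar;
REGISTER /path/to/mongo-java-driver.jar;
```

Once the jars are registered, the com.mongodb.hadoop.pig.MongoLoader class used in the LOAD statements becomes available to the session.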
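Here we load the data in the user_details collection into the Pig relation raw, as shown below. The URI names the host, port, database, and collection. The database name is case-sensitive, so write it exactly as it appears in show dbs.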
raw = LOAD 'mongodb://localhost:27017/acadgild.user_details' USING com.mongodb.hadoop.pig.MongoLoader;
Here we haven’t specified a schema while loading the data, so the Pig relation is created without a schema. The same is shown in the below screenshot.
Now you can dump the relation raw to check the output, which appears in a JSON-like form.
In the above screenshot, you can see the two records that were inserted into MongoDB.
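The schema-less load and its checks can be sketched as follows. This assumes the connector jars are already registered and that, as the connector's documentation describes for loads without a schema, each MongoDB document arrives as a tuple holding a single map. The lookup of created_at is only an illustration that borrows a field name used later in this post.

```pig
-- Load the collection without a schema.
raw = LOAD 'mongodb://localhost:27017/acadgild.user_details'
      USING com.mongodb.hadoop.pig.MongoLoader;

-- No schema was supplied, so DESCRIBE has no field names to report.
DESCRIBE raw;

-- Print every record in the relation.
DUMP raw;

-- Assumption: $0 is the map for one document; '#' looks up a key in it.
created = FOREACH raw GENERATE $0#'created_at';
DUMP created;
```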
Let us now load the same data with a schema, as shown below.
raw = LOAD 'mongodb://localhost:27017/acadgild.user_details' using com.mongodb.hadoop.pig.MongoLoader('id, updated_at,created_at', 'id');
If we load the data from MongoDB with the above statement, only three values are loaded for each document: the object ID (MongoDB's _id field, exposed under the alias id given as the second argument), updated_at, and created_at. The same you can see in the below screenshot.
In the above screenshot, the schema of the relation, printed with the describe command, appears at the bottom, and above it you can see the output with its three columns.
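A minimal sketch of working with the schema-based relation is shown here, assuming the connector jars are already registered. The relation names ids_created and first_rows are hypothetical, chosen only for this illustration.

```pig
-- Load three fields; the second argument names the schema field
-- that receives MongoDB's _id value.
raw = LOAD 'mongodb://localhost:27017/acadgild.user_details'
      USING com.mongodb.hadoop.pig.MongoLoader('id, updated_at, created_at', 'id');

-- Show the field names declared in the schema string.
DESCRIBE raw;

-- With a schema in place, fields can be referenced by name.
ids_created = FOREACH raw GENERATE id, created_at;
first_rows = LIMIT ids_created 10;
DUMP first_rows;
```

Referring to fields by name is the main practical benefit of supplying a schema. Without one, values have to be pulled out of each document by key.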
We have successfully loaded data from MongoDB into Pig using the MongoDB Connector for Hadoop.
We hope this blog helped you understand how to load data from MongoDB into Pig to perform ETL operations on it. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.