Querying HDFS using Apache Drill

In this post, we will look at how to query files in HDFS using Apache Drill. We recommend reading our previous post on installing Apache Drill before going ahead with this one.

Note: Drill and Hadoop should be pre-installed in your system.

Begin your Drill session by moving into the bin folder of your Drill installation directory and running ./drill-embedded.

Next, open your browser and go to localhost:8047 to open Drill’s Web UI. Click on ‘Storage’, enable ‘dfs’, and then click on ‘Update’.

[Image: Drill Web UI]

[Image: Storage plugins in Drill]

Now, in the configuration, set the connection to the address of your HDFS NameNode, including its port. The configuration looks as shown below:

{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://localhost:9000/",
  "config": null,
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  }
}

Here, we have set the HDFS connection to hdfs://localhost:9000, that is, the NameNode running on localhost at port 9000. After configuring, click on ‘Update’ and go back to the home page.
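The same configuration can also be assembled in code, which makes it easy to confirm that the JSON is well formed before pasting it into the Web UI. The following is a minimal Python sketch, not part of Drill itself: the helper name build_dfs_config is hypothetical, and the connection string is the one assumed in this post. It reproduces only the part of the configuration shown above, so any other sections already present in your plugin should be kept when editing it.

```python
import json


def build_dfs_config(namenode="hdfs://localhost:9000/"):
    """Return the dfs storage plugin settings from this post as a dict.

    The function name and default argument are illustrative assumptions.
    """
    return {
        "type": "file",
        "enabled": True,
        "connection": namenode,
        "config": None,
        "workspaces": {
            "root": {"location": "/", "writable": False, "defaultInputFormat": None},
            "tmp": {"location": "/tmp", "writable": True, "defaultInputFormat": None},
        },
    }


if __name__ == "__main__":
    # json.dumps emits balanced, valid JSON (True -> true, None -> null).
    print(json.dumps(build_dfs_config(), indent=2))
```

If your NameNode listens on a different host or port, pass it as the argument, for example build_dfs_config("hdfs://namenode-host:8020/"), where the host name and port are placeholders.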

Now, open the terminal in which Drill is running and type use dfs; to switch the default schema to the HDFS storage plugin.

Now we can query the files in HDFS using Drill.

Drill supports the following file types:

  • Plain text files:
    • Comma-separated values (CSV, type: text)
    • Tab-separated values (TSV, type: text)
    • Pipe-separated values (PSV, type: text)

  • Structured data files:
    • Avro (type: avro)
    • JSON (type: json)
    • Parquet (type: parquet)

The configuration for each of these formats is defined in the ‘formats’ section of the storage plugin configuration. By default, the dfs plugin already contains entries for the file formats listed above. We can also define our own file format configuration in the HDFS storage plugin.
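To illustrate what defining our own file format could look like, here is a small Python sketch that adds one more delimited-text entry to a plugin configuration held as a dict. It assumes the entry shape Drill uses for text formats (a type, a list of extensions, and a delimiter). The helper add_text_format and the "ssv" format name and extension are made up for this example.

```python
import copy


def add_text_format(plugin_config, name, extensions, delimiter):
    """Return a copy of plugin_config with an extra delimited-text format.

    The original dict is left untouched so it can be compared with the result.
    """
    updated = copy.deepcopy(plugin_config)
    formats = updated.setdefault("formats", {})
    formats[name] = {
        "type": "text",
        "extensions": list(extensions),
        "delimiter": delimiter,
    }
    return updated


if __name__ == "__main__":
    # A cut-down configuration holding a single CSV-style entry.
    base = {"formats": {"csv": {"type": "text", "extensions": ["csv"], "delimiter": ","}}}
    # "ssv" is a hypothetical name for semicolon-separated files.
    print(add_text_format(base, "ssv", ["ssv"], ";"))
```

The resulting "formats" section would then be pasted into the storage plugin configuration in the Web UI and saved with ‘Update’.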

Now, let’s query a CSV file stored in our HDFS. We have an Olympics dataset in comma-separated format.

You can download the dataset from the link below:

Olympic Data set

Let’s select the first 10 rows of the dataset using the following query:

select * from dfs.`olympix_data.csv` limit 10;

Note: To query a file in HDFS, give its path enclosed in backticks after the plugin name, as follows: dfs.`path of the file in HDFS`
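The same query can also be submitted from a script through the REST interface that Drill serves on the Web UI port. The sketch below is an illustration under a few assumptions: Drill is reachable at http://localhost:8047, the dfs plugin is enabled as described above, and the helper names dfs_table, build_request, and run_query are hypothetical. It builds the backtick-quoted table reference and a POST request to /query.json; nothing is sent unless run_query is called.

```python
import json
import urllib.request


def dfs_table(path):
    """Quote an HDFS path as a Drill table reference: dfs.`path`."""
    if "`" in path:
        raise ValueError("backticks are not allowed in the path")
    return "dfs.`{}`".format(path)


def build_request(sql, base_url="http://localhost:8047"):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, base_url="http://localhost:8047"):
    """Send the query; this needs a running Drill instance."""
    with urllib.request.urlopen(build_request(sql, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # The trailing semicolon used in the Drill shell is omitted here.
    sql = "select * from {} limit 10".format(dfs_table("olympix_data.csv"))
    request = build_request(sql)
    print(request.full_url)
    print(request.data.decode("utf-8"))
```

Calling run_query(sql) with Drill running should return the parsed JSON response for the query.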

In the dataset, the first column is the name of the athlete and the second column is the athlete’s age. Drill exposes the fields of each row through the columns array, which is indexed from 0, so columns[1] is the age. Now, let’s find the maximum age of the athletes who participated in the Olympics using the following query:

select MAX(columns[1]) from dfs.`olympix_data.csv`;

[Image: Querying HDFS using Drill]

In the screenshot above, we can see that the maximum age of the athletes who participated in the Olympics is 61.
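One point to keep in mind with this query: assuming Drill returns each field of a header-less CSV file as text, MAX(columns[1]) compares the ages as strings rather than as numbers. The two orderings agree when all values have the same number of digits, but they can differ otherwise, in which case casting the field, for example CAST(columns[1] AS INT), gives a numeric comparison. The short Python sketch below shows the difference on a made-up list of ages that is not taken from the dataset.

```python
def max_as_text(values):
    """Maximum under string comparison, as when the column stays text."""
    return max(values)


def max_as_number(values):
    """Maximum after converting each value to an integer, like a CAST to INT."""
    return max(int(v) for v in values)


if __name__ == "__main__":
    ages = ["23", "61", "9", "35"]  # hypothetical sample values
    print(max_as_text(ages))    # "9", because "9" sorts after "61" as text
    print(max_as_number(ages))  # 61
```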

We hope this post has been helpful in understanding how to configure Drill to query files in HDFS. In case of any questions, feel free to comment below and we will get back to you at the earliest.

Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.

2 Comments

  1. Hi,
    I have a question about Apache Drill with HDFS (MapR FS).
    1. I have a big file of around 100 GB, and I would like to query that file in chunks using Drill.
    2. My concern is that HDFS uses MapR to reduce the file and save it in different clusters.
    3. When I query that file with a limit of 100000 records and an offset of 1, I get different data on every execution of the query.
    4. If this happens, some data might be lost or left unread, some data will be duplicated, and I cannot even read the entire file.
    5. Could you give me a clear view of how we can use Drill for our requirement, or suggest an alternative?
