In this post, we look at how to query files in HDFS using Apache Drill. We recommend going through our previous post on installing Apache Drill before proceeding with this one.
Note: Drill and Hadoop should already be installed on your system.
Begin your Drill session by moving into the bin folder of your Drill installation directory and running ./drill-embedded.
Next, open your browser and go to localhost:8047 to open Drill’s Web UI. Click on ‘Storage’, enable ‘dfs’, and then click on ‘Update’.
Now, in the configuration, set the connection to the address and port of your HDFS location.
Here, we have configured the HDFS connection as hdfs://localhost:9000. After configuring, click on ‘Update’ and return to the home page.
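For reference, the dfs storage plugin configuration is a JSON document. The sketch below builds a minimal version in Python and prints it, so you can compare it with what the Web UI shows. Only the connection value comes from this post; the workspace entry and the single csv format are assumptions modelled on Drill's default dfs plugin and may differ in your version.

```python
import json

# Minimal sketch of a dfs storage plugin configuration.
# Only "connection" is taken from this post; the other values are assumed defaults.
dfs_config = {
    "type": "file",
    "enabled": True,
    "connection": "hdfs://localhost:9000",  # HDFS address and port used in this post
    "workspaces": {
        "root": {"location": "/", "writable": False, "defaultInputFormat": None},
    },
    "formats": {
        "csv": {"type": "text", "extensions": ["csv"], "delimiter": ","},
    },
}

# Pretty-print the configuration as JSON for comparison with the Web UI.
dfs_config_json = json.dumps(dfs_config, indent=2)
print(dfs_config_json)
```

The printed JSON is only a comparison aid; the Web UI remains the place where the configuration is actually edited and saved.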
Now, open the terminal in which Drill is running and type use dfs; to change the storage location to HDFS.
Now we can query the files in HDFS using Drill.
Drill supports the following file types:
- Plain text files:
  - Comma-separated values (CSV, type: text)
  - Tab-separated values (TSV, type: text)
  - Pipe-separated values (PSV, type: text)
- Structured data files:
  - Avro (type: avro)
  - JSON (type: json)
  - Parquet (type: parquet)
The configuration for these file formats is defined in the storage plugin configuration. By default, the formats listed above are already configured. We can also define our own file format configuration in the HDFS storage plugin.
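As an illustration of a custom format, the sketch below builds one extra entry for the "formats" section of the plugin configuration. The format name "ssv", its extension, and the semicolon delimiter are hypothetical choices for this example; the "type", "extensions", and "delimiter" keys follow the shape of Drill's built-in text formats.

```python
import json


def text_format(extension, delimiter):
    """Return a format entry for a delimited text file type."""
    return {"type": "text", "extensions": [extension], "delimiter": delimiter}


# Hypothetical example: treat files ending in .ssv as semicolon-separated text.
custom_formats = {"ssv": text_format("ssv", ";")}

# Print the entry as JSON so it can be merged into the "formats" section by hand.
print(json.dumps(custom_formats, indent=2))
```

After adding such an entry to the plugin and updating it, files with that extension should be readable in the same way as the built-in text formats.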
Now, let’s query a CSV file present in our HDFS. We have an Olympic dataset with comma-separated values.
You can download the dataset from the link below:
Olympic Data set
Let’s select the first 10 rows of the dataset using the command below:
select * from dfs.`olympix_data.csv` limit 10;
Note: To query files in HDFS, give the table name in the form dfs.`<path of the file in HDFS>`, with the path enclosed in backticks.
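The naming rule in the note can be sketched as a small Python helper that builds the table reference from a file path. The function name dfs_table and the second example path are hypothetical; the helper only formats a string and does not contact Drill.

```python
def dfs_table(path, plugin="dfs", workspace=None):
    """Build a Drill table reference such as dfs.`olympix_data.csv`."""
    if "`" in path:
        # A backtick inside the path would end the quoted identifier early.
        raise ValueError("path must not contain a backtick")
    prefix = plugin if workspace is None else f"{plugin}.{workspace}"
    return f"{prefix}.`{path}`"


# The file queried in this post.
print(dfs_table("olympix_data.csv"))

# Hypothetical absolute path, qualified with an assumed workspace named "root".
print(dfs_table("/data/olympix_data.csv", workspace="root"))
```

The backticks matter because file paths contain characters such as dots and slashes that are not valid in an unquoted identifier.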
In the dataset, the first column is the name of the athlete and the second column is the age of the athlete. Now, let’s find the maximum age of the athletes who participated in the Olympics using the command below. Drill exposes the fields of a text file without a header through the zero-indexed columns array, so the age is columns[1].
select MAX(columns[1]) from dfs.`olympix_data.csv`;
In the above screenshot, we can see that the maximum age of the athletes who participated in the Olympics is 61.
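If you would rather run the same query from a script than from the Drill shell, one option is Drill's REST interface, which listens on the same port as the Web UI. The sketch below is a minimal example under assumptions: the endpoint URL assumes an embedded Drill on localhost:8047 without authentication, and the helper names are hypothetical. Drill reads text file fields as VARCHAR, so the second field is addressed as columns[1] and compared as text unless you add a CAST.

```python
import json
import urllib.request

# Assumed endpoint: Drill's REST query API on the Web UI port used in this post.
DRILL_URL = "http://localhost:8047/query.json"


def build_query_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request that submits a SQL query to Drill."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query and return the "rows" list from Drill's JSON response."""
    with urllib.request.urlopen(build_query_request(sql, url)) as response:
        return json.load(response).get("rows", [])


# columns[1] is the second field (age); the trailing semicolon used in the
# Drill shell is left out here.
MAX_AGE_SQL = "select MAX(columns[1]) from dfs.`olympix_data.csv`"
```

With Drill running and the dfs plugin enabled, calling run_query(MAX_AGE_SQL) should return a one-row list containing the maximum age.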
We hope this post has been helpful in understanding how to configure Drill to query files in HDFS. If you have any questions, feel free to comment below and we will get back to you at the earliest.