Hue is a platform that gives you access to multiple Hadoop tools from a web browser. It is a set of web applications for interacting with a CDH cluster: Hue applications let you browse HDFS and work with Hive and Cloudera Impala queries, MapReduce jobs, and Oozie workflows. In this Hue tutorial, we will look at the features of Cloudera Hue. Cloudera Hue is a handy tool for users who prefer a window-based interface over the terminal, as it provides a good UI for interacting with Hadoop and its sub-projects.
In this blog, we will go through the three most popular tools: Pig, Hive, and Impala. Each one is described as a step-by-step procedure that any beginner can follow. The only prerequisite is a running CDH instance. CDH ships with Hue pre-configured and is free to use for learning purposes.
Once CDH is running, open the browser and enter the following address: localhost:8080. If this does not work, try the IP address of the system instead of localhost. You can find the IP address from the terminal with the command: ifconfig. This tutorial uses port 8080 for Hue; a stock Hue configuration listens on port 8888, so try that port if 8080 does not respond.
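If the page does not load, it helps to check which address actually accepts connections before blaming the browser. The following is a minimal sketch of that check, not part of Hue itself: the function names are our own, and the default port 8080 simply follows this tutorial's setup.

```python
import socket


def hue_url(host="localhost", port=8080):
    """Build the address to type into the browser (port 8080 as used in this tutorial)."""
    return f"http://{host}:{port}/"


def is_reachable(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(hosts, port=8080):
    """Try each candidate host in order and return the first URL that answers, else None."""
    for host in hosts:
        if is_reachable(host, port):
            return hue_url(host, port)
    return None
```

For example, first_reachable(["localhost", "<ip from ifconfig>"]) returns the first address worth trying in the browser, or None if neither answers. A successful TCP connection only shows that some service is listening there, not that it is Hue.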
Let’s start with Pig.
Pig has recently become very popular in the Big Data market, because this scripting language produces results with very little code.
After you enter the Hue UI (refer to the screenshot below), you will find File Browser at the top right.
The page that loads is a graphical representation of HDFS. Click New and create a new directory to serve as your workspace. Then use the same New button to create a file for the dataset, and save it.
Once you are in the file browser, you can also drag and drop a file into the browser (from local storage to HDFS). All files in this file browser are stored in HDFS.
There are also options that make working in HDFS very comfortable, such as Delete (Move to trash), Actions, Search by name, and Upload, all of which take just a few clicks. You no longer need a terminal command for every action inside HDFS.
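For comparison, these File Browser actions correspond roughly to the terminal commands that Hue saves us from typing. The sketch below only builds the command lines and does not run them. The action names, the workspace path /user/cloudera/workspace, and the file name pokemon.csv are placeholders of our own; the hdfs dfs flags themselves are the standard ones.

```python
import shlex

# Hue File Browser action -> flags of the equivalent "hdfs dfs" command.
HDFS_FLAGS = {
    "new_directory": ["-mkdir", "-p"],
    "upload": ["-put"],
    # -rm moves the file to the trash only when HDFS trash is enabled.
    "move_to_trash": ["-rm"],
    "list": ["-ls"],
}


def hdfs_command(action, *paths):
    """Return the argument list of the terminal command for a File Browser action."""
    if action not in HDFS_FLAGS:
        raise ValueError(f"unknown action: {action}")
    return ["hdfs", "dfs", *HDFS_FLAGS[action], *paths]


def as_shell_line(argv):
    """Render an argument list as a single, safely quoted shell line."""
    return " ".join(shlex.quote(part) for part in argv)
```

Calling as_shell_line(hdfs_command("upload", "pokemon.csv", "/user/cloudera/workspace")) gives the line we would otherwise type by hand for an upload.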
In the screenshot below, we have uploaded a dataset named Pokemon, and we will see how to query it.
Once the dataset is in HDFS, we move to the Pig editor to write a Pig script.
At the top left, there is a Query Editors tab. Select Pig inside it, and you will be redirected to the page below. The code you see is the script written to query the Pokemon dataset.
Notice that all the relations are defined together and run in a single execution to give the final result.
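To make the idea of relations chained into one run more concrete, here is a small Python imitation of a typical Pig data flow (LOAD, FILTER, GROUP, then a count per group). It is not the script from the screenshot: the rows, the column layout (name, type, attack), and the threshold are made up for illustration.

```python
from collections import defaultdict

# Made-up sample rows in the form (name, type, attack); not the real dataset.
ROWS = [
    ("Bulbasaur", "Grass", 49),
    ("Charmander", "Fire", 52),
    ("Squirtle", "Water", 48),
    ("Charizard", "Fire", 84),
    ("Ivysaur", "Grass", 62),
]


def load(rows):
    """Stand-in for LOAD: materialise the input as a relation."""
    return list(rows)


def filter_rows(relation, predicate):
    """Stand-in for FILTER ... BY: keep rows matching the predicate."""
    return [row for row in relation if predicate(row)]


def group_by(relation, key):
    """Stand-in for GROUP ... BY: collect rows under their key."""
    groups = defaultdict(list)
    for row in relation:
        groups[key(row)].append(row)
    return dict(groups)


def count_per_group(groups):
    """Stand-in for FOREACH ... GENERATE group, COUNT(...)."""
    return {key: len(rows) for key, rows in groups.items()}


def run_script(rows):
    """Each relation feeds the next; one call produces the final result."""
    pokemon = load(rows)
    strong = filter_rows(pokemon, lambda row: row[2] >= 50)
    by_type = group_by(strong, lambda row: row[1])
    return count_per_group(by_type)
```

As in the Pig editor, every intermediate relation is declared up front, and a single run_script(ROWS) call executes the whole chain.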
Links to the dataset and the executed script, for practice:
Note: Do not forget to save your work. You will see an unsaved label on the right; it disappears once the script is saved.
Refer to the screenshot below to see how to save Pig queries.
To save the script, go to the Editor panel on the left and click Save. A prompt asks for a name. Give the script a proper name and save it.
Later, you can find your saved Pig script on the home page, and you can execute or edit it again. Even if you restart your system, the script will still be there (unlike in the Pig Grunt shell).
Select the Pig job and click the play button at the top right of the script. Once the job starts, the play button turns into a stop button, which gives the same functionality as Ctrl+C in the Grunt shell.
A progress bar shows that the program has started.
Once the job has started, a log appears below the progress bar. It shows the detailed status of the background process for the job: submission to Hadoop, starting and computing MapReduce, waiting for execution to complete (heartbeats), and counters.
Once the job completes, the progress bar turns green. The output of the query normally appears below in the result tab; here I got it in the log tab.
This is how anyone can run a Pig query in Hue.
In the case of Pig, we loaded data into HDFS directly. Hive, however, uses the metastore to access data stored in HDFS, so we need to load our dataset into the metastore.
The metastore describes the data in a tabular format (tables, columns, and where the files sit in HDFS), which Hive can use directly. This makes analysis much faster, because the query-like structure lets tools read data that is already arranged under a predefined schema. Often only part of a table is needed for an analysis, and Hive tables can be partitioned and bucketed so that a query reads just that part. This is another great benefit, and it is widely used across the IT sector for analysis.
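The idea behind partitioning and bucketing can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the sample rows are made up, and zlib.crc32 stands in for Hive's internal hash function, which is different.

```python
import zlib
from collections import defaultdict

# Made-up sample rows; not the real dataset.
ROWS = [
    {"name": "Bulbasaur", "type": "Grass"},
    {"name": "Charmander", "type": "Fire"},
    {"name": "Squirtle", "type": "Water"},
    {"name": "Charizard", "type": "Fire"},
]


def partition_rows(rows, column):
    """Partitioning: one group of rows per distinct value of the column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def bucket_of(value, num_buckets):
    """Bucketing: hash the value and take it modulo the bucket count.

    crc32 is a stand-in here; Hive uses its own hash function.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def bucket_rows(rows, column, num_buckets):
    """Spread rows over a fixed number of buckets by hashing one column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[column], num_buckets)].append(row)
    return buckets
```

A query that only needs Fire rows can read partition_rows(ROWS, "type")["Fire"] and skip everything else, which is the saving that partitioning offers. Bucketing gives a fixed number of roughly even slices, which is useful when a column has too many distinct values to partition on.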
Below are the steps to feed data into the metastore. Refer to the screenshot below.
Open Data Browsers in the top tabs and open Metastore. Fill in the details of the table. Here you can browse to the dataset and select it to supply its complete path.
This step chooses the file.
Note: The dataset must be present in HDFS, because the browse dialog is limited to HDFS.
Once everything is filled in properly, press Next. The next step is choosing the delimiter. From the drop-down box, select the delimiter that separates the fields in the dataset.
We can also preview how the data looks. Press Next.
This is the final step. Check the data type assigned to each column. By default every column is a string; change the data type where needed and press Next. The table is then stored in the metastore.
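The delimiter and data type steps matter more than they may seem. The following Python sketch shows what a wrong delimiter does to the columns and why numeric columns should not stay as strings. The sample text, the pipe delimiter, and the column names are made up for illustration; this is not how Hue processes the file internally.

```python
import csv
import io

# Made-up, pipe-delimited sample with a header row.
SAMPLE = "name|type|attack\nBulbasaur|Grass|49\nCharmander|Fire|52\n"


def read_delimited(text, delimiter):
    """Split the text into rows of column -> value; every value is a string."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


def apply_types(rows, column_types):
    """Convert the listed columns; unlisted columns stay strings (the default)."""
    return [
        {col: column_types.get(col, str)(value) for col, value in row.items()}
        for row in rows
    ]
```

With the right delimiter, read_delimited(SAMPLE, "|") yields three columns per row. With the wrong one, such as a comma, each row collapses into a single column, which is exactly what the preview step lets us catch. After apply_types(rows, {"attack": int}) the attack column can be compared and summed as a number.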
Hive is one of the most widely accepted tools for querying. For Big Data it has an edge over a traditional query tool such as MySQL, because data at that scale is difficult to handle in traditional ways.
On the left-hand side, we find all the tables we have created.
Write your queries and press Execute.
The whole process runs in the same window, and the logs can be viewed at the same time. The result appears in the result tab.
We can save the query at any time with Save as.
Note: The highlighted part shows a few options for exporting the result in different formats. Do check these options; they are a specialty of the Hue interface.
As with Pig, this is an unsaved script that needs to be saved. Script titles may coincide, so give a description too.
We can always find our saved script in Home.
Impala is similar to Hive. Its benefit comes from working with data in a columnar structure, which makes it even faster than Hive.
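Why a columnar structure helps analytic queries can be shown with a toy example. The sketch below is only an illustration of the layout idea and says nothing about how Impala is implemented; the rows and column names are made up.

```python
# Made-up sample rows in the form (name, type, attack).
COLUMN_NAMES = ("name", "type", "attack")
ROWS = [
    ("Bulbasaur", "Grass", 40),
    ("Charmander", "Fire", 50),
    ("Squirtle", "Water", 60),
]


def to_columnar(rows, column_names):
    """Turn row-oriented data into one list of values per column."""
    return {
        name: [row[index] for row in rows]
        for index, name in enumerate(column_names)
    }


def average(values):
    """Aggregate a single column without touching the others."""
    return sum(values) / len(values)
```

In the row layout, computing the average attack means walking through every full row. In the columnar layout, average(to_columnar(ROWS, COLUMN_NAMES)["attack"]) touches only the one column the query needs.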
Impala and Hive share the same metastore, so we do not have to create or load data separately for Impala.
Still, checking and confirming once won’t do any harm. Refer to the screenshot below.
We can see our two tables are present and accessible with Impala. Also, on the left-hand side of the screen the same tables are listed inside the database we are accessing.
Now simply run the query, and we will observe that the execution time is much lower than with Hive, although the results are exactly the same.
We hope you are now able to run the above tools in Hue without any trouble.
Many other tools are also integrated into Hue; we will see how to operate them soon.
For more trending topics on Big Data visit Acadgild.