Read this article to learn how to write a word count program using Hive scripts.
Below is the input dataset on which we are going to perform the word count operation.
This dataset consists of a set of strings whose words are delimited by a space character.
You can refer to the screenshot below to see what the expected output should be. The output lists each word that occurs in the input dataset, with punctuation removed, along with the number of times it occurs.
Let’s begin by creating a table to hold the dataset, as shown below:
Create table WordCount(Sentence string);
The above command creates a table called WordCount with a single field named Sentence of type STRING.
Next, let us load the input dataset into the table as shown below.
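The load step might look like the following minimal sketch. The file path is a placeholder assumed for illustration, not taken from the original, and the LOCAL keyword assumes the file sits on the local file system rather than in HDFS.

```sql
-- Hypothetical path: replace it with the actual location of the input file.
-- Drop the LOCAL keyword if the file is already stored in HDFS.
LOAD DATA LOCAL INPATH '/path/to/input.txt' INTO TABLE WordCount;

-- Quick check that each line of the file became one row.
SELECT * FROM WordCount;
```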
We use the split() function to split each string in the WordCount table on the space character (' ').
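A query for this step could be sketched as follows. It assumes the WordCount table and Sentence column created earlier; the alias Words is a name chosen here for illustration.

```sql
-- split() returns an array<string>, one array per input row.
SELECT split(Sentence, ' ') AS Words
FROM WordCount;
```

At this point each row still holds a whole array of words, which is why the next step is needed.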
Next, let us use the explode() function to place each of the split words on its own row.
explode() takes an array (or a map) as input and returns the elements of the array (or map) as separate rows.
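Combining the two functions might look like this sketch. The alias Word is an illustrative name, not one confirmed by the original.

```sql
-- explode() turns each element of the array produced by split()
-- into a separate output row, so every row now holds a single word.
SELECT explode(split(Sentence, ' ')) AS Word
FROM WordCount;
```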
Now let us use count() together with GROUP BY to group identical words and count the occurrences in each group.
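One way to write this step is to wrap the exploded words in a subquery and aggregate over it. This is a sketch under the same assumptions as before; the aliases Word, Total, and w are illustrative.

```sql
-- The inner query yields one word per row.
-- The outer query groups identical words and counts each group.
SELECT Word, count(1) AS Total
FROM (
  SELECT explode(split(Sentence, ' ')) AS Word
  FROM WordCount
) w
GROUP BY Word;
```

Hive requires an alias on the subquery (w here), even though the outer query never refers to it by name.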
We can see in the above image that the word count operation has been performed successfully on the input dataset, though punctuation characters still exist.
To remove punctuation characters from these words, we can use the regexp_replace() function.
The pattern [^A-Za-z0-9], with an empty string as the replacement, removes every character other than letters and digits, so that words which differ only in punctuation are grouped together.
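The complete query with punctuation stripped could be sketched as below. Note that Hive's built-in function is spelled regexp_replace. The pattern shown keeps only letters and digits, which is an assumption based on the description above, and the aliases RawWord, Word, and Total are illustrative. The filter on empty strings is an addition here, so that tokens made up entirely of punctuation are not counted as a word.

```sql
SELECT Word, count(1) AS Total
FROM (
  -- Strip every character that is not a letter or a digit.
  SELECT regexp_replace(RawWord, '[^A-Za-z0-9]', '') AS Word
  FROM (
    -- One raw, space-delimited token per row.
    SELECT explode(split(Sentence, ' ')) AS RawWord
    FROM WordCount
  ) s
) w
WHERE Word != ''   -- skip tokens that consisted only of punctuation
GROUP BY Word;
```

This version is still case-sensitive, so "Hive" and "hive" would be counted separately unless the words are also passed through lower().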
In the image below, we can see that all punctuation characters have been removed and the word count operation has been performed successfully.
We hope this blog helped you understand how to perform a word count operation using Hive queries.
In case you have any queries, feel free to contact us. Also, keep visiting www.acadgild.com for more updates on the courses.