
How to Perform Word Count using Hive

Read this article to learn how to write a word count program using Hive scripts.
The input dataset for the word count operation consists of a set of strings delimited by the space character.

The expected output is a count for each distinct word in the input dataset, with punctuation removed.

Let’s begin by creating a table to hold the dataset, as shown below:
CREATE TABLE WordCount (Sentence STRING);

The above command creates a table called ‘WordCount’ with a single column named Sentence of type STRING.
Next, let us load the input dataset into the table.
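A minimal sketch of the load step, assuming the dataset is a plain text file on the local file system. The path below is a hypothetical placeholder, not taken from the original post:

```sql
-- Load a local text file into the WordCount table.
-- '/home/user/input.txt' is a hypothetical path; substitute your own file.
-- With the default table format, each line of the file becomes one row
-- in the Sentence column.
LOAD DATA LOCAL INPATH '/home/user/input.txt' INTO TABLE WordCount;

-- Quick check that the rows were loaded.
SELECT * FROM WordCount;
```

If the file already lives in HDFS, dropping the LOCAL keyword makes Hive read the path from HDFS instead.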

We use the split function to split each string in the WordCount table on the space character (‘ ‘). The result is an array of words for each row.
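The split step might look like the following sketch. The column alias `words` is an assumption added for readability:

```sql
-- split(str, pattern) returns an array of strings.
-- Each row of WordCount yields one array holding the words of that sentence.
SELECT split(Sentence, ' ') AS words
FROM WordCount;
```

Note that the second argument of split is treated as a regular expression, so a single space here matches exactly one space character.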

Let us use the explode function to move each of the split words onto its own row.
explode() takes an array (or a map) as input and outputs the elements of the array (or map) as separate rows.
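As a sketch, explode can be wrapped around the split call so that every word becomes a separate row. The alias `word` is an assumption:

```sql
-- explode() turns each element of the array produced by split()
-- into its own output row, giving one word per row.
SELECT explode(split(Sentence, ' ')) AS word
FROM WordCount;
```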


Now let us use count together with GROUP BY to group identical words and count the occurrences in each group.
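One way to write this step is to place the explode query in a subquery and aggregate over it. The aliases `w`, `word`, and `word_count` are assumptions chosen for this sketch:

```sql
-- Inner query: one row per word.
-- Outer query: group identical words and count the rows in each group.
SELECT word, count(1) AS word_count
FROM (
  SELECT explode(split(Sentence, ' ')) AS word
  FROM WordCount
) w
GROUP BY word;
```

At this stage a word followed by punctuation, such as a trailing comma or full stop, is counted separately from the same word without it.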


The word count operation now runs successfully on the input dataset, although punctuation characters still exist in the words.
To remove punctuation characters from these words, we can use the regexp_replace function.

The expression [^A-Za-z0-9] matches every character other than letters and digits. Replacing each match with an empty string ("") strips the punctuation, so words that differ only in punctuation are grouped together.
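Putting the steps together, the full query might look like this sketch. The aliases and the filter on empty strings are assumptions added here and are not part of the original post:

```sql
-- 1. Innermost query: split each sentence on spaces, one word per row.
-- 2. Middle query: strip every character that is not a letter or digit.
-- 3. Outer query: group the cleaned words and count them.
SELECT clean_word, count(1) AS word_count
FROM (
  SELECT regexp_replace(word, '[^A-Za-z0-9]', '') AS clean_word
  FROM (
    SELECT explode(split(Sentence, ' ')) AS word
    FROM WordCount
  ) w
) c
-- Added assumption: skip tokens that become empty after cleaning,
-- for example a token made up only of punctuation.
WHERE clean_word <> ''
GROUP BY clean_word;
```

The comparison is still case sensitive, so "Hive" and "hive" are counted as different words unless the words are also passed through lower().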
The output now has all punctuation characters removed, and the word count operation is performed on the cleaned words.

We hope this blog helped you understand how to perform a word count operation using Hive queries.
If you have any queries, feel free to contact us at [email protected]. Also, keep visiting www.acadgild.com for more updates on the courses.


Manjunath is working with AcadGild as a Big Data Engineer and is a Big Data enthusiast with 2+ years of experience in Hadoop development. He is passionate about coding in Hive, Spark, and Scala. Feel free to contact him at [email protected] for any further queries.
