Word Cloud using Python

In this blog, we will walk through implementing a word cloud in Python. Before that, we need to understand what a word cloud actually is.

A word cloud is a collection of words drawn in different sizes, where the size of each word represents its frequency or importance. Word clouds are widely used for analyzing text from social networking websites. They are a visualization technique for text analysis, and the cloud can also be masked into any shape of our choice.
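
To make the "size represents frequency" idea concrete, here is a minimal sketch that counts words and maps each count to a font size. The function name word_sizes and the linear scaling are assumptions made for illustration; the wordcloud library uses its own, more elaborate sizing and layout logic.

```python
from collections import Counter
import re


def word_sizes(text, min_size=10, max_size=60):
    """Map each word to a font size that grows with its frequency."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    # Linear scaling (an illustrative choice): the most frequent word
    # gets max_size, rarer words get proportionally smaller sizes.
    return {
        word: min_size + (max_size - min_size) * count / top
        for word, count in counts.items()
    }


sizes = word_sizes("cherry oak cherry plum cherry oak")
print(sizes)
```

Here "cherry" appears most often, so it receives the largest size, followed by "oak" and then "plum".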

In this blog, we will use the wine review dataset from Kaggle and create basic word clouds, first from a single text document and then from several.

The dataset is a large collection of wine reviews, and we will build the word clouds from their text.

To generate a word cloud in Python, we will use the following libraries:

  • pandas
  • matplotlib
  • wordcloud

If you don’t have these libraries installed on your system, you can install them by running the commands below in the Anaconda prompt.

pip install pandas

pip install matplotlib

pip install wordcloud

Now let us begin coding in Python.

First, import all the necessary libraries.

import pandas as pd
from wordcloud import WordCloud, STOPWORDS

import matplotlib.pyplot as plt
# Jupyter/IPython only: show plots inline in the notebook
%matplotlib inline

Loading the wine dataset and getting the first few records.

df = pd.read_csv(r"winemag-data-130k-v2.csv", index_col=0)
df.head()

Selecting only the ‘country’, ‘description’, and ‘points’ columns.

df[["country", "description","points"]].head()

To compare groups within a feature, we can use the groupby() function and compute summary statistics for each group. Here we group the data by the country column.

# Group by country
country = df.groupby("country")

# Statistical summary of all countries
country.describe().head()

# Select the top 5 highest average points among all 44 countries
# (numeric_only=True skips the text columns, which cannot be averaged)
country.mean(numeric_only=True).sort_values(by="points", ascending=False).head()
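
For readers new to groupby(), the sketch below shows in plain Python what grouping by country and averaging the points amounts to. The rows are made-up values for illustration, not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (country, points) rows standing in for the real columns
reviews = [
    ("Italy", 87), ("Portugal", 87), ("US", 92),
    ("Italy", 91), ("US", 88), ("Portugal", 89),
]

# Split: collect the points belonging to each country
points_by_country = defaultdict(list)
for country_name, points in reviews:
    points_by_country[country_name].append(points)

# Apply and combine: average each group, then sort highest first
average_points = {name: mean(pts) for name, pts in points_by_country.items()}
ranking = sorted(average_points.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

pandas performs the same split, apply, and combine steps, only faster and for every numeric column at once.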

Now that we are done with the EDA, we can dive into the main topic: the word cloud.

To view the docstring of the WordCloud class, run the command below in a Jupyter notebook (in a plain Python session, help(WordCloud) gives the same information).

?WordCloud

You might have noticed that every argument of the WordCloud object is optional; the only required input is the text, which is passed to generate().

We will now start with a simple example that uses only the description of the first observation as the input for the word cloud. This involves three main steps:

  • Extracting the review i.e., text document
  • Creating and generating a wordcloud image
  • Displaying the cloud using matplotlib

# Start with one review:
text = df.description[0]

# Create and generate a word cloud image:
wordcloud = WordCloud().generate(text)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

The image generated above is the word cloud with default settings. As we can see, words like dried and aromas are the most frequent in the first review.

We can also change arguments such as the maximum font size, the maximum number of words, and the background color with the code below:

wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Lowering max_font_size is not a good idea, as it makes it harder to see the difference between word frequencies, so it is better to keep the default. However, brightening the background made the cloud easier to read.

WordCloud also lets us save the image to disk with the to_file() method. For this, we need to create a folder in the same location as our Python file; the code below uses a folder named img.

# Save the image in the img folder:
wordcloud.to_file("img/first_review.png")
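
Instead of creating the folder by hand, it can be created from Python before saving. This is a small sketch using the standard library; the helper name prepare_output_path is made up for this example.

```python
from pathlib import Path


def prepare_output_path(folder, filename):
    """Create the output folder if it is missing and return the full file path."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return str(out_dir / filename)


output_path = prepare_output_path("img", "first_review.png")
print(output_path)
# With the word cloud from the earlier step available:
# wordcloud.to_file(output_path)
```

This way the save step does not fail just because the img folder has not been created yet.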

Now we will combine all wine reviews into one big text and create a big fat cloud to see which characteristics are most common in these wines.

text = " ".join(review for review in df.description)
# len(text) would count characters, so split on whitespace to count words
print("There are {} words in the combination of all reviews.".format(len(text.split())))
# Create stopword list:
stopwords = set(STOPWORDS)
stopwords.update(["drink", "now", "wine", "flavor", "flavors"])

# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
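
To see what the stopword list does, here is a simplified sketch of filtering words before counting them. The tiny base_stopwords set and the sample review are made up for illustration; the library's STOPWORDS set is much longer, and its own tokenization is more involved than a plain split().

```python
from collections import Counter

# A tiny hand-written stopword set, for illustration only
base_stopwords = {"the", "a", "and", "of", "with", "this", "is"}
# Same custom additions as in the code above
custom_stopwords = base_stopwords | {"drink", "now", "wine", "flavor", "flavors"}

# Made-up review text, not taken from the dataset
review = "This wine is ripe and fruity with flavors of black cherry and a hint of oak"


def count_words(text, stopwords):
    """Count lowercase tokens that are not in the stopword set."""
    tokens = text.lower().split()
    return Counter(token for token in tokens if token not in stopwords)


counts = count_words(review, custom_stopwords)
print(counts)
```

Words such as "wine" and "flavors" would otherwise dominate a cloud built from wine reviews without telling us anything, which is why they are added to the stopword list.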

It seems that black cherry and full-bodied are the most mentioned characteristics in the given dataset. As mentioned in the beginning, we can also mask the word cloud into any shape of our choice.
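
As a rough sketch of masking, the code below builds a circular mask and passes it to WordCloud through its mask argument, where entries equal to 255 (white) are treated as masked out. The helper names circle_mask and circular_word_cloud are made up for this example, and the mask geometry is kept in plain Python so it is easy to inspect.

```python
def circle_mask(size=300, margin=20):
    """Return a size x size grid: 0 inside the circle (drawable), 255 outside (masked out)."""
    center = (size - 1) / 2
    radius = size / 2 - margin
    return [
        [0 if (x - center) ** 2 + (y - center) ** 2 <= radius ** 2 else 255
         for x in range(size)]
        for y in range(size)
    ]


def circular_word_cloud(text, size=300):
    """Generate a word cloud confined to a circle."""
    # Imported here so that circle_mask can be tried without these packages
    import numpy as np
    from wordcloud import WordCloud

    mask = np.array(circle_mask(size), dtype=np.uint8)
    return WordCloud(background_color="white", mask=mask).generate(text)
```

Calling circular_word_cloud(text) returns a word cloud object that can be displayed with plt.imshow() exactly as in the earlier examples. For other shapes, the mask would typically be loaded from an image file instead of being computed.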

This brings us to the end of the blog. We hope you found it helpful. Do leave us a comment with any query or suggestion.

Keep visiting our website for more blogs on Data Science and Data Analytics.

Mitali Singh

Python|| Machine Learning|| Statistics|| Data Science
