Data Analytics with R, Excel & Tableau

Text Mining using R

Introduction

Text Mining is also known as Text Analytics. It is the process of extracting insight and information from a set of text data. Businesses use Text Mining to find relevant information in text-based content such as Word documents, social media posts and emails. Text mining techniques let us highlight the most frequently used keywords in a body of text. A word cloud, also referred to as a text cloud, is a visual representation of text data. Creating word clouds in R takes only a few steps.

The ability to deal with text data is an important skill for a data scientist today. With the rise of review websites, social media, forums and web pages, companies now have access to enormous amounts of text data from their customers.

This data is messy. However, it is a source of information and insights that can help companies boost their business. That is why Text Mining, which is closely related to Natural Language Processing (NLP), is growing rapidly and is widely used by data scientists. The text mining package (tm) and the word cloud package (wordcloud) are available in R to analyze text and quickly visualize the keywords as a word cloud.

Advantages of Text Mining

Text Mining saves time and processes large volumes of text more efficiently than a human reader can.

  • Text mining can help in predictive analytics.
  • Text mining is used to summarize documents and helps to track opinions over time.
  • Text mining techniques are used to analyze problems in different areas of business.
  • It also helps to extract concepts from text and present them in a simpler way.

Text Mining can be used to filter unwanted e-mail based on certain words or phrases, so that such emails automatically go to spam. It can also alert the user so that mails containing such offending words or content can be removed.
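
A keyword filter of this kind can be sketched in a few lines of base R. The word list and the sample messages below are made up for illustration, and real spam filters rely on far richer signals than a fixed word list.

```r
# Hypothetical list of offending words (an assumption for this sketch)
spam_words <- c("lottery", "winner", "free money")

# Made-up sample messages
emails <- c("You are a WINNER of our lottery!",
            "Meeting moved to 3 pm tomorrow",
            "Claim your free money now")

# Build one case-insensitive pattern and flag every message that matches it
pattern <- paste(spam_words, collapse = "|")
is_spam <- grepl(pattern, emails, ignore.case = TRUE)

# Show each message next to its flag
data.frame(email = emails, is_spam = is_spam)
```

Messages flagged TRUE would be the candidates to move to the spam folder.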

How does Text Mining work?

Text mining helps us understand large amounts of text. It converts words from unstructured data into numerical values, which makes it possible to find patterns and relationships that exist in a large chunk of text. Text mining generally uses machine learning algorithms to read and analyze text data.
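
To make the idea of turning words into numbers concrete, here is a minimal base R sketch that counts how often each word occurs in each document. The two documents are invented for the example, and the tm package offers more complete tools for the same job.

```r
# Two made-up "documents" (assumptions for this sketch)
docs <- c(doc1 = "data is the new oil", doc2 = "data beats opinion")

# Split each document into lower-case words
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)

# Collect the vocabulary, then count every vocabulary word per document
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(words) table(factor(words, levels = vocab))))

# One row per document, one column per word
dtm
```

Each row is now a numeric description of one document, which is the form that pattern-finding algorithms work with.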

Without text mining, it is difficult to understand a large body of text easily and quickly. The steps in the text mining process and in making word clouds are listed below.

Let us see an example of how text mining actually works and how to create word clouds with R.

The 5 main steps for text mining, cleaning text and creating word clouds in R

Step 1: Create a text file

In the following examples, I’ll process my Acadgild article on Artificial Intelligence, available in .txt format at https://acadgild.com/artificial-intelligence.txt. You can use any text you want.

The location of this file is stored in filePath in Step 3.

Step 2 : Install and load the required packages

Type the R code below to install the required packages:

install.packages("NLP")
install.packages("tm")
install.packages("RColorBrewer")
install.packages("wordcloud")
install.packages("wordcloud2")

To load these packages, use the library() function:

library(package_name)

To learn more about the above packages:

help(package = "package_name")
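
If some of these packages are already installed, it can be convenient to check first. The sketch below is one possible way to do that; find_missing() is a hypothetical helper written for this example and is not part of any package.

```r
# Packages used in this article
pkgs <- c("NLP", "tm", "RColorBrewer", "wordcloud", "wordcloud2")

# Hypothetical helper: return the packages that are not installed yet
find_missing <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

missing_pkgs <- find_missing(pkgs)
print(missing_pkgs)

# To install only the missing ones, run:
# install.packages(missing_pkgs)
```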

Step 3 : Text Mining

“Text Mining is a technique that boosts the research process and helps to test the queries.”

To import the file, type the following R code.

filePath <- "https://acadgild.com/artificial-intelligence.txt"

Read the lines of the file with the readLines() function and store the text data in text_file.

text_file <- readLines(filePath)

Let’s look at the first few lines of text_file using the head() function.

head(text_file)

Here we can see the first few lines of our text file. 
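
To try readLines() and head() without downloading anything, the following sketch reads from an in-memory text connection. The three lines of text are made up and simply stand in for the article file.

```r
# Made-up lines standing in for the downloaded file (an assumption)
con <- textConnection(c("First line", "Second line", "Third line"))
sample_lines <- readLines(con)
close(con)

# readLines() returns a character vector with one element per line
length(sample_lines)

# head() with n = 2 shows only the first two lines
head(sample_lines, 2)
```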

Now we use the paste() function with collapse = " " to join all the lines of text_file into a single string, with the lines separated by a space. The result is stored in text_file1.

Here is a very small example, because the real text file is too large to show here.

Example: "hello" "world" becomes "hello world"

text_file1 <- paste(text_file, collapse = " ")
head(text_file1)

We use the head() function to look at the modified text_file1. Because text_file1 is now a single string, head() prints the entire text.

As in the small example, the entire text now appears inside a single pair of quotation marks in the console.
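
The small "hello world" example can be run directly. The variable names below are chosen for this sketch only.

```r
# A character vector with two elements, like two lines of a file
lines_vec <- c("hello", "world")
length(lines_vec)

# collapse = " " joins all elements into one string, separated by a space
one_string <- paste(lines_vec, collapse = " ")
length(one_string)
one_string
```

The length drops from two elements to one, which is why head() on the collapsed text prints everything at once.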

Step 4: Cleaning the text

In this step, we convert the text to lower case, remove punctuation and digits, remove common stopwords such as "the" and "we", and remove unnecessary white space.

Let us convert text_file1 to lower case using the tolower() function and assign the result to clean_text.

#clean_text-data
clean_text <- tolower(text_file1)
head(clean_text)

We can see the contents of clean_text by using the head() function.

At every step, you modify your text data and use the result in the next step of text manipulation.

The tm package also provides the removePunctuation() and removeNumbers() functions for removing punctuation and digits.

In the code below, we remove punctuation with the gsub() function instead.

Here, pattern = "\\W" matches every non-word character, which includes punctuation.

replacement = " " replaces each punctuation mark with a space. If we replaced it with nothing, neighbouring words could be joined into new words.

#Remove punctuations

clean_text1 <- gsub(pattern = "\\W", replacement = " ", clean_text)
head(clean_text1)

In the console output, you can see that no punctuation remains and the words are separated by spaces.
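
The following sketch, on a made-up string, shows why the punctuation is replaced by a space rather than by nothing.

```r
# Made-up text with punctuation between words (an assumption)
x <- "data-driven,well-known"

# Replacing with nothing glues neighbouring words together
glued <- gsub(pattern = "\\W", replacement = "", x)

# Replacing with a space keeps the words separate
spaced <- gsub(pattern = "\\W", replacement = " ", x)

glued
spaced
```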

If digits are present in your text file, you can easily remove them with the gsub() function. This step may not be required for every text.

Here, "\\d" matches digits.

#remove digits
clean_text2 <- gsub(pattern = "\\d", replacement = " ", clean_text1)
head(clean_text2)

The information value of stopwords is near zero because they are so common in a language. Removing these words is helpful before further analysis.

#clean the stop words
#load the required packages
library(NLP)
library(tm)

Let’s see a preview of the stopwords using the stopwords() function.

stopwords()

In the console output, we can see the list of stopwords.

Let us remove those stopwords and other unnecessary words by using the removeWords() function.

#Remove stop words

clean_text3 <- removeWords(clean_text2,words = c(stopwords(),"ai","â"))
head(clean_text3)
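
The idea behind stop word removal can also be sketched in base R without the tm package. The stop word list and the sentence below are made up for the example, and the list returned by stopwords() is much longer.

```r
# A tiny hand-made stop word list (an assumption for this sketch)
my_stopwords <- c("the", "is", "a", "of")

# Split a made-up sentence into words
words <- strsplit("the future of ai is a question of data", " ", fixed = TRUE)[[1]]

# Keep only the words that are not in the stop word list
kept <- words[!words %in% my_stopwords]
result <- paste(kept, collapse = " ")
result
```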

Now let us remove single letters with the gsub() function in the code below.

#remove single letters
clean_text4 <- gsub(pattern = "\\b[a-z]\\b", replacement = " ", clean_text3)
head(clean_text4)

Here, [a-z] matches any single lower-case letter, which is sufficient because the text has already been converted to lower case. The \\b on each side marks a word boundary, so only letters that stand alone as one-character words are matched.

In the console output, the single letters have been removed.
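
A quick check on a made-up sentence shows what a word-boundary pattern of this kind removes. The second part illustrates, as a side note, why a range such as [A-z] is best avoided: with perl = TRUE it also matches characters like the underscore that sit between "Z" and "a" in ASCII.

```r
# Made-up sentence containing the one-letter words "b" and "a"
x <- "plan b is a good idea"

# Letters standing alone between word boundaries are replaced by a space
res <- gsub(pattern = "\\b[a-z]\\b", replacement = " ", x)

# Collapse the leftover runs of spaces to inspect the remaining words
remaining <- strsplit(trimws(gsub(" +", " ", res)), " ")[[1]]
remaining

# Side note: the range A-z is wider than the letters alone
underscore_matches <- grepl("^[A-z]$", "_", perl = TRUE)
underscore_matches
```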

Finally, we can remove extra white space using the stripWhitespace() function, which is part of the tm library.

#remove white spaces
clean_text5 <- stripWhitespace(clean_text4)
head(clean_text5)

Frequency of words:

We have already joined the lines into a single chunk of text, and now we want to count the words.

So first we split the chunk into individual words at each space using the strsplit() function.

#splitwords
clean_text6 <- strsplit(clean_text5, " ")
head(clean_text6)

By using the head() function, we can see the split words in the console. Note that strsplit() returns a list, here with a single element that holds all the words.

Now create the word_freq table from clean_text6 using the table() function.

#frequency of words
word_freq <- table(clean_text6)
head(word_freq)

In the console output, we can see the first few words together with the number of times each one is repeated in the article.

We then use cbind() to combine the words and their counts column by column.

word_freq1 <- cbind(names(word_freq), as.integer(word_freq))
head(word_freq1)

The head() function shows the first six rows by default.

In the console output, the first six rows are printed, showing each word with the number of times it is repeated.
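
Since cbind() on a character vector and an integer vector produces a character matrix, a data frame sorted by frequency is often easier to read. The sketch below shows one way to build it; the words are made up and the variable names are chosen for this example.

```r
# Made-up words standing in for the split article text (an assumption)
words <- c("data", "human", "data", "will", "data", "human")

# Count the words and sort the counts from most to least frequent
freq <- sort(table(words), decreasing = TRUE)

# Store words and counts in a data frame so the counts stay numeric
freq_df <- data.frame(word = names(freq),
                      freq = as.integer(freq),
                      stringsAsFactors = FALSE)
head(freq_df, 3)
```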

Step 5 : Generate the Word cloud

Word frequencies can be illustrated as a word cloud for the following reasons:

  • Word clouds add clarity and simplicity.
  • The most used keywords stand out better in a word cloud.
  • Word clouds are a dynamic tool for communication. They are easy to understand, easy to share and make an impressive representation of words.

Load the libraries required for making the word cloud.

library(RColorBrewer)
library(wordcloud)
class(clean_text6)
word_cloud <- unlist(clean_text6)

Arguments of the word cloud generator function:

  • words: the words to be plotted, i.e. word_cloud, where we have saved the text data.
  • freq: word frequencies
  • min.freq: words with a frequency below min.freq will not be plotted.
  • max.words: maximum number of words to be plotted
  • random.order: plot words in random order. If FALSE, words are plotted in decreasing frequency.
  • rot.per: proportion of words with 90-degree rotation (vertical text)
  • brewer.pal: a function from RColorBrewer that creates color palettes. Type ?brewer.pal to see its documentation in R.
  • colors: color words from least to most frequent. Use, for example, colors = "red" for a single color or a palette from brewer.pal(). The values "random-dark" and "random-light" are options of the color argument of wordcloud2().
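
As an illustration of these arguments, the sketch below passes the words and their frequencies explicitly. The words are made up, and the plot is drawn only if the wordcloud and RColorBrewer packages are installed.

```r
# Made-up words standing in for the cleaned article text (an assumption)
sample_words <- c("data", "human", "data", "will", "data", "human", "intelligence")
sample_freq <- table(sample_words)

# Draw the cloud only if both packages are available
if (requireNamespace("wordcloud", quietly = TRUE) &&
    requireNamespace("RColorBrewer", quietly = TRUE)) {
  set.seed(1234)  # fixes the random layout so the plot is reproducible
  wordcloud::wordcloud(words = names(sample_freq),
                       freq = as.integer(sample_freq),
                       min.freq = 1,
                       max.words = 100,
                       random.order = FALSE,
                       rot.per = 0.2,
                       colors = RColorBrewer::brewer.pal(5, "Dark2"))
}
```

Calling set.seed() before plotting is optional, but it makes the layout repeatable from one run to the next.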

Follow the code below to create word clouds:

wordcloud(word_cloud)
wordcloud(word_cloud,min.freq = 5 , random.order = FALSE, scale=c(3, 0.5))
wordcloud(word_cloud,min.freq = 3, max.words=1000, random.order=F, rot.per=0.2, colors=brewer.pal(5, "Dark2"), scale=c(4,0.2))

library(wordcloud2)
wordcloud2(word_freq)
wordcloud2(word_freq, color = "random-light", backgroundColor = "white")
#See http://rpubs.com/badz11/513580
wordcloud2(word_freq, color = "random-dark", backgroundColor = "white", size = 0.5, shape = "triangle")
#See http://rpubs.com/badz11/513587
wordcloud2(word_freq, minRotation = -pi/20, maxRotation = -pi/20, minSize = 10, rotateRatio = 1, color = "random-dark", backgroundColor = "white")
#See http://rpubs.com/badz11/513576

The above word cloud clearly shows that “will”, “artificial”, “data”, “human” and “intelligence” are the five most frequent words in the “Artificial Intelligence” article.

Suggested readings

R vs Python combat

https://acadgild.com/blog/r-vs-python-combat

Badal Kumar

Data Analyst at Aeon Learning
