In one of our previous blogs, we discussed natural language processing (NLP) (https://acadgild.com/blog/introduction-to-natural-language-processing), its real-world applications, and the processes involved in data cleaning for NLP.
In this blog, we will walk through each of these data cleaning processes with an example.
Before proceeding further, we need to import NLTK, a popular library used in NLP. NLTK, which stands for Natural Language Toolkit, is one of the most widely used and beginner-friendly NLP libraries written in Python, and it has a large community behind it.
Let us see how to install and import NLTK in Python.
We can install the library using pip:
pip install nltk
To check whether it has been installed, import the library in a Python terminal:
import nltk
If the import runs without an error, the library has been installed successfully.
After installing NLTK, we need to install the NLTK data packages by running the code below:
nltk.download()
Executing this opens an ‘NLTK Downloader’ window that lists the packages available for installation.
You can either select all the packages and install them at once, or install only the packages you need.
Now that NLTK is installed and imported, we will look at some of the data cleaning processes involved in NLP using an example.
Data Cleaning in NLP
- Removal of punctuation: punctuation marks usually add little information to the data, so we remove them. We can do this with the code below.
Suppose we have the following sentence:
sent = "#This is a blog on - Natural Language Processing. Here, we are discussing the data cleaning processes involved in NLP"
We will perform the data cleaning on the above data:
# importing the Python string library
import string
# keeping every character that is not a punctuation mark
nopunc = [char for char in sent if char not in string.punctuation]
# joining the individual characters back into a single string
nopunc = ''.join(nopunc)
nopunc
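After removing the punctuation, the data looks like this:
'This is a blog on  Natural Language Processing Here we are discussing the data cleaning processes involved in NLP'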
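The same step can also be written with str.translate, which deletes every punctuation character in a single pass. The sketch below is an alternative under that assumption, not the approach used above. The helper name strip_punctuation is made up for illustration, and the whitespace clean-up at the end is an extra step added here so that removing the standalone hyphen does not leave a double space.

```python
import string

# Translation table that maps every ASCII punctuation character to None (deleted).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation, then collapse runs of whitespace."""
    no_punct = text.translate(_PUNCT_TABLE)
    # split() with no arguments splits on any run of whitespace, so joining
    # the pieces with single spaces removes the leftover double space.
    return " ".join(no_punct.split())


sent = "#This is a blog on - Natural Language Processing. Here, we are discussing the data cleaning processes involved in NLP"
cleaned = strip_punctuation(sent)
print(cleaned)
```

Note that string.punctuation covers ASCII punctuation only, so characters such as curly quotes would survive either version of this step.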
- Tokenization: the process of breaking up a string into individual tokens, such as words.
Tokenization can be done in two ways, with plain Python or with the NLTK library, as shown below:
# using the built-in Python string method
sent.split()

# using the NLTK library
from nltk.tokenize import word_tokenize
tokens = word_tokenize(sent)
tokens
From the two outputs above, we can see that both functions perform similar operations. However, word_tokenize separates punctuation marks into their own tokens in addition to splitting on whitespace, while split() splits on whitespace only.
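To make the difference concrete without relying on downloaded NLTK data, the sketch below contrasts split() with a simple regular expression that puts each punctuation mark in its own token. The regular expression is only a rough stand-in chosen for illustration. It is not the algorithm word_tokenize uses, and the two will differ on cases such as contractions.

```python
import re

sent = "#This is a blog on - Natural Language Processing. Here, we are discussing the data cleaning processes involved in NLP"

# Whitespace-only splitting: punctuation stays attached to the neighbouring word.
whitespace_tokens = sent.split()

# Rough approximation of punctuation-aware tokenization (an assumption for
# illustration): a token is either a run of word characters or a single
# character that is neither a word character nor whitespace.
regex_tokens = re.findall(r"\w+|[^\w\s]", sent)

print(whitespace_tokens[8:10])  # ['Processing.', 'Here,']
print(regex_tokens[9:13])       # ['Processing', '.', 'Here', ',']
```

With whitespace splitting, ‘Processing.’ and ‘Processing’ would count as two different tokens, which is usually not what we want when counting or comparing words.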
- Removal of stopwords: stopwords are common words such as ‘a’, ‘an’, ‘the’, ‘this’, etc., which again don’t add much meaning to the data and are usually removed. This reduces computation time and also frees up storage space.
While removing stop words, we also split the data into words and compare the lowercase form of each word against the stop-word list. The words that are kept retain their original case.
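from nltk.corpus import stopwords
# building the stop-word set once keeps each lookup fast
stop_words = set(stopwords.words('english'))
# keeping the words whose lowercase form is not a stop word
no_stop = [word for word in nopunc.split() if word.lower() not in stop_words]
no_stop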
After removing the stop words and punctuation, we get the following output:
['blog', 'Natural', 'Language', 'Processing', 'discussing', 'data', 'cleaning', 'processes', 'involved', 'NLP']
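The filtering logic itself does not depend on NLTK. As a minimal sketch, the example below uses a small hand-written stop-word set (SAMPLE_STOP_WORDS, a made-up subset for illustration that is far shorter than the NLTK list) to show that the lowercase form is used only for the comparison, so the surviving words keep their original capitalization. The function name remove_stop_words is also hypothetical.

```python
# Hypothetical, deliberately tiny stop-word set used only for this sketch.
SAMPLE_STOP_WORDS = {"a", "an", "the", "this", "is", "on", "here", "we", "are", "in"}


def remove_stop_words(text, stop_words):
    """Return the words of `text` whose lowercase form is not in `stop_words`."""
    return [word for word in text.split() if word.lower() not in stop_words]


# The sentence from the example after punctuation removal.
nopunc = "This is a blog on  Natural Language Processing Here we are discussing the data cleaning processes involved in NLP"
kept = remove_stop_words(nopunc, SAMPLE_STOP_WORDS)
print(kept)
```

Because the stop words are stored in a set, each membership check takes roughly constant time, which matters once the text grows beyond a single sentence.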
- Stemming: the process of reducing a word to its root or base form by stripping prefixes or suffixes. The resulting stem may or may not be a meaningful word.
We perform stemming in Python using the code below:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in no_stop]
stemmed
From the above output, we can see that after stemming some of the words retain their meaning, like ‘process’ and ‘discuss’, while others have lost their meaning, like ‘natur’ and ‘involv’. Therefore, stemming is not always a useful process.
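To see why a stem can end up as a non-word, it helps to look at what suffix stripping does mechanically. The sketch below is a deliberately naive suffix stripper written for illustration. It is not the Porter algorithm, and the suffix list, the minimum stem length, and the name naive_stem are all assumptions chosen to fit this example.

```python
# Suffixes tried in order; the first one that matches is removed.
# This list is an assumption made for illustration only.
SUFFIXES = ("ing", "ed", "es", "al", "s")
MIN_STEM_LENGTH = 3  # avoid reducing very short words to fragments


def naive_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


words = ["Natural", "Processing", "discussing", "processes", "involved", "data"]
stems = [naive_stem(w) for w in words]
print(stems)  # ['natur', 'process', 'discuss', 'process', 'involv', 'data']
```

The stripper never checks whether the result is a real word, which is how ‘natur’ and ‘involv’ arise. A real stemmer such as Porter applies a much larger set of ordered rules with conditions attached, but it works on the same principle.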
- Lemmatization: a process similar to stemming in that it converts a word into its root or base form, but unlike stemming it returns a proper word (the lemma) that can be found in a dictionary.
Let us see how we perform lemmatization in Python:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lem_text = [lemmatizer.lemmatize(word, 'v') for word in no_stop]
lem_text
In the Python code, the second argument, ‘v’, tells the lemmatizer which part of speech to treat each word as. Here, ‘v’ stands for ‘verb’.
Now, when we compare the outputs of stemming and lemmatization, we can see that stemming converted the words to lowercase while lemmatization did not. This is the reason the capitalized words were not converted into their lemmas during lemmatization, i.e., ‘Processing’ remained as ‘Processing’.
This could be one of the disadvantages of lemmatization, although we can explicitly convert the words to lowercase beforehand. Apart from this, all the other words were converted into their meaningful base forms.
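Lowercasing before lemmatizing can be wrapped in a small helper. In the sketch below, lemmatize_lowercased is a hypothetical helper that accepts any function mapping a word to its lemma. To keep the example runnable without WordNet data, it is demonstrated with toy_lemmatize, a made-up, case-sensitive lookup table that stands in for a real lemmatizer.

```python
from typing import Callable, Iterable, List


def lemmatize_lowercased(words: Iterable[str], lemmatize: Callable[[str], str]) -> List[str]:
    """Lowercase each word before passing it to the given lemmatize function."""
    return [lemmatize(word.lower()) for word in words]


# Toy stand-in for a real lemmatizer (an assumption for illustration):
# a case-sensitive lookup that returns the word unchanged when it is unknown.
TOY_LEMMAS = {"processing": "process", "discussing": "discuss", "involved": "involve"}


def toy_lemmatize(word: str) -> str:
    return TOY_LEMMAS.get(word, word)


words = ["Processing", "discussing", "involved", "NLP"]
without_lowercasing = [toy_lemmatize(w) for w in words]
with_lowercasing = lemmatize_lowercased(words, toy_lemmatize)

print(without_lowercasing)  # ['Processing', 'discuss', 'involve', 'NLP']
print(with_lowercasing)     # ['process', 'discuss', 'involve', 'nlp']
```

With NLTK, the same helper could be given a function such as lambda word: lemmatizer.lemmatize(word, 'v'), using a WordNetLemmatizer object. Note that lowercasing also turns acronyms such as ‘NLP’ into ‘nlp’, which may or may not be desirable depending on the use case.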
Hence, we cannot say that one of these two processes is simply better than the other. The choice of data cleaning processes in NLP depends heavily on the use case, and a wrong choice may lead to ambiguity in the data.
Thank you for reading this far. In case of any query or suggestion do leave us a comment. In our next blog, we will be executing a real-time use case using NLP.
Sentiment analysis and building a recommender system with clustering are just some of the endless possibilities in NLP.
Keep visiting our website for more blogs on Data Science and Data Analytics.