Data Science and Artificial Intelligence

Introduction to Natural Language Processing

NLP is the area of computer science and artificial intelligence concerned with the interaction between machines and human (natural) language. It applies programming and mathematics to language-based tasks, and it is how machine learning models are applied to text. The aim of NLP is to teach machines to understand what is said in spoken and written language.

For instance, Apple and Android users will be familiar with the virtual assistants Siri and Google Assistant. These are applications that take whatever we say and turn it into something meaningful that can be acted on programmatically.


Applications of NLP

There are several applications of NLP that we use in our day-to-day lives. Some of them are listed below:

  • Voice Recognition: NLP plays a significant role in voice recognition: software that understands spoken human language, converts it into a machine-understandable form, and then responds to the user.

Virtual assistants are one such example of speech recognition: Android and iOS users will be familiar with Google Assistant and Siri.

  • Sentiment Analysis: this measures whether a text has a positive or negative polarity. Using sentiment analysis, we can classify a user's review as positive or negative.
  • Machine Translation: another very helpful application we owe to NLP. A machine translation model translates information from one language to another without any human intervention. Google Translate is one well-known application that uses such a model.
  • Spam or Ham: another application of NLP, in which a model tells whether an email is spam or not.
  • Part-of-Speech Tagging: an application of NLP in which each word in a sentence is labelled with its part of speech, such as noun, pronoun, verb, or adverb.

In the near future, we can hope to see even more impressive applications.

Approaches in NLP

There are several approaches used in NLP for data cleaning. These include:

  • Removal of punctuation: punctuation marks such as “. , # @ !” don’t add much meaning to the data and inflate the word count. Removing punctuation is therefore the first step of data cleaning in NLP.
  • Tokenization: the process of splitting a sentence into individual words (tokens). As part of data cleaning, each token is typically converted to lowercase to keep the data consistent.

However, whether to lowercase or preserve capitalization depends on the application we are working on.

E.g., if we have the text “This is a blog on Natural Language Processing”,

after tokenization (and lowercasing) it becomes → “this”, “is”, “a”, “blog”, “on”, “natural”, “language”, “processing”
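As a minimal sketch in Python, assuming only the standard library, punctuation removal, lowercasing, and tokenization could look like this:

import string

text = "This is a blog on Natural Language Processing"

# Remove punctuation, lowercase for consistency, then split into tokens
cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
tokens = cleaned.split()

print(tokens)
# ['this', 'is', 'a', 'blog', 'on', 'natural', 'language', 'processing']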

  • Removal of stopwords: stopwords are common words such as articles, pronouns, and prepositions (“a”, “and”, “the”, “is”, “for”, “on”, etc.). They carry little information for most NLP objectives, so they are filtered out and excluded from the text before processing, which reduces storage requirements and improves processing time. A minimal sketch is shown below.
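The sketch below assumes NLTK’s English stopword list, which requires a one-time download:

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # one-time download of the stopword list

tokens = ["this", "is", "a", "blog", "on", "natural", "language", "processing"]
stop_words = set(stopwords.words("english"))

# Keep only the tokens that are not stopwords
filtered = [t for t in tokens if t not in stop_words]

print(filtered)
# ['blog', 'natural', 'language', 'processing']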
  • Stemming: the process of reducing a word to its base/root form by chopping off common prefixes or suffixes. This is another data-cleaning step. It can be helpful, but not always: stemming sometimes cuts a word back so far that it loses its actual meaning.

E.g.: Playing → Play, Caring → Car, News → New

As we can see in the above example, ‘Playing’ was correctly reduced to ‘Play’, but ‘Caring’ and ‘News’ became ‘Car’ and ‘New’, which is wrong because the words lost their actual meaning.
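As a sketch, we can try this with NLTK’s PorterStemmer. Note that Porter is slightly gentler than the illustration above: it returns “care” rather than “car” for “caring”, while more aggressive stemmers (such as NLTK’s LancasterStemmer) do cut all the way down to “car”.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["playing", "caring", "news"]:
    print(word, "->", stemmer.stem(word))

# playing -> play
# caring -> care  (a harsher stemmer would give 'car')
# news -> new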

  • Lemmatization: an approach similar to stemming, but one that also involves vocabulary and morphological analysis. Lemmatization maps words to the base form of the word, known as the lemma. Unlike stemming, it always returns a proper word that can be found in the dictionary. Both processes have their respective place depending on the data.

E.g.: Running, Ran, Runs → Run

In the above example, we can see that ‘running’, ‘ran’, and ‘runs’ are all forms of the word ‘run’, so ‘run’ is the lemma of all of them.

As we might have guessed, lemmatization is generally more useful than stemming. However, since it requires more knowledge of the language, it also demands more computational power.
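A minimal sketch using NLTK’s WordNetLemmatizer (this requires a one-time download of the WordNet data). Passing pos="v" tells the lemmatizer to treat each word as a verb, which is what lets it map the irregular form “ran” back to “run”:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()

for word in ["running", "ran", "runs"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))

# running -> run
# ran -> run
# runs -> run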

  • Bag of Words: after cleaning the data, we can create a bag-of-words model, a simplifying representation used in Natural Language Processing. In this model, a text is represented as the bag of its words, and the occurrence count of each word is used as a feature. We take the tokenized words for each observation and count the frequency of each token.

Let’s take an example to understand this concept in depth.

Below is a snippet of the first few lines of text from the book “A Tale of Two Cities” by Charles Dickens,

“It was the best of times”,

“it was the worst of times”,

“it was the age of wisdom”,

“it was the age of foolishness”

We will treat each line as a separate document and make a list of all the unique words across these documents, excluding punctuation. We get:

‘It’, ‘was’, ‘the’, ‘best’, ‘of’, ‘times’, ‘worst’, ‘age’, ‘wisdom’, ‘foolishness’

Now, the next step is to create vectors. Converting text into vectors gives us a numerical representation that machine learning algorithms can work with.

Considering the first document, we count how often each of the unique words above appears:

“it” = 1

“was” = 1

“the” = 1

“best” = 1

“of” = 1

“times” = 1

“worst” = 0

“age” = 0

“wisdom” = 0

“foolishness” = 0

The rest of the documents look as follows:

“It was the best of times” = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

“It was the worst of times” = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

“It was the age of wisdom” = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]

“It was the age of foolishness” = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

This process of vectorizing is known as Bag of Words.
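As a minimal sketch, assuming only the Python standard library, the whole pipeline above (tokenize, build a vocabulary in order of first appearance, then count) can be written as:

import string

docs = [
    "It was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

def tokenize(text):
    # Lowercase and strip punctuation, then split on whitespace
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Build the vocabulary in order of first appearance
vocab = []
for doc in docs:
    for token in tokenize(doc):
        if token not in vocab:
            vocab.append(token)

# One count vector per document
vectors = [[tokenize(doc).count(word) for word in vocab] for doc in docs]

for doc, vec in zip(docs, vectors):
    print(doc, "=", vec)

# "It was the best of times" = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# ...

In practice, a library such as scikit-learn’s CountVectorizer does the same job; note that it orders its vocabulary alphabetically rather than by first appearance, so its columns will be permuted relative to the vectors above.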

So far, we have introduced some of the processes that take place during data cleaning in NLP, as well as the main applications of NLP. This brings us to the end of this blog. In our next blog, we will discuss each of these data-cleaning processes in detail with the help of examples.

Do leave a comment with any queries or suggestions. Keep visiting our website for more blogs on Data Science and Data Analytics.

Mitali Singh

Python|| Machine Learning|| Statistics|| Data Science

