
Introduction to Natural Language Processing

NLP, short for Natural Language Processing, is one of the major areas of Data Science. Whenever you send or receive any sort of data in text form, whether it is an email, a message, or anything else, think about how you would apply machine learning to it. A lot of applications use NLP, such as Apple Siri, Microsoft Cortana, and Amazon Alexa.

In this blog we are going to cover:

  • Usage of NLP
  • Text Processing Approach
  • Converting a text document to vector form
  • Implementation using Python and NLTK

Usage of NLP

Let's understand the use of NLP with the help of a few examples:

  • Gmail (Spam Classification)

When you receive an email in Gmail, how does Gmail decide whether it goes to your inbox or to the spam folder?

  • Sentiment Analysis

When you watch a movie and review it on a website, how does a machine learning model predict whether the review is positive or negative? This kind of feature is used on online shopping websites and the Google Play Store: when we write a review for a product or an app, it is classified as positive or negative based on the words used in the review.

Similarly, Gmail keeps track of what sort of text and words are used in each mail.

  • Chatbots

Chatbots have replaced a lot of jobs and introduced a new way of handling customers. When you call a customer care center, there is a good chance that you will be talking to a chatbot rather than a human. With the help of chatbots, AI has become more powerful. Earlier it was a human's job to listen to complaints or queries and then process them; now the same task is done by bots. Companies like Google, Apple, Microsoft, and Facebook use these kinds of chatbots.

So, these were a few examples of how companies use NLP to process text data and apply machine learning to it. Next, we need to understand how to make our machines understand text by converting it into vector form.

Text Processing Approach


The approach to working with text data is a pipeline: tokenization, removing stopwords, stemming/lemmatization, and finally vectorization. Once we understand this process, we can convert text data into vectorized form. Let's understand each step one by one.

Tokenization

Tokenization means splitting your text data into individual tokens like this:

Sentence = “Hello John, how are you? let’s meet tomorrow for a movie, I bought two tickets”

Tokens = [“hello”, ”john”, ”how”, ”are”, “you”, “?”, “let’s”, “meet”, “tomorrow”, “for”, “a”, “movie”, “I”, “bought”, “two”, “tickets”]
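For instance, a minimal sketch of this step with NLTK's word_tokenize (note that the real tokenizer also splits punctuation and contractions, so its output differs slightly from the hand-written list above):

    import nltk
    nltk.download("punkt")                 # tokenizer models
    from nltk.tokenize import word_tokenize

    sentence = "Hello John, how are you? let's meet tomorrow for a movie, I bought two tickets"
    tokens = word_tokenize(sentence.lower())
    print(tokens)
    # ['hello', 'john', ',', 'how', 'are', 'you', '?', 'let', "'s", 'meet', 'tomorrow', ...]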

Removing Stopwords

After splitting your data into tokens, we need to remove words like is, am, are, the, that, and so on. These words are known as stopwords. We also need to remove punctuation marks like $, %, ^, &, *, and ?.

Tokens = [“hello”, ”john”, ”how”, ”are”, “you”, “?”, “let’s”, “meet”, “tomorrow”, “for”, “a”, “movie”, ”I”, “bought”, “two”, “tickets”]

Updated_tokens = [“hello”, “john”, “let’s”, “meet”, “tomorrow”, “movie”, “bought”, “two”, “tickets”]
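A minimal sketch of this filtering step, assuming the token list above and NLTK's English stopword list extended with a few punctuation marks:

    import nltk
    nltk.download("stopwords")
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words("english")) | {"?", ",", "$", "%", "^", "&", "*"}
    tokens = ["hello", "john", "how", "are", "you", "?", "let's", "meet",
              "tomorrow", "for", "a", "movie", "i", "bought", "two", "tickets"]
    filtered = [token for token in tokens if token not in stop_words]
    print(filtered)
    # ['hello', 'john', "let's", 'meet', 'tomorrow', 'movie', 'bought', 'two', 'tickets']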

Stemming/Lemmatization

Now, after removing stopwords, we apply stemming or lemmatization to our words. Stemming means converting a word to its base form, i.e., its stem; lemmatization does the same thing using the word's dictionary form (its lemma).

Suppose you have words like:

  • going becomes go
  • playing becomes play
  • bought becomes buy

Updated_tokens = [“hello”, “john”, “let’s”, “meet”, “tomorrow”, “movie”, “bought”, “two”, “tickets”]

Updated_tokens = [“hello”, “john”, “let’s”, “meet”, “tomorrow”, “movie”, “buy”, “two”, “tickets”]
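A minimal sketch with NLTK's WordNetLemmatizer (pos="v" treats each word as a verb; a plain stemmer would leave "bought" unchanged):

    import nltk
    nltk.download("wordnet")
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    for word in ["going", "playing", "bought"]:
        print(word, "->", lemmatizer.lemmatize(word, pos="v"))
    # going -> go, playing -> play, bought -> buy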

Vectorization

Finally, we are ready to convert our text document into numbers. This technique is known as vectorization, and it builds a bag of words for us. There are a few techniques to convert text into vectors, and one of the most popular is TF-IDF.

TF-IDF – Here TF means Term Frequency, which measures how often each word occurs in a document. IDF means Inverse Document Frequency, which measures how informative a word is across the whole collection of documents; the classic formula is IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents that contain the term t.

Updated_tokens = [“hello”, “john”, “let’s”, “meet”, “tomorrow”, “movie”, “buy”, “two”, “tickets”]

So let's write down the frequencies in alphabetically sorted order:

{“buy” : 1, “hello” : 1, “john” : 1, “let’s” : 1, “meet” : 1, “movie” : 1, “tickets” : 1, “tomorrow” : 1, “two” : 1}

So I have written the frequency for each word. But remember, when we apply TF-IDF using a predefined library such as scikit-learn, the output is not going to look like this; this is just for understanding. If you want this kind of output, you can build the dictionary yourself with Python code, as in the sketch below.
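A minimal sketch, assuming the processed token list from above:

    from collections import Counter

    tokens = ["hello", "john", "let's", "meet", "tomorrow", "movie", "buy", "two", "tickets"]

    # Count each token, then sort the entries alphabetically by word
    frequencies = dict(sorted(Counter(tokens).items()))
    print(frequencies)
    # {'buy': 1, 'hello': 1, 'john': 1, "let's": 1, 'meet': 1, 'movie': 1,
    #  'tickets': 1, 'tomorrow': 1, 'two': 1}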

Above we looked at term frequencies; after also applying IDF, each text document is converted into a row of a matrix like this:

[ [0.23, 0, 0, 0.14, 0, 0, 0.34, 0, 0], [0, 0, 0, 0.14, 0, 0, 0.23, 0, 0] ]

This is the sort of output we get after applying IDF. It is not the exact output for the document we were using; the numbers are just for understanding.
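To make the weighting concrete, here is a rough sketch of the classic TF-IDF computation by hand on two tiny hypothetical token lists (scikit-learn's TfidfVectorizer, used later, applies a smoothed variant of this formula and also normalizes each row):

    import math

    # Two hypothetical, already-tokenized documents
    docs = [["hello", "john", "meet", "tomorrow", "movie", "buy", "two", "tickets"],
            ["hello", "movie", "good"]]
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)

    def tfidf_row(doc):
        row = []
        for word in vocab:
            tf = doc.count(word) / len(doc)            # term frequency
            df = sum(1 for d in docs if word in d)     # document frequency
            idf = math.log(n_docs / df)                # inverse document frequency
            row.append(round(tf * idf, 2))
        return row

    for doc in docs:
        print(tfidf_row(doc))
    # Words that appear in every document (e.g. "hello", "movie") get an IDF of 0 here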

Implementation using Python and NLTK

Now our text document has been converted into vector form, and we are ready to apply machine learning to our dataset. But before talking about machine learning, let us first implement everything up to this point.

To implement text processing we need two packages: NLTK and scikit-learn.

We could apply all of these techniques without these packages as well, but only if you are comfortable enough writing the logic yourself in Python.

To get started, first install the NLTK package. Open cmd and run the command: pip install nltk

Or, if you are using Jupyter Notebook through Anaconda, NLTK comes pre-installed with the Anaconda environment.

Let’s get started…

  • Open Jupyter Notebook and create a new file
  • Make sure NLTK is installed and downloaded properly
  • Import the necessary packages
  • Let's take some demo data first

So here I have taken a list and inserted a few random sentences, as in the sketch below.
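A minimal sketch of the setup and the demo data. The imports and downloads follow the steps above; the sentences themselves are hypothetical stand-ins (the originals are not shown in this post), except that the first one is chosen so that it later reduces to "match good", which the output discussed below refers to.

    import nltk
    nltk.download("punkt")       # tokenizer models
    nltk.download("stopwords")   # stopword lists
    nltk.download("wordnet")     # dictionary used by the lemmatizer

    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # Hypothetical demo data: six short sentences
    sentences = [
        "The match was very good.",
        "I am going to play cricket tomorrow.",
        "She bought two movie tickets yesterday.",
        "We are meeting John for dinner.",
        "He is playing football in the park.",
        "They watched a really bad movie.",
    ]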

The first thing we need to do is word tokenization.


Let's apply word tokenization to all the sentences:
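Continuing the sketch above, we lowercase each sentence and tokenize it:

    tokenized = [word_tokenize(sentence.lower()) for sentence in sentences]
    print(tokenized[0])
    # ['the', 'match', 'was', 'very', 'good', '.']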

Now the data is converted into a list of tokens. After that we need to remove stopwords.

What we do here is fetch the English stopwords from the stopwords.words method and then extend that list with a few punctuation marks that we want to remove. If you want to remove all punctuation, you can use the string package and get the full list of punctuation characters from it; but here we just extend the list of English stopwords. Both options are shown in the sketch below.
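Continuing the sketch, with the string-package alternative left as a comment:

    import string

    stop_words = stopwords.words("english")
    stop_words.extend([".", ",", "?", "!", "$", "%", "^", "&", "*"])
    # Alternative: drop every punctuation character at once
    # stop_words.extend(list(string.punctuation))

    filtered = [[token for token in tokens if token not in stop_words]
                for tokens in tokenized]
    print(filtered[0])
    # ['match', 'good']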

Next, we will apply stemming. To apply stemming we can use the PorterStemmer class, but it is not as accurate as WordNetLemmatizer. Let's see the difference:

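A sketch of the comparison (pos="v" tells the lemmatizer to treat each word as a verb), followed by lemmatizing the whole dataset:

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["going", "playing", "bought", "meeting"]:
        print(word, "| stem:", stemmer.stem(word),
              "| lemma:", lemmatizer.lemmatize(word, pos="v"))
    # going   | stem: go      | lemma: go
    # playing | stem: play    | lemma: play
    # bought  | stem: bought  | lemma: buy
    # meeting | stem: meet    | lemma: meet

    # Apply the lemmatizer to every token in the filtered data
    lemmatized = [[lemmatizer.lemmatize(token, pos="v") for token in tokens]
                  for tokens in filtered]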

So we can see that WordNetLemmatizer gives better results compared to PorterStemmer, so we will use it.

Now the data has been converted to its base form. Before moving to the next part, we will join the tokens and convert them back into sentences.
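A sketch of that step:

    joined = [" ".join(tokens) for tokens in lemmatized]
    print(joined[0])
    # match good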

That's how our data looks after applying the join method to all the tokens. Finally, it's time for the vectorization part. We mainly have three types of vectorization techniques:

  • CountVectorizer
  • TfIdfVectorizer
  • Hashing

CountVectorizer has a method known as fit that we can use to fit our data. After fitting the data, let's print the vocabulary.

The vocabulary is a dictionary, and the number in front of each word is its sorted index. There is also a method known as transform that we use after fitting the data; transform converts the data into a sparse matrix, which contains zeros in most places. Both steps are shown in the sketch below.
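Continuing the sketch with scikit-learn (the exact vocabulary and indexes depend on the demo sentences used):

    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer()
    cv.fit(joined)                  # learn the vocabulary from the cleaned sentences
    print(cv.vocabulary_)           # e.g. {'match': 9, 'good': 5, ...} -- word -> sorted index

    counts = cv.transform(joined)   # sparse document-term matrix
    print(counts.toarray())         # mostly zeros, with a 1 wherever a word occurs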

This is the outcome after applying CountVectorizer. How should we read it?

The shape of the matrix is 6×17 because we had 6 sentences and 17 unique words. The matrix tells us that if we take the first sentence, "match good", we will see a 1 only at the indexes where the words match and good appear. According to the sorted vocabulary, "match" is at index 9 and "good" at index 5, which is why in the first row the 5th and 9th indexes show 1 and the rest show 0.

Similarly, for the rest of the rows we get 0s and 1s. When we apply TfIdfVectorizer instead, the output is a little different: it is normalized, so instead of 1 we get values between 0 and 1.

We can use fit_transform directly instead of first fitting and then transforming the data. So, let's use this method and see the output:
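A sketch of the same pipeline with TfIdfVectorizer:

    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer()
    weights = tfidf.fit_transform(joined)   # fit and transform in a single call
    print(weights.toarray())                # normalized values between 0 and 1 instead of plain counts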

This is the output after applying TfIdfVectorizer: instead of 1, we get a normalized value wherever a word is present.

Finally, our data has been converted into vectorized form, and we can now apply machine learning to it.

There is a video on YouTube for this blog. Visit here to watch the video.
