All About NLP
What is NLP?
NLP stands for Natural Language Processing, which means the machine will process the natural language we speak, i.e., interpret/translate whatever we say.
Applications of NLP
- Document Classification: read a document, which is a collection of a lot of text, and classify it into a specific category.
- Email filtering: checking whether a mail is ham or spam.
- Chatbots
- Machine Translations
- Topic/Document Modelling/Summarization: the model provides a summary after reading the entire document/topic. Eg: the INSHOT app might use it to read all the news and provide just a summary of each item.
- NLP is mostly about classification problems and rarely about regression.
- NLP problems are much closer to how humans actually communicate.
- The ROI for NLP is huge, because people cannot hand-write rules for an image or a document.
- For example, tweets for a business: let us say you run a marketing campaign on Twitter with a winter collection hashtag. If it becomes popular there will be lots of posts with the same hashtag. A simple insight into how many +ve/-ve tweets are happening with that hashtag (assuming no sarcasm) is extremely valuable for the business.
How do we solve an NLP Problem?
- Prior to NLP-specific techniques we used to do Label Encoding/One Hot Encoding of strings. The problem is that the sequence is lost: if you write the words from a document into different columns of a dataframe and label-encode them, the word order is no longer captured.
- By design, Label Encoding/One Hot Encoding treat values as independent, i.e., Green or Red values in a column do not depend on each other. Whereas in the sentence 'I am going to Market' all the words are dependent, because each word is used based on context.
How to represent Words in Vectors/Vector Space Model?
- The process of converting words to vectors (a vector space model) is called Embedding.
- In Vector space each word is a vector.
Data Representation Methods:
- Embedding: Embedding is of two types: A) Frequency based Embedding, B) Prediction based Embedding.
A) Frequency Based Embedding: To define any frequency based embedding you need to define what is a document, token, corpus & vocabulary.
- Document is basically the lowest level of granularity on which you are doing the analysis. Let us say you have to classify a tweet as +ve/-ve; then each tweet becomes a document. Similarly, if you have to classify whether a person is optimistic/pessimistic based on their tweets, then the collection of all the tweets made by that person becomes the document. So, based on the problem, your document can be a single tweet/multiple tweets/a page/a document/an SMS/a collection of SMS messages.
- Token is nothing but a word, also called a Term.
- Corpus is basically the entire collection of data. Let us say you have 3 documents of 100 pages each and you have to send each one to a specific department based on the type of document. Then each 100-page document is called a document and all 3 documents together are called the Corpus.
- Vocabulary is nothing but all the unique tokens/words in a corpus. The set of unique words/tokens of a corpus is called the Vocabulary. For example, a medical corpus can have a very different set of words compared to that of a movie corpus.
Each word → Token; collection of tokens → Document; collection of documents → Corpus; set of unique corpus words → Vocabulary
Term Document Matrix: vectorization of words in NLP is always relative to documents, i.e., your 2D matrix will always have documents and terms on its two axes. A term or a document can be a vector, i.e., the data in an entire row or an entire column is a vector.
A) Frequency Based Embedding: Here we’ll discuss 3 models
- BOW (Bag Of Words) Model
- Count Vectorizer Model
- TFIDF (Term Frequency — Inverse Document Frequency) Model
a) BOW (Bag Of Words) Model
Sentence1: Today we start with NLP and tomorrow will continue with NLP
Sentence2: Tomorrow we will continue with NLP
Disadvantages of BOW: frequency is not captured, there is no context information, and the sequence of words is missing (no positional reference is kept).
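Below is a minimal sketch of the BOW idea, assuming scikit-learn is available; CountVectorizer with binary=True records only whether a term is present, which matches the BOW behaviour described above.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Today we start with NLP and tomorrow will continue with NLP",  # Sentence1
    "Tomorrow we will continue with NLP",                           # Sentence2
]

# binary=True only records presence/absence of each vocabulary term,
# which is the BOW behaviour described above (frequency is not captured)
bow = CountVectorizer(binary=True)
X = bow.fit_transform(docs)

print(bow.get_feature_names_out())  # the vocabulary: unique tokens of the corpus
print(X.toarray())                  # one row per document, 1 if the term occurs at all
```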
b) Count Vectorizer Model
Sentence1: Today we start with NLP and tomorrow will continue with NLP
Sentence2: Tomorrow we will continue with NLP
Frequency(T5, D1) = 2, i.e., that term occurs twice in Document 1.
NOTE: the count of 2 for 'with' is flagged because 'with' is a stop word: a common word that can repeat in most sentences, or even many times within a single document. Hence we remove it from our vector data. Stop words are defined in the NLTK library, e.g. "the", "is", "in", "for", "where", "when", "to", "at", etc.
Challenges of BOW/CV models are: positional information is missing, context is missing, grammar is missing.
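A comparable sketch for the Count Vectorizer model, again assuming scikit-learn; here the actual counts are kept (so 'with' gets a count of 2 in Document 1), and stop_words="english" shows one way to drop common words (scikit-learn's built-in stop word list rather than NLTK's, purely for brevity).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Today we start with NLP and tomorrow will continue with NLP",  # Document 1
    "Tomorrow we will continue with NLP",                           # Document 2
]

# Plain counts: 'with' and 'nlp' both appear twice in Document 1
cv = CountVectorizer()
counts = cv.fit_transform(docs)
print(cv.get_feature_names_out())
print(counts.toarray())

# The same model with common (stop) words such as 'with' removed
cv_no_stop = CountVectorizer(stop_words="english")
print(cv_no_stop.fit_transform(docs).toarray())
print(cv_no_stop.get_feature_names_out())
```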
How to handle sparse data, i.e., the number of features (words) in the matrix increasing?
If the data is sparse the Document Term Matrix becomes unstable. So, to handle sparse data we have to perform the below:
- Stop words removal
- Lemmatization → takes any word and reduces it to its root word, i.e., if the word is 'cats' the root word is 'cat'; similarly 'went' becomes 'go' and 'beautifully' reduces to the root 'beauty'.
NOTE: Though Lemmatization causes some loss of information, that is not a concern for Frequency Based Embedding.
- Stemming → chops word endings by fixed rules; it does not necessarily produce the actual root word.
Rules in Stemming:
Rule1: If my word ends with "ed", cut off that "ed" part. (This doesn't work for every word; for example you cannot cut the "ed" from "red" or "died" the way it works for "trespassed". Still, that causes no real issue, because every occurrence of "red" is trimmed to the same stem, so all of them can later be mapped back to "red".)
Rule2: If my word ends with 'y', remove the 'y'. For example, it turns 'try' into 'tr' and 'party' into 'part'.
Ex: I went to a party tonight, it was very intense.
Upon applying stemming it becomes: I, go, to, a, part, tonight, it, is, ver, intense.
Upon applying stop words removal it becomes: part, tonight, ver, intense
- Case sensitivity (convert everything to lower or upper case)
NOTE: You cannot use stop word removal in the case of Sentiment Analysis or Chatbots.
NOTE: You cannot do both Stemming and Lemmatization on a document; perform either one of them.
Difference between Lemmatization & Stemming
Stemming and Lemmatization both generate the root form of inflected words. The difference is that a stem might not be an actual word, whereas a lemma is an actual language word.
Stemming follows an algorithm with steps to perform on the words, which makes it faster. In lemmatization, you use the WordNet corpus (and a corpus of stop words as well) to produce the lemma, which makes it slower than stemming. You also have to supply a part-of-speech tag to obtain the correct lemma.
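A small NLTK sketch contrasting the two; the exact outputs depend on the stemmer and on the part-of-speech tag you supply, so treat the comments as indicative rather than exact.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the WordNet corpus is needed for lemmatization

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for w in ["cats", "went", "beautifully", "party", "trespassed", "intense"]:
    print(w,
          "| stem:", stemmer.stem(w),                          # rule-based suffix chopping
          "| lemma (noun):", lemmatizer.lemmatize(w),          # default POS is noun
          "| lemma (verb):", lemmatizer.lemmatize(w, pos="v"))  # the POS tag changes the result
```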
c) TF-IDF (Term Frequency — Inverse Document Frequency) Model: the term frequency of a particular term in a particular document is divided by the total number of terms in that document, and the result is multiplied by a Value.
Value is nothing but the Inverse Document Frequency, i.e., log[((1 + total number of documents)/(1 + number of documents containing that particular term)) + 1]
Ex: Sentence1: Have go class tomorrow
Sentence2: Tomorrow Monday meeting morning
TF-IDF(Have, Sentence1) = 1/4 * log[((3)/(2)) + 1]
TF-IDF(Tomorrow,Sentence1) = 1/4 * log[((3)/(3)) + 1] = 1/4 * log[2]
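A quick sketch that reproduces the two values above using this document's formula directly (computed by hand rather than with a library, since library implementations may smooth slightly differently).

```python
import math

docs = [
    "have go class tomorrow".split(),           # Sentence1
    "tomorrow monday meeting morning".split(),  # Sentence2
]
N = len(docs)  # total number of documents (2 here)

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)            # normalized term frequency
    df = sum(term in d for d in docs)          # number of documents containing the term
    idf = math.log(((1 + N) / (1 + df)) + 1)   # the smoothed IDF defined above
    return tf * idf

print(tf_idf("have", docs[0]))      # 1/4 * log(3/2 + 1)
print(tf_idf("tomorrow", docs[0]))  # 1/4 * log(3/3 + 1) = 1/4 * log(2)
```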
For example: let us say a document contains the words cancer, present and brain. If the frequency of the word 'present' is far higher, it diminishes the importance of the other words like cancer and brain. So we use TF-IDF, where we reduce the weight of the word 'present' by multiplying the Term Frequency with the Inverse Document Frequency.
TF on its own behaves like the Count Vectorizer: very frequent words dominate and diminish the importance of the other words in the document. Hence we scale the value down by multiplying it with the IDF.
Range of TF is 0–1 and IDF is >=0
NOTE: we have to represent documents on rows, as documents may share similarities, unlike terms.
NOTE: Let us say you have prepared a Document Term Matrix from the training data, and the test data contains new tokens that are not present in that matrix. You can add those tokens to the Document Term Matrix, but then there will be tokens in the matrix that do not appear in any training document, so while calculating TF-IDF their document frequency would be 0. To handle this, keep an extra 1 in both the numerator and the denominator of the IDF formula.
IDF(t) = log(((1 + N)/(1 + df(t))) + 1)
The full score can then be written as tf * log(((1 + N)/(1 + df(t))) + 1), where tf is the normalized term frequency, N is the total number of documents and df(t) is the number of documents containing the term t.
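For instance, for a token that never appears in any training document, df(t) = 0, so IDF(t) = log(((1 + N)/(1 + 0)) + 1) = log(N + 2), which stays finite; this is exactly what the extra 1s in the formula guarantee.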
NOTE: In Frequency Based Embedding the features are created independently of one another; in other words, we did not capture any dependency amongst the words.
B) Prediction Based Embedding:
- Unsupervised way: model a word as a function of its Context Words.
NOTE: there are no explicit labels in the unsupervised way; the context words themselves provide the training signal.
- Supervised way: a sequential model (RNN/LSTM/GRU) learns an embedding specific to the problem being solved. This follows a sequential approach where there is a relationship between each and every word.
For example: if x1, x2, x3, x4 are the words then x2 = f(x1), x3 = f(x2, x1), x4 = f(x3, x2, x1), i.e., the value of x2 depends on x1, x3 depends on x2 and x1, and x4 depends on x3, x2 and x1.
- Recurrent Neural Network : RNN is used for capturing data in a sequence. The inputs of RNN are time dependent.
For example: let us say you have 5 companies in your mutual fund portfolio. Using the last 10 days of stock prices of these companies you want to predict whether your portfolio value increases or decreases. If you are asked to predict for 1st Jan based on the previous 10 days, you have one 5*10 matrix. If you are also asked to predict for 20th Jan based on its previous 10 days, your data shape becomes 2*(5*10); it is 2 because you have one sample for Jan 1st and one for Jan 20th.
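In array terms this is a 3D input of one 5 x 10 block per prediction date; a quick NumPy sketch of the portfolio example (random numbers stand in for the actual prices):

```python
import numpy as np

# One sample per prediction date (Jan 1st and Jan 20th), each holding the
# last 10 days of prices for the 5 companies: a 5 x 10 block per sample
samples, companies, days = 2, 5, 10
X = np.random.rand(samples, companies, days)  # placeholder prices

print(X.shape)     # (2, 5, 10), i.e. 2*(5*10) as described above
print(X[0].shape)  # a single sample is a 5 x 10 matrix
```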
Steps in RNN:
- Convert the words in a sentence to integers that feed the Embedding Layer, i.e., find the frequency of each word in the sentence and index the words in descending order of frequency (ties broken alphabetically), starting with Index = 1. (A code sketch of these steps appears after the worked example below.)
- You can either take all the words in a sentence or keep a cap on the length. For example, if you allow at most 20 words per sentence and your 1st sentence has 38 words, then only the first 20 or the last 20 words of the sequence from Step 1 are kept. If your 2nd sentence contains only 15 words, the remaining 5 positions need padding (ideally 0s), either at the beginning or at the end.
- Now we have all the words in integer form, but we can't feed every individual word in as it is. So we introduce something called abstract ideas. For example, if you have words like coffee, juice, tea, milk, alcohol, etc., it doesn't make sense to train the model on each of these words separately; we would rather group them under an abstract idea such as 'beverage'. These abstract ideas start out as random notions the RNN model creates at runtime. If your data has words that don't fall under any of the abstract ideas, that's fine: we calculate the loss and back-propagate, the model updates its abstract ideas to better match the given sentences, and the model is run again until the loss is minimized.
- Now we build an embedding matrix with the indexes from Step 2 as rows and the abstract ideas from Step 3 as columns; each word (represented by its integer) from Sentence 1 is given a score between 0 and 1 based on how close it is to the abstract idea in that column.
- The rows from Step 4 are treated as timestep vectors and serve as the input vectors (the Embedding layer) for our RNN model, where we pick a number for how many neurons the Embedding layer should have.
- The RNN model takes random weights, a bias and the output from the previous timestep. For example, if Sentence 1 is "The movie is well taken and all the characters are well portrayed", then 'the' and 'well' occur twice, so they get indexes 1 and 2, followed by the rest of the words in alphabetical order (this is the text-to-integer conversion from Step 1); then we build the embedding matrix based on abstract ideas as in Step 3.
Index 1: the (frequency 2)
Index 2: well (frequency 2)
Index 3: all (frequency 1)
Index 4: and (frequency 1)
Index 5: are (frequency 1)
Index 6: characters (frequency 1)
Index 7: is (frequency 1)
Index 8: movie (frequency 1)
Index 9: portrayed (frequency 1)
Index 10: taken (frequency 1)
The sentence now looks like this: 1 8 7 2 10 4 3 1 6 5 2 9. For this sequence we build the embedding matrix based on abstract ideas, and each row represents a word (a timestep), which is given as input along with random weights, a bias and the output of the previous timestep.
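Here is a minimal code sketch of these steps (assuming Keras/TensorFlow; the layer sizes, label and epoch count are placeholder choices, not from the original write-up). The word indexing reproduces the 1 8 7 2 10 4 3 1 6 5 2 9 sequence above, and the trainable Embedding layer plays the role of the "abstract ideas" matrix.

```python
from collections import Counter

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

sentence = "The movie is well taken and all the characters are well portrayed"
words = sentence.lower().split()

# Step 1: index words by descending frequency, breaking ties alphabetically,
# starting at index 1 (index 0 is reserved for padding)
counts = Counter(words)
vocab = sorted(counts, key=lambda w: (-counts[w], w))
word_index = {w: i + 1 for i, w in enumerate(vocab)}
encoded = [word_index[w] for w in words]
print(word_index)  # {'the': 1, 'well': 2, 'all': 3, ...}
print(encoded)     # [1, 8, 7, 2, 10, 4, 3, 1, 6, 5, 2, 9]

# Step 2: cap/pad every sentence to a fixed length (0s at the end here)
max_len = 20
padded = encoded[:max_len] + [0] * (max_len - len(encoded))
X = np.array([padded])
y = np.array([1])  # hypothetical label, e.g. a positive review

# Steps 3-5: a trainable Embedding layer (the "abstract ideas") feeding a SimpleRNN
model = Sequential([
    Embedding(input_dim=len(word_index) + 1, output_dim=8),  # 8 abstract ideas per word
    SimpleRNN(16),                     # hidden state carried from one timestep to the next
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=2, verbose=0)   # loss is back-propagated; embedding values update
```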
Disadvantages of RNN: it forgets long sentences, because it has only short-term memory. If your sentence has a lot of words, the gradient vanishes during back-propagation by the time it reaches the first words of the sentence, so their contribution is forgotten. So we ideally prefer an RNN when the number of words in a sentence is small. To handle this issue, LSTM (Long Short Term Memory) was introduced.
NOTE: Embedding layer: we could use a DTM (Document Term Matrix) or OHE (One Hot Encoding) as the embedding layer, but that doesn't add any value because the DTM just holds counts and OHE gives the same vector for a word wherever it repeats. Hence we use a trainable Embedding layer, which learns from the model and updates the values in the embedding layer.
NOTE: This is a work-in-progress story and will be updated with a few more algorithms.