
Markov Chains in NLP

A Markov chain is a mathematical model used to describe random processes that evolve over time. It consists of a set of states and the transitions between them. These transitions are probabilistic: the probability of moving from one state to another depends only on the current state, not on any past events. The model is used extensively in fields such as physics, chemistry, biology, economics, and computer science.

Markov chain

As an analogy, consider a simple weather model with three states: sunny, cloudy, and rainy. Each row of the transition matrix below gives the probabilities of moving from that row's state to each column's state:

            sunny    cloudy    rainy
  sunny      0.8      0.1       0.1
  cloudy     0.4      0.4       0.2
  rainy      0.2      0.3       0.5

A sequence generated by this chain might look like:

sunny -> sunny -> rainy -> rainy -> cloudy -> rainy -> sunny -> sunny -> sunny -> ...
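To see how such a chain behaves in code, here is a minimal sketch (our own, assuming only NumPy; the variable names are illustrative) that simulates the weather chain using the transition matrix above:

import numpy as np

states = ["sunny", "cloudy", "rainy"]
P = np.array([[0.8, 0.1, 0.1],   # sunny  -> sunny/cloudy/rainy
              [0.4, 0.4, 0.2],   # cloudy -> sunny/cloudy/rainy
              [0.2, 0.3, 0.5]])  # rainy  -> sunny/cloudy/rainy

rng = np.random.default_rng(seed=0)
state = 0  # start in "sunny"
chain = [states[state]]
for _ in range(9):
    # The next state depends only on the current state (Markov property)
    state = rng.choice(len(states), p=P[state])
    chain.append(states[state])
print(" -> ".join(chain))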


Markov Chains in Natural Language Processing (NLP)

Markov chains have been widely used in Natural Language Processing (NLP) applications such as text generation, speech recognition, and sentiment analysis. In this article, we will discuss the concepts behind Markov chains in NLP, the steps involved in using them, and walk through worked examples with explanations.



Before we dive into the application of Markov Chains in NLP, let’s review some of the key concepts related to this topic:

  1. Markov Property: A process has the Markov property if the probability of moving to a future state depends only on the present state and not on the past.
  2. Transition Matrix: A matrix that represents the probabilities of moving from one state to another state.
  3. Stationary Distribution: A probability distribution over the states that remains unchanged after applying the transition matrix, i.e., πP = π (see the sketch after this list).
  4. N-grams: A contiguous sequence of n items (words or characters) from a given sample of text.
  5. Language Model: A statistical model that assigns probabilities to sequences of words in a language.
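As a quick illustration of the stationary distribution, the following minimal sketch (our own, assuming NumPy) computes it for the weather matrix from earlier by finding the left eigenvector of the transition matrix with eigenvalue 1:

import numpy as np

P = np.array([[0.8, 0.1, 0.1],
              [0.4, 0.4, 0.2],
              [0.2, 0.3, 0.5]])

# A stationary distribution is a left eigenvector of P with eigenvalue 1
eigenvalues, eigenvectors = np.linalg.eig(P.T)
pi = np.real(eigenvectors[:, np.isclose(eigenvalues, 1)].flatten())
pi = pi / pi.sum()  # normalize so the probabilities sum to 1
print(pi)  # unchanged by a transition: pi @ P == pi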

Markov chain algorithm for generating sentences

To implement a Markov chain algorithm for generating sentences, we can follow a similar approach. We start by analyzing a corpus of text to determine the probabilities of transitioning from one word to another. For example, suppose we have the following sentence:



The quick brown fox jumps over the lazy dog.

We can create a Markov chain by treating each word as a state and estimating the probability of transitioning from one word to another. For example, the probability of transitioning from “the” to “quick” is 0.5 (since “the” is followed by “quick” once and by “lazy” once), the probability of transitioning from “quick” to “brown” is 1.0, and so on; with a large corpus these probabilities are estimated from many such counts. Once we have computed the transition probabilities, we can generate a new sentence by starting with an initial word and repeatedly sampling the next word according to those probabilities.
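To make this concrete, here is a minimal sketch (standard library only, names our own) that counts the word bigrams in this sentence and converts the counts into transition probabilities:

from collections import Counter, defaultdict

words = "the quick brown fox jumps over the lazy dog".split()

# Count how often each word is followed by each other word
counts = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    counts[current][nxt] += 1

# Normalize the counts into transition probabilities
probs = {w: {nxt: c / sum(followers.values())
             for nxt, c in followers.items()}
         for w, followers in counts.items()}

print(probs["the"])  # {'quick': 0.5, 'lazy': 0.5}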

Implementations

Now, let’s discuss the steps involved in using Markov Chains for text generation in NLP. The steps are as follows:

Step 1: Data Preprocessing

The first step in any NLP task is data preprocessing. In this step, we clean the text data by removing unnecessary characters and converting the text to lowercase. (Stop words are often removed in other NLP tasks, but we keep them here because they are needed to generate fluent text.)




import re
 
text = "I love cats. Cats are my favorite animal. I have two cats."
 
# Remove unnecessary characters
text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
 
# Convert to lowercase
text = text.lower()
 
print(text)

Output:

i love cats cats are my favorite animal i have two cats

Step 2: Generating N-grams

Next, we generate N-grams from the preprocessed text. N-grams are contiguous sequences of n words, where n is usually 2 or 3. For example, “the cat sat” is a 3-gram.




from nltk import ngrams
 
n = 2
 
# Generate 2-grams
n_grams = ngrams(text.split(), n)
 
# Convert to list of tuples
n_grams = list(n_grams)
 
print(n_grams)

Output:

[('i', 'love'), ('love', 'cats'), ('cats', 'cats'), ('cats', 'are'), ('are', 'my'),
 ('my', 'favorite'), ('favorite', 'animal'), ('animal', 'i'), ('i', 'have'), 
 ('have', 'two'), ('two', 'cats')]

Step 3: Building a Transition Matrix

After generating N-grams, we build a transition matrix that represents the probabilities of moving from one word to another. We calculate these probabilities by counting the number of times a particular word appears after another word in the N-grams.




import numpy as np
 
# Get unique words
unique_words = list(set(text.split()))
 
# Create transition matrix
transition_matrix = np.zeros((len(unique_words), len(unique_words)))
 
# Fill transition matrix
for i, word in enumerate(unique_words):
    for j, next_word in enumerate(unique_words):
        # Count the number of times a word appears followed by next_word
        count = 0
        for n_gram in n_grams:
            if n_gram[0] == word and n_gram[1] == next_word:
                count += 1
        transition_matrix[i, j] = count
 
# Normalize each row so it sums to 1 (in this corpus every unique word
# starts at least one bigram, so no row sums to zero)
transition_matrix = transition_matrix / \
    transition_matrix.sum(axis=1, keepdims=True)
 
print(transition_matrix)

Output:

[[0.  0.  0.  0.  0.  1.  0.  0.  0. ]
 [0.  0.  0.  0.5 0.  0.  0.5 0.  0. ]
 [0.  0.  0.5 0.  0.  0.  0.  0.  0.5]
 [0.  0.  1.  0.  0.  0.  0.  0.  0. ]
 [0.  1.  0.  0.  0.  0.  0.  0.  0. ]
 [0.  0.  0.  0.  1.  0.  0.  0.  0. ]
 [0.  0.  0.  0.  0.  0.  0.  1.  0. ]
 [0.  0.  1.  0.  0.  0.  0.  0.  0. ]
 [1.  0.  0.  0.  0.  0.  0.  0.  0. ]]

Step 4: Generating Text

Once we have the transition matrix, we can generate new text by starting with an initial word and randomly selecting the next word based on the probabilities in the transition matrix. We repeat this process until we have generated the desired amount of text.




# Set initial word (must be one of the unique words)
current_word = "i"
 
# Generate text
generated_text = current_word
 
for i in range(10):
    # Get index of current word
    current_word_index = unique_words.index(current_word)
 
    # Get probabilities for next word
    probabilities = transition_matrix[current_word_index]
 
    # Select next word randomly based on probabilities
    next_word_index = np.random.choice(len(unique_words), p=probabilities)
    next_word = unique_words[next_word_index]
 
    # Add next word to generated text
    generated_text += " " + next_word
 
    # Set current word to next word
    current_word = next_word
 
# Print generated text
print(generated_text)

Output:

i have two cats cats are my favorite animal i have

Build the Markov Model for a Large Corpus

Step 1: Import the necessary libraries and Dataset




import re
import random
from datasets import load_dataset
from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')
 
dataset = load_dataset("cfilt/iitb-english-hindi")
dataset

Output:

DatasetDict({
    validation: Dataset({
        features: ['translation'],
        num_rows: 520
    })
    train: Dataset({
        features: ['translation'],
        num_rows: 1659083
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2507
    })
})

Step 2: Create an English text corpus




English = []
 
for translation_pair in dataset["train"]["translation"]:
    english = translation_pair['en']
    English.append(english.strip("\n"))
 
len(English)

Output:

1659083

Step 3: Preprocessing




def Tokenize(txt):
    cleaned_txt = []
    for line in txt:
        line = line.lower()
        # Strip punctuation; the hyphen is placed last in the character
        # class so it is not interpreted as a range
        line = re.sub(r"[,.\"'!@#$%^&*(){}?/;`~:<>+=\\-]", "", line)
        tokens = word_tokenize(line)
        # Keep alphabetic tokens only
        words = [word for word in tokens if word.isalpha()]
        cleaned_txt += words
    return cleaned_txt
 
 
Tokens = Tokenize(English)
print("number of words = ", len(Tokens))

Output:

number of words =  20930538

Step 4: Build the Markov Model




class MarkovModel:
 
    def __init__(self, n_gram=2):
        self.n_gram = n_gram
        self.markov_model = {}
 
    def build_model(self, text):
        # The current state is an n_gram-word window and the next state is
        # the following n_gram-word window, so the last valid start index
        # is len(text) - 2*n_gram
        for i in range(len(text) - 2*self.n_gram + 1):
            curr_state, next_state = "", ""
            for j in range(self.n_gram):
                curr_state += text[i+j] + " "
                next_state += text[i+j+self.n_gram] + " "
            curr_state = curr_state[:-1]
            next_state = next_state[:-1]
            if curr_state not in self.markov_model:
                self.markov_model[curr_state] = {}
                self.markov_model[curr_state][next_state] = 1
            else:
                if next_state in self.markov_model[curr_state]:
                    self.markov_model[curr_state][next_state] += 1
                else:
                    self.markov_model[curr_state][next_state] = 1
 
        # calculating transition probabilities
        for curr_state, transition in self.markov_model.items():
            total = sum(transition.values())
            for state, count in transition.items():
                self.markov_model[curr_state][state] = count/total
 
    def get_model(self):
        return self.markov_model

Step 5: Train the Model




markov = MarkovModel()
markov.build_model(Tokens)
print("number of states = ", len(markov.get_model().keys()))

Output:

number of states =  3306270

Step 6: Generate the new text




def generate_sentences(markov, limit=100, start='i am'):
    n = 0
    curr_state = start
    next_state = None
    story = ""
    story += curr_state+" "
    while n < limit:
        # Sample the next state, weighted by the transition probabilities
        next_state = random.choices(
            list(markov[curr_state].keys()),
            list(markov[curr_state].values()))
 
        curr_state = next_state[0]
        story += curr_state+" "
        n += 1
    return story
 
 
# Generate 10 sentences
for i in range(10):
    print(str(i)+". ", generate_sentences(
        markov.get_model(), start='you are', limit=7))

Output:

0.  you are my patron in this behalf or for the present the viceroy was taken aback 
1.  you are sending to all the state govts uts have been sanctioned for micro level planning 
2.  you are replying privately to a company formed and registered under the old system in both 
3.  you are believers and most of its mantras in prose form were tested companies on occasion 
4.  you are being tested scientists this is being replaced by three separate circles under one head 
5.  you are told to shout not thy prayer o moses make for the seeking for the 
6.  you are carried on by the indian people in powerful and militant islam stands for louis 
7.  you are not a valid folder please choose the audio you want to retract the distal 
8.  you are truthful if only those who repent and believe in him he has created you 
9.  you are face towards the sacred month for the past and also by the night when 

To summarize, a Markov chain is a statistical model that lets us model a sequence of events and predict what is likely to happen next based only on the current state. In natural language processing, Markov chains can be used to generate text that resembles a given corpus, support tasks such as sentiment analysis, and more.

The basic steps for using Markov Chains in NLP are as follows:

  1. Choose a corpus of text to use as input for the Markov Chain.
  2. Parse the text into sequences of words or characters, depending on the desired level of granularity.
  3. Build a transition matrix that represents the probabilities of moving from one state (e.g., word or character) to another.
  4. Use the transition matrix to generate new sequences of text that are similar to the input corpus.

In addition to NLP, Markov Chains have applications in many other fields, such as finance, physics, and biology. They are a simple but powerful way to model complex systems and make predictions based on limited information.

Overall, Markov chains are a valuable tool in the data scientist’s toolkit, and anyone working with sequential data or interested in machine learning should be familiar with them.

