Markov Chains in NLP

Last Updated : 31 Jul, 2023
A Markov chain is a mathematical model of a random process that evolves over time. It consists of a set of states and the transitions between them. The transitions are probabilistic, and the probability of moving to the next state depends only on the current state, not on the sequence of states that came before it. This model is used extensively in fields such as physics, chemistry, biology, economics, and computer science.

  • Transition matrix: The fundamental mathematical concept of a Markov chain is the transition matrix. This is a square matrix that describes the probability of moving from one state to another. If there are n states in the Markov chain, the transition matrix will be an n x n matrix, where each element (i,j) of the matrix represents the probability of moving from state i to state j. The sum of each row of the transition matrix must be 1, as the probabilities of moving to each state from the current state must add up to 1.
  • Chapman-Kolmogorov equation: A basic result of Markov chain theory is the Chapman-Kolmogorov equation. It states that the probability of moving from state i to state j in m + n steps is obtained by summing, over every intermediate state k, the probability of moving from i to k in m steps times the probability of moving from k to j in n steps. In matrix form, the n-step transition probabilities are the entries of the n-th power of the transition matrix. Separately, the probability of one particular sequence of transitions is the product of the probabilities of each individual transition.
  • To illustrate how a Markov chain works, let’s consider an example of a simple weather model. 
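  • Suppose that we have three states: sunny, cloudy, and rainy. The transition matrix for this Markov chain has one row per current state and one column per next state. The rows for sunny and cloudy might look like this (the rainy row is filled in the same way and must also sum to 1):

              sunny   cloudy   rainy
    sunny      0.8     0.1      0.1
    cloudy     0.4     0.4      0.2
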
  • This matrix represents the probabilities of moving from one weather state to another. For example, if it is currently sunny, there is an 80% chance that it will be sunny again tomorrow, a 10% chance that it will be cloudy, and a 10% chance that it will be rainy. Similarly, if it is currently cloudy, there is a 40% chance that it will remain cloudy, a 40% chance that it will become sunny, and a 20% chance that it will become rainy.
  • To use a Markov chain to generate a sequence of weather states, we start with an initial state and use the transition probabilities to randomly select the next state. We then repeat this process to generate a sequence of weather states. For example, if we start with a sunny day, we might generate the following sequence:
 sunny -> sunny -> rainy -> rainy -> cloudy -> rainy -> sunny -> sunny -> sunny -> ...
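
The sampling procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the sunny and cloudy rows use the probabilities quoted in the text, while the rainy row and the helper name `simulate` are assumptions made only for this example.

```python
import numpy as np

states = ["sunny", "cloudy", "rainy"]

# Rows are the current state, columns the next state (same order as `states`).
# The sunny and cloudy rows follow the text; the rainy row is an assumption.
P = np.array([
    [0.8, 0.1, 0.1],   # sunny  -> sunny, cloudy, rainy
    [0.4, 0.4, 0.2],   # cloudy -> sunny, cloudy, rainy
    [0.3, 0.3, 0.4],   # rainy  -> sunny, cloudy, rainy (assumed values)
])

def simulate(P, states, start, steps, seed=0):
    """Sample a path of `steps` transitions starting from `start`."""
    rng = np.random.default_rng(seed)
    current = states.index(start)
    path = [start]
    for _ in range(steps):
        # The next state is drawn from the row of the current state only
        current = rng.choice(len(states), p=P[current])
        path.append(states[current])
    return path

path = simulate(P, states, start="sunny", steps=8)
print(" -> ".join(path))
```

Changing `seed` gives a different sequence; every sequence still starts at the chosen state and only ever uses the row of the current state to pick the next one.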


  • Imagine you’re playing a game where you have to move from one room to another. Each room has a different color, and you can only move to certain rooms from each room. For example, if you’re in a red room, you can only move to a green or blue room.
  • A Markov chain is like this game but with numbers instead of colors. The rooms are called “states”, and the different paths you can take between them are called “transitions”. Each transition has a probability, which is like a chance of moving to the next state.
  • So, let’s say there are 3 states or rooms. You’re in state 1, and you have to move to state 2 or state 3. If the probability of moving to state 2 is 0.3, and the probability of moving to state 3 is 0.7, it means there’s a 30% chance of moving to state 2 and a 70% chance of moving to state 3 (and no chance of staying in state 1, since the probabilities must add up to 1).
  • The transition probabilities are written in a special table called a “transition matrix”. This matrix tells you the probability of moving from one state to another. For example, if you’re in state 1 and want to move to state 2 or state 3, you would look at the row for state 1 and the columns for state 2 and state 3 to find the probabilities.
  • Markov chains are used to model many things, like weather patterns, stock prices, and even text. They are especially useful when you want to predict what might happen in the future based on what’s happening right now.
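
Predicting further ahead than one step comes down to multiplying the transition matrix by itself. The sketch below is only an illustration: row 1 uses the 0.3 and 0.7 from the room example, while the rows for state 2 and state 3 are made-up values chosen so that each row sums to 1.

```python
import numpy as np

# Hypothetical 3-state transition matrix.
P = np.array([
    [0.0, 0.3, 0.7],  # state 1 (from the example above)
    [0.5, 0.0, 0.5],  # state 2 (assumed)
    [0.2, 0.8, 0.0],  # state 3 (assumed)
])

# Two-step and three-step transition probabilities are matrix powers
P2 = P @ P
P3 = np.linalg.matrix_power(P, 3)

# Chapman-Kolmogorov: a 3-step transition is a 2-step followed by a 1-step
assert np.allclose(P3, P2 @ P)

# Probability of going from state 1 to state 2 in exactly two steps,
# summed over every possible intermediate state k
two_step = sum(P[0, k] * P[k, 1] for k in range(3))
print(two_step, P2[0, 1])
```

The two printed numbers agree (up to floating-point rounding), which is the Chapman-Kolmogorov relation written out for one entry of the matrix.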

Markov Chains in Natural Language Processing (NLP)

Markov chains have been widely used in Natural Language Processing (NLP) applications such as text generation, speech recognition, and sentiment analysis. In this article, we discuss the concepts behind Markov chains in NLP, walk through the steps involved in using them, and work through examples with explanations.

Before we dive into the application of Markov Chains in NLP, let’s review some of the key concepts related to this topic:

  1. Markov Property: A process has the Markov property if the probability of moving to a future state depends only on the present state and not on the past.
  2. Transition Matrix: A matrix that represents the probabilities of moving from one state to another state.
  3. Stationary Distribution: A probability distribution over the states that remains unchanged after one step of the chain. If π is the distribution written as a row vector and P is the transition matrix, then π P = π.
  4. N-grams: A contiguous sequence of n items (words or characters) from a given sample of text.
  5. Language Model: A statistical model that assigns probabilities to sequences of words in a language.
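
Two of the concepts in this list, the transition matrix and the stationary distribution, can be checked numerically. The following is a rough sketch using repeated multiplication (power iteration); the matrix reuses the sunny and cloudy probabilities from the weather example, and the third row, the function name `stationary_distribution`, and its tolerance are assumptions for illustration.

```python
import numpy as np

P = np.array([
    [0.8, 0.1, 0.1],
    [0.4, 0.4, 0.2],
    [0.3, 0.3, 0.4],   # assumed row, not from the article
])

def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Approximate pi with pi @ P == pi by repeatedly applying P."""
    pi = np.full(P.shape[0], 1.0 / P.shape[0])   # start from uniform
    for _ in range(max_iter):
        nxt = pi @ P
        if np.abs(nxt - pi).max() < tol:
            return nxt
        pi = nxt
    return pi

pi = stationary_distribution(P)
print(pi)
print(np.allclose(pi @ P, pi))
```

Power iteration converges here because every entry of the matrix is positive; for chains that are periodic or not fully connected, a different method (for example solving the linear system directly) may be needed.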

Markov chain algorithm for generating sentences

To implement a Markov chain algorithm for generating sentences, we can follow a similar approach. We start by analyzing a corpus of text to determine the probabilities of transitioning from one word to another. For example, suppose we have the following sentence:

The quick brown fox jumps over the lazy dog.

We can create a Markov chain by treating each word as a state and estimating the probability of transitioning from one word to another. For example, after lowercasing, “the” is followed once by “quick” and once by “lazy”, so the probability of transitioning from “the” to “quick” is 0.5, while the probability of transitioning from “quick” to “brown” is 1.0. In practice, these probabilities are estimated from a large text corpus rather than from a single sentence. Once we have computed the transition probabilities, we can generate a new sentence by starting with an initial word and randomly selecting each next word according to the transition probabilities.
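
The counting behind these probabilities can be sketched with the standard library alone. This is a minimal example on the single sentence above; the variable names `counts` and `probs` are illustrative choices.

```python
from collections import Counter, defaultdict

sentence = "The quick brown fox jumps over the lazy dog."
words = sentence.lower().rstrip(".").split()

# Count how often each word is followed by each other word
counts = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    counts[current][nxt] += 1

# Turn the counts for each word into probabilities that sum to 1
probs = {
    current: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
    for current, followers in counts.items()
}

print(probs["the"])    # {'quick': 0.5, 'lazy': 0.5}
print(probs["quick"])  # {'brown': 1.0}
```

Note that “dog” never appears as a key, because nothing follows it in the sentence; a generator has to handle such dead ends explicitly.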


Now, let’s discuss the steps involved in using Markov Chains for text generation in NLP. The steps are as follows:

Step 1: Data Preprocessing

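The first step in any NLP task is data preprocessing. In this step, we clean the text data by removing unnecessary characters such as punctuation and converting the text to lowercase. Stop words are often removed in other NLP tasks, but we keep them here because a text generator needs them to produce natural-looking sentences.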


import re
text = "I love cats. Cats are my favorite animal. I have two cats."
# Remove unnecessary characters (everything except letters, digits and whitespace)
text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
# Convert to lowercase
text = text.lower()
print(text)


i love cats cats are my favorite animal i have two cats

Step 2: Generating N-grams

Next, we generate N-grams from the preprocessed text. N-grams are contiguous sequences of n words, where n is usually 2 or 3. For example, “the cat sat” is a 3-gram.


from nltk import ngrams
n = 2
# Generate 2-grams
n_grams = ngrams(text.split(), n)
# Convert to list of tuples
n_grams = list(n_grams)
print(n_grams)


[('i', 'love'), ('love', 'cats'), ('cats', 'cats'), ('cats', 'are'), ('are', 'my'),
 ('my', 'favorite'), ('favorite', 'animal'), ('animal', 'i'), ('i', 'have'), 
 ('have', 'two'), ('two', 'cats')]
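
If NLTK is not available, the same n-grams can be produced with plain Python. The helper name `make_ngrams` below is a hypothetical choice for this sketch, not an NLTK function.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token tuples, in order of appearance."""
    # zip over n shifted copies of the token list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "i love cats cats are my favorite animal i have two cats".split()
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams[:3])                  # [('i', 'love'), ('love', 'cats'), ('cats', 'cats')]
print(len(bigrams), len(trigrams))  # 11 10
```

A text with 12 tokens yields 11 bigrams and 10 trigrams, since each n-gram needs n consecutive tokens.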

Step 3: Building a Transition Matrix

After generating N-grams, we build a transition matrix that represents the probabilities of moving from one word to another. We calculate these probabilities by counting the number of times a particular word appears after another word in the N-grams.


import numpy as np
# Get unique words (set order is arbitrary, so the row/column order can differ between runs)
unique_words = list(set(text.split()))
# Map each word to its row/column index
word_index = {word: i for i, word in enumerate(unique_words)}
# Create transition matrix
transition_matrix = np.zeros((len(unique_words), len(unique_words)))
# Count the number of times each word is followed by each next word
for word, next_word in n_grams:
    transition_matrix[word_index[word], word_index[next_word]] += 1
# Normalize each row so it sums to 1; rows without any transitions stay all zero
row_sums = transition_matrix.sum(axis=1, keepdims=True)
row_sums[row_sums == 0] = 1
transition_matrix = transition_matrix / row_sums
print(transition_matrix)


[[0.  0.  0.  0.  0.  1.  0.  0.  0. ]
 [0.  0.  0.  0.5 0.  0.  0.5 0.  0. ]
 [0.  0.  0.5 0.  0.  0.  0.  0.  0.5]
 [0.  0.  1.  0.  0.  0.  0.  0.  0. ]
 [0.  1.  0.  0.  0.  0.  0.  0.  0. ]
 [0.  0.  0.  0.  1.  0.  0.  0.  0. ]
 [0.  0.  0.  0.  0.  0.  0.  1.  0. ]
 [0.  0.  1.  0.  0.  0.  0.  0.  0. ]
 [1.  0.  0.  0.  0.  0.  0.  0.  0. ]]
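
Because the rows and columns of the printed matrix carry no labels, it can help to fix the word order and look entries up by word. The following sketch rebuilds the matrix for the same text with a sorted vocabulary; the names `vocab`, `index`, and `P` are illustrative assumptions.

```python
import numpy as np

tokens = "i love cats cats are my favorite animal i have two cats".split()
vocab = sorted(set(tokens))                 # fixed, reproducible order
index = {w: k for k, w in enumerate(vocab)}

# Count each (current word, next word) pair
counts = np.zeros((len(vocab), len(vocab)))
for cur, nxt in zip(tokens, tokens[1:]):
    counts[index[cur], index[nxt]] += 1

# Normalize rows; a word with no observed successor keeps an all-zero row
row_sums = counts.sum(axis=1, keepdims=True)
P = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

print(vocab)
print(P[index["cats"], index["are"]])  # 0.5
print(P[index["i"], index["love"]])    # 0.5
```

With this layout, `P[index[a], index[b]]` reads directly as the estimated probability that word `b` follows word `a`.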

Step 4: Generating Text

Once we have the transition matrix, we can generate new text by starting with an initial word and randomly selecting the next word based on the probabilities in the transition matrix. We repeat this process until we have generated the desired amount of text.


# Set initial word
current_word = "i"
# Generate text
generated_text = current_word
for i in range(10):
    # Get index of current word
    current_word_index = unique_words.index(current_word)
    # Get probabilities for next word
    probabilities = transition_matrix[current_word_index]
    # Select next word randomly based on probabilities
    next_word_index = np.random.choice(len(unique_words), p=probabilities)
    next_word = unique_words[next_word_index]
    # Add next word to generated text
    generated_text += " " + next_word
    # Set current word to next word
    current_word = next_word
# Print generated text (the result varies from run to run)
print(generated_text)


i have two cats cats are my favorite animal i have
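
Since the output is random, it changes on every run. A variant that is reproducible and that stops cleanly at words with no observed successor might look like the sketch below; the function name `generate` and its `seed` parameter are assumptions introduced for this example, and the model is stored as nested counters rather than a matrix.

```python
import random
from collections import Counter, defaultdict

tokens = "i love cats cats are my favorite animal i have two cats".split()

# followers[w] counts the words observed right after w
followers = defaultdict(Counter)
for cur, nxt in zip(tokens, tokens[1:]):
    followers[cur][nxt] += 1

def generate(followers, start, length, seed=None):
    """Generate up to `length` more words after `start`."""
    rng = random.Random(seed)          # seeded generator for reproducibility
    words = [start]
    for _ in range(length):
        options = followers.get(words[-1])
        if not options:                # dead end: no observed successor
            break
        # Sample the next word in proportion to how often it followed
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        words.append(nxt)
    return " ".join(words)

print(generate(followers, "i", 10, seed=42))
```

Calling `generate` twice with the same `seed` returns the same text, which makes experiments easier to compare.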

Build the Markov model for a Large corpus

Step 1: Import the necessary libraries and Dataset


from datasets import load_dataset
import re
import random
from nltk.tokenize import word_tokenize
# Note: word_tokenize relies on NLTK's Punkt tokenizer data being downloaded beforehand
# Load the IIT Bombay English-Hindi parallel corpus
dataset = load_dataset("cfilt/iitb-english-hindi")
print(dataset)


DatasetDict({
    validation: Dataset({
        features: ['translation'],
        num_rows: 520
    })
    train: Dataset({
        features: ['translation'],
        num_rows: 1659083
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2507
    })
})

Step 2: Create an English text corpus


English = []
# Keep only the English side of each translation pair
for translation_pair in dataset["train"]["translation"]:
    english = translation_pair['en']
    English.append(english)



Step 3: Preprocessing


def Tokenize(txt):
    cleaned_txt = []
    for line in txt:
        line = line.lower()
        # The hyphen is placed last so it is a literal character, not a range
        line = re.sub(r"[,.\"\'!@#$%^&*(){}?/;`~:<>+=\\-]", "", line)
        tokens = word_tokenize(line)
        words = [word for word in tokens if word.isalpha()]
        cleaned_txt += words
    return cleaned_txt
Tokens = Tokenize(English)
print("number of words = ", len(Tokens))


number of words =  20930538

Step 4: Build the Markov Model


class MarkovModel:
    def __init__(self, n_gram=2):
        self.n_gram = n_gram
        self.markov_model = {}
    def build_model(self, text):
        # Each state is a chunk of n_gram consecutive words; the next state
        # is the chunk of n_gram words that immediately follows it
        for i in range(len(text) - 2*self.n_gram + 1):
            curr_state = " ".join(text[i:i+self.n_gram])
            next_state = " ".join(text[i+self.n_gram:i+2*self.n_gram])
            if curr_state not in self.markov_model:
                self.markov_model[curr_state] = {}
                self.markov_model[curr_state][next_state] = 1
            else:
                if next_state in self.markov_model[curr_state]:
                    self.markov_model[curr_state][next_state] += 1
                else:
                    self.markov_model[curr_state][next_state] = 1
        # calculating transition probabilities
        for curr_state, transition in self.markov_model.items():
            total = sum(transition.values())
            for state, count in transition.items():
                self.markov_model[curr_state][state] = count/total
    def get_model(self):
        return self.markov_model
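
To see what the states of such a model look like without loading the full corpus, the same idea can be tried on a handful of tokens. The function `build_chunk_model` below is a compact, hypothetical stand-in written for this illustration (it is not the class above), and the toy sentence is made up.

```python
from collections import Counter, defaultdict

def build_chunk_model(tokens, n=2):
    """Map each n-word chunk to the probabilities of the n-word chunk after it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - 2 * n + 1):
        curr_state = " ".join(tokens[i:i + n])
        next_state = " ".join(tokens[i + n:i + 2 * n])
        counts[curr_state][next_state] += 1
    # Convert counts into transition probabilities per state
    return {
        state: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for state, nexts in counts.items()
    }

toy_tokens = "you are here and you are there and you are here".split()
toy_model = build_chunk_model(toy_tokens, n=2)
print(toy_model["you are"])  # {'here and': 0.5, 'there and': 0.5}
```

Each key is a two-word state such as "you are", and each value lists the two-word states that followed it together with their estimated probabilities.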

Step 5: Train the Model


markov = MarkovModel()
# Build the transition probabilities from the tokenized corpus
markov.build_model(Tokens)
print("number of states = ", len(markov.get_model().keys()))


number of states =  3306270

Step 6: Generate the new text


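def generate_sentences(markov, limit=100, start='i am'):
    n = 0
    curr_state = start
    next_state = None
    story = ""
    story += curr_state+" "
    while n < limit:
        # Stop early if the current state was never seen in the corpus
        if curr_state not in markov:
            break
        # Sample the next state, weighted by its transition probability
        next_state = random.choices(list(markov[curr_state].keys()),
                                    list(markov[curr_state].values()))
        curr_state = next_state[0]
        story += curr_state+" "
        n += 1
    return story
# Generate 10 sentences
for i in range(10):
    print(str(i)+". ", generate_sentences(
        markov.get_model(), start='you are', limit=7))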


0.  you are my patron in this behalf or for the present the viceroy was taken aback 
1.  you are sending to all the state govts uts have been sanctioned for micro level planning 
2.  you are replying privately to a company formed and registered under the old system in both 
3.  you are believers and most of its mantras in prose form were tested companies on occasion 
4.  you are being tested scientists this is being replaced by three separate circles under one head 
5.  you are told to shout not thy prayer o moses make for the seeking for the 
6.  you are carried on by the indian people in powerful and militant islam stands for louis 
7.  you are not a valid folder please choose the audio you want to retract the distal 
8.  you are truthful if only those who repent and believe in him he has created you 
9.  you are face towards the sacred month for the past and also by the night when 

To summarize, a Markov chain is a statistical model of a sequence of events that predicts what is likely to happen next based only on the current state. In natural language processing, Markov chains can be used to generate text that resembles a given corpus, and they also appear in tasks such as speech recognition and sentiment analysis.

The basic steps for using Markov Chains in NLP are as follows:

  1. Choose a corpus of text to use as input for the Markov Chain.
  2. Parse the text into sequences of words or characters, depending on the desired level of granularity.
  3. Build a transition matrix that represents the probabilities of moving from one state (e.g., word or character) to another.
  4. Use the transition matrix to generate new sequences of text that are similar to the input corpus.

In addition to NLP, Markov Chains have applications in many other fields, such as finance, physics, and biology. They are a simple but powerful way to model complex systems and make predictions based on limited information.

Overall, Markov chains are a valuable tool in the data scientist’s toolkit, and anyone working with sequential data in machine learning or data analysis should be familiar with them.

