Autocorrector Feature Using NLP In Python

Autocorrect detects misspelled words and suggests or applies the correct spelling, which makes tasks such as writing paragraphs, reports, and articles easier. Many websites and social media platforms use it to make their web apps more user-friendly.

Here we use Machine Learning and NLP to build an autocorrect generator that suggests correct spellings for an input word. We will use the Python programming language for this.

Let’s move ahead with the project.

We will be using the NLTK library for NLP-related tasks.

To import NLTK and download its data, use the below commands:

import nltk
nltk.download('all')

The first task is to read the text file that we will use to build the list of correctly spelled words.

You can download the text file from this link.

Python3

# importing the regular expression module
import re

# list of all the words in the text
w = []

# reading the text file
with open('final.txt', 'r', encoding="utf8") as f:
    file_name_data = f.read()
    file_name_data = file_name_data.lower()
    # raw string, so that \w is not treated as a string escape
    w = re.findall(r'\w+', file_name_data)

# vocabulary: the set of unique words
main_set = set(w)
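If final.txt is not at hand, the tokenization step can be tried on a short in-memory string. The following is a minimal sketch; sample_text is a made-up example, not part of the original dataset.

```python
import re

# sample_text is a hypothetical stand-in for the contents of final.txt
sample_text = "The cat sat. The cat's hat!"

# lowercase first, then pull out runs of word characters
tokens = re.findall(r'\w+', sample_text.lower())
vocab = set(tokens)

print(tokens)         # ['the', 'cat', 'sat', 'the', 'cat', 's', 'hat']
print(sorted(vocab))  # ['cat', 'hat', 's', 'sat', 'the']
```

Note that the pattern \w+ splits on apostrophes, so "cat's" becomes the two tokens "cat" and "s".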

Now we have to count the words and store their frequencies. For that we will use a dictionary.

Python3

# Function to count the frequency
# of the words in the whole text file
def counting_words(words):
    word_count = {}
    for word in words:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    return word_count
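The same word-to-frequency mapping can also be built with collections.Counter from the standard library. This is an alternative sketch, not the approach used in the article, and the words list is a toy example.

```python
from collections import Counter

# toy word list, chosen only for illustration
words = ['the', 'cat', 'sat', 'the']

# Counter builds the word -> frequency mapping in one call
word_count = Counter(words)

print(word_count['the'])  # 2
print(word_count['dog'])  # 0 (a missing word counts as zero instead of raising KeyError)
```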

Then the prob_cal function calculates the probability of each word, i.e. its count divided by the total number of words.

Python3

# Calculating the probability of each word
def prob_cal(word_count_dict):
    probs = {}
    # total number of words in the text
    m = sum(word_count_dict.values())
    for key in word_count_dict.keys():
        probs[key] = word_count_dict[key] / m
    return probs
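As a small worked example of this calculation, consider a text of four words in which "the" occurs twice. The counts below are toy values chosen only for illustration.

```python
# toy counts, chosen only for illustration
word_count = {'the': 2, 'cat': 1, 'sat': 1}

total = sum(word_count.values())  # 4 words in total
probs = {word: count / total for word, count in word_count.items()}

print(probs)                # {'the': 0.5, 'cat': 0.25, 'sat': 0.25}
print(sum(probs.values()))  # 1.0
```

The probabilities always add up to 1, so they can be compared directly when ranking candidate corrections.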

The rest of the code is divided into 5 main parts, which generate the different candidate words that can be formed from the input word.

To do this, we can use:

Lemmatization
Deleting a letter
Switching two letters
Replacing a letter
Inserting a new letter

Let’s see the code implementation of each point.

To do lemmatization, we will be using the pattern module. You can install it using the below command:

pip install pattern

Then you can use the below code:

Python3

# LemmWord: returns the lexeme of the first word in the input,
# i.e. the list of its inflected forms, using the pattern module
from pattern.en import lexeme

def LemmWord(word):
    return [lexeme(wd) for wd in word.split()][0]

DeleteLetter: a function that removes one letter from the given word, once for each position.

Python3

# Deleting letters from the words
def DeleteLetter(word):
    delete_list = []
    split_list = []

    # splitting the word into letters 0 to i and letters i to the end
    for i in range(len(word)):
        split_list.append((word[0:i], word[i:]))

    # joining the two parts again, leaving out the ith letter
    for a, b in split_list:
        delete_list.append(a + b[1:])
    return delete_list
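For a quick check of what the deletion step produces, the same idea can be written as a single list comprehension. This is a compact sketch, and delete_edits is a name chosen here for illustration.

```python
def delete_edits(word):
    # drop the character at index i, for every index in the word
    return [word[:i] + word[i + 1:] for i in range(len(word))]

print(delete_edits("word"))  # ['ord', 'wrd', 'wod', 'wor']
```

A word of length n gives n candidates, one per deleted position.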

Switch_: this function swaps two adjacent letters of the word.

Python3

# Switching two adjacent letters in a word
def Switch_(word):
    split_list = []
    switch_l = []

    # creating pairs by splitting the word at every position
    for i in range(len(word)):
        split_list.append((word[0:i], word[i:]))

    # keeping the first part (i.e. a) as it is,
    # then swapping the first and second characters of b
    switch_l = [a + b[1] + b[0] + b[2:] for a, b in split_list if len(b) >= 2]
    return switch_l
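The swap can also be expressed directly with indices, which makes it easy to see that only neighbouring letters are exchanged. This is a compact sketch, and switch_edits is a name chosen here for illustration.

```python
def switch_edits(word):
    # swap the characters at positions i and i + 1
    return [word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)]

print(switch_edits("word"))  # ['owrd', 'wrod', 'wodr']
```

A word of length n gives n - 1 candidates, so a single letter produces none.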

Replace_: it replaces one letter of the word with another letter.

Python3

# Replacing one letter of the word with another
def Replace_(word):
    split_l = []
    replace_list = []

    # splitting the word at every position
    for i in range(len(word)):
        split_l.append((word[0:i], word[i:]))

    # replacing the first letter of b with each letter of alphs, one by one
    alphs = 'abcdefghijklmnopqrstuvwxyz'
    replace_list = [a + l + (b[1:] if len(b) > 1 else '')
                    for a, b in split_l if b for l in alphs]
    return replace_list
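Counting the results for a two-letter word shows how many candidates the replacement step creates. This is a compact sketch, and replace_edits is a name chosen here for illustration.

```python
import string

def replace_edits(word):
    # substitute every lowercase letter at every position
    return [word[:i] + letter + word[i + 1:]
            for i in range(len(word))
            for letter in string.ascii_lowercase]

candidates = replace_edits("at")
print(len(candidates))       # 52 = 26 letters * 2 positions
print(len(set(candidates)))  # 51, because "at" itself appears twice
print("it" in candidates)    # True
```

Because each letter is also replaced by itself, the original word is always among the candidates.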

insert_: it inserts one additional letter of the alphabet at each position of the word, one by one.

Python3

# Inserting one new letter into the word
def insert_(word):
    split_l = []
    insert_list = []

    # making pairs by splitting the word at every position,
    # including the position after the last letter
    for i in range(len(word) + 1):
        split_l.append((word[0:i], word[i:]))

    # storing the new words in a list,
    # with one new character at each location
    alphs = 'abcdefghijklmnopqrstuvwxyz'
    insert_list = [a + l + b for a, b in split_l for l in alphs]
    return insert_list
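A short check on a two-letter word shows the three insertion points (start, middle, end). This is a compact sketch, and insert_edits is a name chosen here for illustration.

```python
import string

def insert_edits(word):
    # len(word) + 1 insertion points: before each letter and at the end
    return [word[:i] + letter + word[i:]
            for i in range(len(word) + 1)
            for letter in string.ascii_lowercase]

candidates = insert_edits("at")
print(len(candidates))      # 78 = 26 letters * 3 insertion points
print("cat" in candidates)  # True (inserted at the start)
print("art" in candidates)  # True (inserted in the middle)
print("ate" in candidates)  # True (inserted at the end)
```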

Now we have implemented all five steps. It’s time to merge all the words produced by those functions.

To implement that, we will be using 2 different functions:

Python3

# Collecting all the words that are one edit away
# in a set (so that no word will repeat)
def colab_1(word, allow_switches=True):
    words = set()
    words.update(DeleteLetter(word))
    if allow_switches:
        words.update(Switch_(word))
    words.update(Replace_(word))
    words.update(insert_(word))
    return words

# collecting the words that are two edits away,
# by applying colab_1 to every word returned by colab_1
def colab_2(word, allow_switches=True):
    words = set()
    edit_one = colab_1(word, allow_switches=allow_switches)
    for w in edit_one:
        if w:
            edit_two = colab_1(w, allow_switches=allow_switches)
            words.update(edit_two)
    return words
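To get a feel for how quickly the candidate sets grow, the one-edit and two-edit steps can be sketched in a self-contained form. The names edits1, edits2, and LETTERS are chosen here for illustration, and the four edit types are rewritten compactly on the assumption that they behave as described in this article.

```python
import string

LETTERS = string.ascii_lowercase

def edits1(word):
    # every (left, right) split of the word, including the empty right part
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    switches = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + switches + replaces + inserts)

def edits2(word):
    # apply the one-edit step again to every one-edit candidate
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

one = edits1("at")
two = edits2("at")
print(len(one))             # 130
print("cat" in one)         # True: one insertion
print("cart" in two)        # True: two insertions
print(len(two) > len(one))  # True
```

Most of these strings are not real words, which is why the next step keeps only the candidates found in the vocabulary.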

Now the main task is to pick out the correct words from all the candidates. To do so, we will be using a get_corrections function.

Python3

# Only keeping those candidates which are in the vocab
def get_corrections(word, probs, vocab, n=2):
    # prefer the word itself if it is known, then candidates one edit away,
    # then candidates two edits away (n is not used to limit the result here)
    if word in vocab:
        suggested_word = [word]
    else:
        suggested_word = list(colab_1(word).intersection(vocab)
                              or colab_2(word).intersection(vocab))

    # ordering the candidates from the most to the least probable
    best_suggestion = sorted(([s, probs[s]] for s in suggested_word),
                             key=lambda pair: pair[1], reverse=True)
    return best_suggestion
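One way to order candidates by probability and keep only the top n is sketched below. The candidates, the probabilities, and the rank helper are all hypothetical, chosen only to illustrate the ranking idea.

```python
# hypothetical candidates and probabilities, for illustration only
candidates = {'dared', 'died', 'dead'}
probs = {'dared': 0.001, 'died': 0.004, 'dead': 0.010}

def rank(candidates, probs, n=2):
    # most probable first; keep only the top n
    return sorted(candidates, key=lambda w: probs[w], reverse=True)[:n]

print(rank(candidates, probs))       # ['dead', 'died']
print(rank(candidates, probs, n=3))  # ['dead', 'died', 'dared']
```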

Now that the code is ready, we can test it on any user input with the below code.

Let’s print the top 3 suggestions made by the autocorrector.

Python3

# Input
my_word = input("Enter any word:")

# Counting the words (using the full word list w, not the set,
# so that repeated words get higher counts)
word_count = counting_words(w)

# Calculating probability
probs = prob_cal(word_count)

# only storing correct words
tmp_corrections = get_corrections(my_word, probs, main_set, 2)

# printing the top 3 suggestions
for word_prob in tmp_corrections[:3]:
    print(word_prob[0])

Output (the exact suggestions depend on the text file used):

Enter any word:daedd
dared
daned
died

Conclusion

So, we have implemented a basic autocorrector using Python and the NLTK library. As a next step, we can work on a more advanced autocorrect system that uses a much larger dataset and works more efficiently.

To enhance accuracy, we can also use transformers and other NLP techniques such as n-grams, TF-IDF, and so on.
