
Autocorrector Feature Using NLP In Python

Last Updated : 21 Dec, 2022

Autocorrect predicts the word a user meant to type and corrects the misspelling, which makes tasks like writing paragraphs, reports, and articles easier. Many websites and social media platforms use it today to make their web apps more user-friendly.

Here we use Machine Learning and NLP to build an autocorrector that suggests the correct spellings for an input word. We will use the Python programming language for this.

Let’s move ahead with the project.

We will be using the NLTK library for NLP-related tasks.

To import NLTK and download its data, use the commands below.

import nltk
nltk.download('all')

The first task is to read the text file that we will use to build the list of correctly spelled words.

You can download the text file from this link.

Python3




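# regular expressions, used to split the text into words
import re

# list of all the words in the text, in order
w = []

# reading the text file and lowercasing it
with open('final.txt', 'r', encoding="utf8") as f:
    file_name_data = f.read()
    file_name_data = file_name_data.lower()
    # raw string, so that \w is not treated as a string escape
    w = re.findall(r'\w+', file_name_data)

# vocabulary: the set of distinct words
main_set = set(w)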
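
To see what this tokenization produces, here is a minimal sketch on a made-up sentence. The sample_text string is an assumption standing in for the contents of final.txt.

```python
import re

# Hypothetical sample text standing in for the contents of final.txt
sample_text = "The cat sat. The cat's hat!"

# Lowercase first, then extract runs of word characters
tokens = re.findall(r'\w+', sample_text.lower())
vocab = set(tokens)

print(tokens)         # ['the', 'cat', 'sat', 'the', 'cat', 's', 'hat']
print(sorted(vocab))  # ['cat', 'hat', 's', 'sat', 'the']
```

Note that the apostrophe splits "cat's" into two tokens, so a stray "s" ends up in the vocabulary.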


Now we have to count the words and store their frequencies. For that we will use a dictionary.

Python3




# Function to count the frequency
# of each word in the whole text file
 
 
def counting_words(words):
    word_count = {}
    for word in words:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    return word_count
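
As an alternative sketch, the standard library's collections.Counter does the same counting in one call. The words list here is a made-up example, not data from the article.

```python
from collections import Counter

# Hypothetical token list (not taken from final.txt)
words = ["the", "cat", "sat", "the", "cat", "the"]

# Counter is a dict subclass that maps each word to its count
word_count = Counter(words)

print(word_count["the"])  # 3
print(word_count["dog"])  # 0, missing words count as zero
```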


Then, to calculate the probability of each word, the prob_cal function is used. It divides the count of each word by the total number of words.

Python3




# Calculating the probability of each word
def prob_cal(word_count_dict):
    probs = {}
    m = sum(word_count_dict.values())
    for key in word_count_dict.keys():
        probs[key] = word_count_dict[key] / m
    return probs
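
A small worked example may help here. Assuming a made-up count dictionary with 6 words in total, each probability is simply count / total:

```python
# Hypothetical counts: 6 words in total
word_count = {"the": 3, "cat": 2, "sat": 1}

total = sum(word_count.values())
probs = {word: count / total for word, count in word_count.items()}

print(total)         # 6
print(probs["the"])  # 0.5
```

The probabilities of all the words add up to 1 (up to floating-point rounding).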


The rest of the code is divided into 5 main parts, which generate the different candidate words that can be formed from the input word.

To do this, we can use:

  1. Lemmatization
  2. Deletion of letter
  3. Switching Letter
  4. Replace Letter
  5. Insert new Letter

Let’s see the code implementation of each point.

For lemmatization we will use the pattern module. You can install it using the command below.

pip install pattern

Then you can use the code below.

Python3




# LemmWord: looking up the first word of the input with the pattern module
# Note: lexeme() returns the list of inflected forms of a word,
# while lemma() returns only its root word
from pattern.en import lemma, lexeme


def LemmWord(word):
    return list(lexeme(wd) for wd in word.split())[0]


DeleteLetter: a function that removes one letter from the given word, once for each position.

Python3




# Deleting letters from the words
def DeleteLetter(word):
    delete_list = []
    split_list = []
 
    # considering letters 0 to i then i to -1
    # Leaving the ith letter
    for i in range(len(word)):
        split_list.append((word[0:i], word[i:]))
 
    for a, b in split_list:
        delete_list.append(a + b[1:])
    return delete_list
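
The same idea can be sketched more compactly with a list comprehension. The name delete_edits is hypothetical and is not part of the article's code.

```python
def delete_edits(word):
    # Drop the i-th letter, for every position i
    return [word[:i] + word[i + 1:] for i in range(len(word))]


print(delete_edits("word"))  # ['ord', 'wrd', 'wod', 'wor']
```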


Switch_: this function swaps two adjacent letters of the word.

Python3




# Switching two adjacent letters in a word
def Switch_(word):
    split_list = []
    switch_l = []

    # splitting the word into every (prefix, suffix) pair
    for i in range(len(word)):
        split_list.append((word[0:i], word[i:]))

    # keeping the prefix (i.e. a) as it is,
    # then swapping the first and second characters of the suffix (i.e. b)
    switch_l = [a + b[1] + b[0] + b[2:] for a, b in split_list if len(b) >= 2]
    return switch_l
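
A compact sketch of the same swap, using a hypothetical helper named switch_edits:

```python
def switch_edits(word):
    # Swap the letters at positions i and i + 1, for every adjacent pair
    return [word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)]


print(switch_edits("word"))  # ['owrd', 'wrod', 'wodr']
```

A word of length n yields n - 1 swaps, so a single-letter word yields none.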


Replace_: it replaces one letter of the word with another letter.

Python3




def Replace_(word):
    split_l = []
    replace_list = []
 
    # Replacing the letter one-by-one from the list of alphs
    for i in range(len(word)):
        split_l.append((word[0:i], word[i:]))
    alphs = 'abcdefghijklmnopqrstuvwxyz'
    replace_list = [a + l + (b[1:] if len(b) > 1 else '')
                    for a, b in split_l if b for l in alphs]
    return replace_list
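
Here is a short sketch of the replacement step with a hypothetical helper named replace_edits. It assumes the same lowercase a-z alphabet as above.

```python
import string


def replace_edits(word):
    # Try every lowercase letter at every position
    return [word[:i] + c + word[i + 1:]
            for i in range(len(word))
            for c in string.ascii_lowercase]


candidates = replace_edits("cat")
print(len(candidates))      # 78, i.e. 3 positions x 26 letters
print("cut" in candidates)  # True
print("cat" in candidates)  # True
```

The original word shows up among the candidates because each letter is also "replaced" by itself. This is harmless once the candidates are collected in a set.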


insert_: it inserts one additional letter of the alphabet at each position in the word, one at a time.

Python3




def insert_(word):
    split_l = []
    insert_list = []
 
    # Making pairs of the split words
    for i in range(len(word) + 1):
        split_l.append((word[0:i], word[i:]))
 
    # Storing new words in a list
    # But one new character at each location
    alphs = 'abcdefghijklmnopqrstuvwxyz'
    insert_list = [a + l + b for a, b in split_l for l in alphs]
    return insert_list
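
The insertion step can be sketched the same way. The helper name insert_edits is hypothetical.

```python
import string


def insert_edits(word):
    # len(word) + 1 gaps: before the first letter, between letters, after the last
    return [word[:i] + c + word[i:]
            for i in range(len(word) + 1)
            for c in string.ascii_lowercase]


candidates = insert_edits("ct")
print(len(candidates))      # 78, i.e. 3 gaps x 26 letters
print("cat" in candidates)  # True
print("act" in candidates)  # True
```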


Now we have implemented all five steps. It’s time to merge all the words generated by those functions.

To implement that we will use 2 functions: colab_1 collects every word that is one edit away from the input, and colab_2 collects every word that is two edits away.

Python3




# Collecting all the words one edit away
# in a set (so that no word repeats)
def colab_1(word, allow_switches=True):
    edits = set()
    edits.update(DeleteLetter(word))
    if allow_switches:
        edits.update(Switch_(word))
    edits.update(Replace_(word))
    edits.update(insert_(word))
    return edits

# Collecting all the words two edits away
# by applying colab_1 to every one-edit word
def colab_2(word, allow_switches=True):
    edits = set()
    edit_one = colab_1(word, allow_switches=allow_switches)
    for w in edit_one:
        if w:
            edit_two = colab_1(w, allow_switches=allow_switches)
            edits.update(edit_two)
    return edits
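
To get a feel for how many candidates these functions produce, here is a self-contained sketch. The names edits1 and edits2 are hypothetical compact stand-ins for colab_1 and colab_2, written so the block runs on its own.

```python
import string


def edits1(word):
    # All strings one edit away: delete, switch, replace, insert
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    switches = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + switches + replaces + inserts)


def edits2(word):
    # All strings two edits away: apply edits1 to every one-edit string
    return {e2 for e1 in edits1(word) if e1 for e2 in edits1(e1)}


print(len(edits1("cat")))              # 182
print("spelling" in edits1("spelng"))  # False, it needs two insertions
print("spelling" in edits2("spelng"))  # True
```

Most of these candidates are not real words, which is why the next step keeps only those found in the vocabulary.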


Now the main task is to extract the correct words from all the candidates. To do so, we will use the get_corrections function.

Python3




# Only keeping those candidates which are in the vocab
def get_corrections(word, probs, vocab, n=2):
    # n is kept from the original signature; it is not used here
    if word in vocab:
        # the word is already spelled correctly
        suggested_word = [word]
    else:
        # preferring words one edit away, then words two edits away
        suggested_word = list(colab_1(word).intersection(vocab)
                              or colab_2(word).intersection(vocab))

    # ordering the suggestions from the highest probability to the lowest
    best_suggestion = sorted(([s, probs[s]] for s in suggested_word),
                             key=lambda pair: pair[1], reverse=True)
    return best_suggestion
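
The ranking idea, picking the most frequent words among the valid candidates, can be illustrated on its own. The probabilities below are made-up values, not numbers computed from final.txt.

```python
# Hypothetical probabilities for three in-vocabulary candidates
probs = {"dared": 0.002, "died": 0.010, "dead": 0.025}
candidates = {"dared", "died", "dead"}

# Sort from the most probable word to the least probable
ranked = sorted(candidates, key=lambda word: probs[word], reverse=True)

print(ranked[:2])  # ['dead', 'died']
```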


Now the code is ready, and we can test it on any user input with the code below.

Let’s print the top 3 suggestions made by the autocorrector.

Python3




# Input
my_word = input("Enter any word:")

# Counting every occurrence in the word list w
# (counting main_set would give every word a count of 1)
word_count = counting_words(w)

# Calculating probability
probs = prob_cal(word_count)

# only storing correct words
tmp_corrections = get_corrections(my_word, probs, main_set, 2)
for i, word_prob in enumerate(tmp_corrections):
    if i < 3:
        print(word_prob[0])
    else:
        break


Output (the suggestions depend on the contents of final.txt and may differ):

Enter any word:daedd
dared
daned
died

Conclusion

So, we have implemented a basic autocorrector using the NLTK library and Python. As a next step, we can work on a more advanced autocorrector that uses a much larger dataset and works more efficiently.

To enhance accuracy, we can also use transformers and other NLP techniques such as n-grams, TF-IDF, and so on.
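
As a rough illustration of the n-gram idea, the sketch below uses bigram counts to choose between candidates based on the previous word. The toy corpus and the pick_with_context helper are assumptions made for this example and are not part of the autocorrector above.

```python
from collections import Counter

# Hypothetical toy corpus
tokens = "the cat sat on the mat the cat ran".split()

# Count how often each pair of neighbouring words occurs
bigrams = Counter(zip(tokens, tokens[1:]))


def pick_with_context(prev_word, candidates):
    # Prefer the candidate seen most often right after prev_word
    return max(candidates, key=lambda c: bigrams[(prev_word, c)])


print(pick_with_context("the", ["mat", "cat", "car"]))  # cat
```

Here "cat" wins because it follows "the" twice in the toy corpus, while "mat" follows it once and "car" never does.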


