While the minimum edit distance discussed in this article provides a good list of possibly correct words, the English dictionary contains far too many words to compute the edit distance between a misspelled word and every one of them. To narrow down the list of candidate words, typical IR and NLP systems use the k-gram overlap.
K-grams are k-length substrings of a string, that is, runs of k consecutive characters. Here, k can be 1, 2, 3 and so on. For k=1, each resulting substring is called a “unigram”; for k=2, a “bigram”; and for k=3, a “trigram”. These are the most widely used k-grams for spelling correction, but the best value of k depends on the situation and context.
As an example, consider the string “catastrophic”. In this case,
- Unigrams: [“c”, “a”, “t”, “a”, “s”, “t”, “r”, “o”, “p”, “h”, “i”, “c”]
- Bigrams: [“ca”, “at”, “ta”, “as”, “st”, “tr”, “ro”, “op”, “ph”, “hi”, “ic”]
- Trigrams: [“cat”, “ata”, “tas”, “ast”, “str”, “tro”, “rop”, “oph”, “phi”, “hic”]
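The lists above can be reproduced with a short Python sketch. The function name `kgrams` is only an illustrative choice; it slides a window of width k over the string and collects each substring.

```python
def kgrams(term, k):
    """Return the k-length substrings of term, in left-to-right order."""
    # A string of length n has n - k + 1 windows of width k (none if n < k).
    return [term[i:i + k] for i in range(len(term) - k + 1)]


word = "catastrophic"
print(kgrams(word, 1))  # unigrams
print(kgrams(word, 2))  # bigrams
print(kgrams(word, 3))  # trigrams
```

A string shorter than k simply yields an empty list.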
A k-gram index maps a k-gram to a postings list of all possible vocabulary terms that contain it. The figure below shows the k-gram postings list corresponding to the bigram “ur”.
It is noteworthy that the postings list is sorted alphabetically.
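As a rough illustration, a k-gram index can be built as a dictionary from each k-gram to a sorted list of terms. This is a minimal sketch, not a production index: the function name `build_kgram_index` and the five-word vocabulary are assumptions made for the example.

```python
from collections import defaultdict


def build_kgram_index(vocabulary, k=2):
    """Map every k-gram to the alphabetically sorted terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for i in range(len(term) - k + 1):
            index[term[i:i + k]].add(term)
    # Sorting each postings list keeps it in alphabetical order.
    return {gram: sorted(terms) for gram, terms in index.items()}


# Hypothetical toy vocabulary; a real one would hold thousands of terms.
vocabulary = ["turn", "fur", "blur", "curl", "burn"]
index = build_kgram_index(vocabulary)
print(index["ur"])
```

Using a set while building the index ensures a term is listed only once per k-gram, even if the k-gram occurs in it several times.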
While creating the candidate list of possible corrected words, we can use the “k-gram overlap” to find the most likely corrections.
Consider the misspelt word: “appe”. Its bigrams are “ap”, “pp” and “pe”, and the postings lists for these bigrams are shown below. Note that these are only sample subsets of the postings lists; the actual postings lists would, of course, contain thousands of words.
To measure the k-gram overlap between the misspelt word and a candidate term, we use the Jaccard coefficient:
J(A, B) = |A ∩ B| / |A ∪ B|
Here, A and B are two sets: A is the set of bigrams of the misspelt word and B is the set of bigrams of the candidate correction.
Now, consider some candidate terms for spelling correction, namely “ape” and “apple”.
To find the Jaccard coefficient, simply scan through the postings lists of all bigrams of “appe” and count the instances where “ape” appears.
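In the first postings list (“ap”), “ape” appears 1 time. In the second postings list (“pp”), “ape” appears 0 times. In the third postings list (“pe”), “ape” appears 1 time. Therefore, |A ∩ B| = 2. Now, the no. of bigrams in “appe” is 3, and the no. of bigrams in “ape” is 2. Therefore, |A ∪ B| = 3 + 2 - 2 = 3.
J(A, B) = 2/3 = 0.67.
Similarly, “apple” appears in the postings lists for “ap” and “pp”, but not in the list for “pe”, because “pe” is not a bigram of “apple”. Therefore, |A ∩ B| = 2. Now, the no. of bigrams in “appe” is 3, and the no. of bigrams in “apple” is 4. Therefore, |A ∪ B| = 3 + 4 - 2 = 5.
J(A, B) = 2/5 = 0.40.
This suggests that “ape” has the higher bigram overlap with “appe”, although both candidates share two of its three bigrams. Practically, this method is used to filter out unlikely corrections; the candidates that survive can then be compared by edit distance.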
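The Jaccard computation for this example can be checked with a small Python sketch. It works directly on the sets of bigrams instead of scanning postings lists; the helper names `bigrams` and `jaccard` are illustrative assumptions.

```python
def bigrams(term):
    """Return the set of distinct bigrams of term."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


query = "appe"
for candidate in ("ape", "apple"):
    score = jaccard(bigrams(query), bigrams(candidate))
    print(candidate, sorted(bigrams(candidate)), round(score, 2))
```

Printing the bigram set of each candidate makes it easy to see which bigrams are shared with the misspelt word.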
The steps involved for spelling correction are:
- Find the k-grams of the misspelled word.
- For each k-gram, linearly scan through the postings list in the k-gram index.
- Compute the k-gram overlap (Jaccard coefficient) for every term encountered. The hit counts it needs are collected during the same linear scan, so no additional pass over the postings lists is required.
- Return the terms with the maximum k-gram overlap.
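The steps above can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the names `kgram_set`, `build_index` and `suggest`, the toy vocabulary, and the `threshold` cut-off of 0.3 are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict


def kgram_set(term, k=2):
    """Return the set of distinct k-grams of term."""
    return {term[i:i + k] for i in range(len(term) - k + 1)}


def build_index(vocabulary, k=2):
    """Map each k-gram to an alphabetically sorted postings list."""
    index = defaultdict(list)
    for term in sorted(vocabulary):
        for gram in kgram_set(term, k):
            index[gram].append(term)
    return index


def suggest(word, index, k=2, threshold=0.3):
    """Return candidate terms ordered by decreasing Jaccard coefficient."""
    query_grams = kgram_set(word, k)
    hits = Counter()
    # Linear scan: count how many of the query's postings lists hold each term.
    for gram in query_grams:
        for term in index.get(gram, []):
            hits[term] += 1
    scored = []
    for term, overlap in hits.items():
        # |A ∪ B| = |A| + |B| - |A ∩ B|, using the hit count as |A ∩ B|.
        union = len(query_grams) + len(kgram_set(term, k)) - overlap
        score = overlap / union
        if score >= threshold:  # hypothetical cut-off for unlikely terms
            scored.append((score, term))
    # Highest score first; ties are broken alphabetically.
    return [term for score, term in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical toy vocabulary.
vocabulary = ["ape", "apple", "apply", "map", "pepper", "zebra"]
index = build_index(vocabulary)
print(suggest("appe", index))
```

Terms that share no bigram with the misspelt word, such as “zebra” here, never appear in any scanned postings list and so are never scored, which is what keeps the candidate list small.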