While the minimum edit distance discussed in this article provides a good list of possibly correct words, the English dictionary contains far too many words to compute the edit distance between the misspelt word and every one of them. To narrow down the list of candidate words, typical IR and NLP systems use the k-gram overlap.
K-grams are contiguous substrings of length k taken from a string. Here, k can be 1, 2, 3 and so on. For k=1, each resulting substring is called a “unigram”; for k=2, a “bigram”; and for k=3, a “trigram”. These are the most widely used k-grams for spelling correction, but the right value of k depends on the vocabulary and the application.
As an example, consider the string “catastrophic”. In this case,
- Unigrams: [“c”, “a”, “t”, “a”, “s”, “t”, “r”, “o”, “p”, “h”, “i”, “c”]
- Bigrams: [“ca”, “at”, “ta”, “as”, “st”, “tr”, “ro”, “op”, “ph”, “hi”, “ic”]
- Trigrams: [“cat”, “ata”, “tas”, “ast”, “str”, “tro”, “rop”, “oph”, “phi”, “hic”]
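The lists above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `kgrams` is chosen here for illustration and does not come from any particular library.

```python
def kgrams(term: str, k: int) -> list[str]:
    """Return every contiguous substring of length k, in order of appearance."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # A term shorter than k yields an empty list.
    return [term[i:i + k] for i in range(len(term) - k + 1)]


word = "catastrophic"
print(kgrams(word, 1))  # unigrams
print(kgrams(word, 2))  # bigrams
print(kgrams(word, 3))  # trigrams
```

A string of length n has n - k + 1 k-grams, which is why “catastrophic” (12 characters) has 11 bigrams and 10 trigrams.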
A k-gram index maps each k-gram to a postings list of all the vocabulary terms that contain it. The figure below shows the k-gram postings list corresponding to the bigram “ur”.
Note that each postings list is sorted alphabetically.
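One way to build such an index is sketched below. The vocabulary is a made-up sample, picked only so that several terms share the bigram “ur”, and `build_kgram_index` is a hypothetical helper name.

```python
from collections import defaultdict


def build_kgram_index(vocabulary, k=2):
    """Map each k-gram to an alphabetically sorted list of the terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for i in range(len(term) - k + 1):
            index[term[i:i + k]].add(term)
    # Sorting once at build time keeps every postings list in alphabetical order.
    return {gram: sorted(terms) for gram, terms in index.items()}


# Assumed sample vocabulary (not taken from the article).
vocabulary = ["turn", "blur", "curb", "urge", "fur", "apple"]
index = build_kgram_index(vocabulary)
print(index["ur"])  # ['blur', 'curb', 'fur', 'turn', 'urge']
```

Collecting terms in a set first avoids duplicate entries when a term contains the same k-gram more than once.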
While creating the candidate list of possible corrected words, we can use the “k-gram overlap” to find the most likely corrections.
Consider the misspelt word: “appe”. The postings lists for the bigrams contained in it (“ap”, “pp” and “pe”) are shown below. Note that these are only sample subsets of the postings lists; the actual postings lists would, of course, contain thousands of words.
To find the k-gram overlap between the misspelt word and a candidate term, we use the Jaccard coefficient:
J(A, B) = |A ∩ B| / |A ∪ B|
Here, A and B are two sets: A is the set of k-grams of the misspelt word and B is the set of k-grams of the candidate correction.
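As a rough illustration, the Jaccard coefficient of two k-gram sets can be computed directly with Python sets. The helper names `bigrams` and `jaccard` are assumptions made for this sketch, and the word pair is chosen only to show how a large difference in length lowers the score.

```python
def bigrams(term):
    """Return the set of distinct bigrams in a term."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections of k-grams."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# "cat" shares both of its bigrams with "catastrophic", but the union is large.
print(jaccard(bigrams("cat"), bigrams("catastrophic")))
```

Here the intersection has 2 bigrams and the union has 11, so the score is 2/11, or about 0.18, even though every bigram of “cat” occurs in the longer word.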
Now, consider some candidate terms for spelling correction, namely “ape” and “apple”.
To find the Jaccard coefficient, simply scan through the postings lists of all bigrams of “appe” and count the instances where “ape” appears.
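The scan described above can be sketched as follows. The postings lists are small made-up samples, and `count_hits` and `jaccard_from_hits` are hypothetical helper names; the point is that the hit count from the scan is enough to obtain the coefficient.

```python
def bigrams(term):
    """Return the set of distinct bigrams in a term."""
    return {term[i:i + 2] for i in range(len(term) - 1)}


def count_hits(query, candidate, index):
    """Count the query bigrams whose postings list contains the candidate."""
    return sum(1 for gram in bigrams(query) if candidate in index.get(gram, ()))


def jaccard_from_hits(query, candidate, hits):
    """|A ∪ B| = |A| + |B| - |A ∩ B|, where the hit count is |A ∩ B|."""
    return hits / (len(bigrams(query)) + len(bigrams(candidate)) - hits)


# Assumed sample subsets of the postings lists for the bigrams of "appe".
index = {
    "ap": ["ape", "apple", "apply", "map"],
    "pp": ["apple", "apply", "puppy"],
    "pe": ["ape", "open", "pear"],
}

hits = count_hits("appe", "ape", index)
print(hits, jaccard_from_hits("appe", "ape", hits))  # 2 0.6666666666666666
```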
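In the first postings list, “ape” appears 1 time. In the second postings list, “ape” appears 0 times. In the third postings list, “ape” appears 1 time. Therefore, |A ∩ B| = 2. Now, the number of bigrams in “appe” is 3, and the number of bigrams in “ape” is 2. Therefore, |A ∪ B| = 3 + 2 - 2 = 3.
J(A, B) = 2/3 = 0.67.
Similarly, “apple” appears in the first and second postings lists but not in the third, since “apple” does not contain the bigram “pe”. Therefore, |A ∩ B| = 2. Now, the number of bigrams in “appe” is 3, and the number of bigrams in “apple” is 4. Therefore, |A ∪ B| = 3 + 4 - 2 = 5.
J(A, B) = 2/5 = 0.40.
By bigram overlap alone, “ape” therefore scores higher than “apple”, even though both share two bigrams with “appe”. Practically, this method is used to filter out unlikely corrections rather than to pick the final one; the candidates that remain can then be ranked by edit distance.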
The steps involved in spelling correction are:
- Find the k-grams of the misspelt word.
- For each k-gram, linearly scan through its postings list in the k-gram index.
- Compute the k-gram overlap (Jaccard coefficient) of every term encountered, using the hit counts gathered during the scan. This needs no extra pass over the lists.
- Return the terms with the highest k-gram overlap.
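Putting the steps together, a minimal end-to-end sketch might look like this. The vocabulary, the function names and the `threshold` parameter are all assumptions made for illustration; in particular, keeping every term above a fixed threshold is just one possible reading of “highest overlap”.

```python
from collections import defaultdict


def kgrams(term, k=2):
    """Return the set of distinct k-grams in a term."""
    return {term[i:i + k] for i in range(len(term) - k + 1)}


def build_index(vocabulary, k=2):
    """Map each k-gram to a sorted postings list of terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams(term, k):
            index[gram].add(term)
    return {gram: sorted(terms) for gram, terms in index.items()}


def suggest(misspelt, index, k=2, threshold=0.3):
    query_grams = kgrams(misspelt, k)            # step 1: k-grams of the query
    hits = defaultdict(int)
    for gram in query_grams:                     # step 2: scan each postings list
        for term in index.get(gram, []):
            hits[term] += 1
    scored = []
    for term, overlap in hits.items():           # step 3: Jaccard from hit counts
        union = len(query_grams) + len(kgrams(term, k)) - overlap
        scored.append((overlap / union, term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Step 4: keep only terms whose overlap reaches the (assumed) threshold.
    return [(term, round(score, 2)) for score, term in scored if score >= threshold]


# Assumed sample vocabulary (not taken from the article).
vocabulary = ["ape", "apple", "apply", "grape", "pepper", "open", "zebra"]
index = build_index(vocabulary)
for term, score in suggest("appe", index):
    print(term, score)
```

Terms that share no bigram with the query never appear in any scanned postings list, so they are never scored at all, which is where most of the savings over computing edit distance against the whole dictionary come from.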