Which data structure can be used for efficiently building a word dictionary and Spell Checker?
The answer depends upon the functionalists required in Spell Checker and availability of memory. For example following are few possibilities.
Hashing is one simple option for this. We can put all words in a hash table. Refer this paper which compares hashing with self-balancing Binary Search Trees and Skip List, and shows that hashing performs better.
Hashing doesn’t support operations like prefix search. Prefix search is something where a user types a prefix and your dictionary shows all words starting with that prefix. Hashing also doesn’t support efficient printing of all words in dictionary in alphabetical order and nearest neighbor search.
If we want both operations, look up and prefix search, Trie is suited. With Trie, we can support all operations like insert, search, delete in O(n) time where n is length of the word to be processed. Another advantage of Trie is, we can print all words in alphabetical order which is not possible with hashing.
The disadvantage of Trie is, it requires lots of space. If space is concern, then Ternary Search Tree can be preferred. In Ternary Search Tree, time complexity of search operation is O(h) where h is height of the tree. Ternary Search Trees also supports other operations supported by Trie like prefix search, alphabetical order printing and nearest neighbor search.
If we want to support suggestions, like google shows “did you mean …”, then we need to find the closest word in dictionary. The closest word can be defined as the word that can be obtained with minimum number of character transformations (add, delete, replace). A Naive way is to take the given word and generate all words which are 1 distance (1 edit or 1 delete or 1 replace) away and one by one look them in dictionary. If nothing found, then look for all words which are 2 distant and so on. There are many complex algorithms for this. As per the wiki page, The most successful algorithm to date is Andrew Golding and Dan Roth’s Window-based spelling correction algorithm.
See this for a simple spell checker implementation.
This article is compiled by Piyush. Please write comments if you find anything incorrect, or you want to share more information about the topic discussed above.
- Gap Buffer Data Structure
- Tango Tree Data Structure
- Advantages of Trie Data Structure
- Trie Data Structure using smart pointer and OOP in C++
- Design an efficient data structure for given operations
- Ropes Data Structure (Fast String Concatenation)
- Dynamic Disjoint Set Data Structure for large range values
- Design a data structure that supports insert, delete, search and getRandom in constant time
- Implement a Dictionary using Trie
- Word formation using concatenation of two dictionary words
- Check if the given string of words can be formed from words present in the dictionary
- Persistent data structures
- Disjoint Set Data Structures
- Need for Abstract Data Type and ADT Model
- Burrows - Wheeler Data Transform Algorithm