Which data structure can be used for efficiently building a word dictionary and Spell Checker?
The answer depends upon the functionalists required in Spell Checker and availability of memory. For example following are few possibilities.
Hashing is one simple option for this. We can put all words in a hash table. Refer this paper which compares hashing with self-balancing Binary Search Trees and Skip List, and shows that hashing performs better.
Hashing doesn’t support operations like prefix search. Prefix search is something where a user types a prefix and your dictionary shows all words starting with that prefix. Hashing also doesn’t support efficient printing of all words in dictionary in alphabetical order and nearest neighbor search.
If we want both operations, look up and prefix search, Trie is suited. With Trie, we can support all operations like insert, search, delete in O(n) time where n is length of the word to be processed. Another advantage of Trie is, we can print all words in alphabetical order which is not possible with hashing.
The disadvantage of Trie is, it requires lots of space. If space is concern, then Ternary Search Tree can be preferred. In Ternary Search Tree, time complexity of search operation is O(h) where h is height of the tree. Ternary Search Trees also supports other operations supported by Trie like prefix search, alphabetical order printing and nearest neighbor search.
If we want to support suggestions, like google shows “did you mean …”, then we need to find the closest word in dictionary. The closest word can be defined as the word that can be obtained with minimum number of character transformations (add, delete, replace). A Naive way is to take the given word and generate all words which are 1 distance (1 edit or 1 delete or 1 replace) away and one by one look them in dictionary. If nothing found, then look for all words which are 2 distant and so on. There are many complex algorithms for this. As per the wiki page, The most successful algorithm to date is Andrew Golding and Dan Roth’s Window-based spelling correction algorithm.
See this for a simple spell checker implementation.
This article is compiled by Piyush. Please write comments if you find anything incorrect, or you want to share more information about the topic discussed above.
- Advantages of BST over Hash Table
- Advantages of Trie Data Structure
- Trie | (Insert and Search)
- Ternary Search Tree
- LRU Cache Implementation
- Cartesian Tree
- Design a data structure that supports insert, delete, search and getRandom in constant time
- Segment Tree | Set 2 (Range Maximum Query with Node Update)
- Ordered Set and GNU C++ PBDS
- Count elements which divide all numbers in range L-R