Python – Filtering text using Enchant
Enchant is a Python module used to check the spelling of words, suggest corrections for misspelt words, and test whether a word exists in a dictionary.
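As a quick orientation, here is a minimal sketch of that basic spell-checking API (it assumes the pyenchant package and an en_US dictionary are installed):

# import the module
import enchant

# load the English (US) dictionary
d = enchant.Dict("en_US")

# check whether a word exists in the dictionary
print(d.check("Hello"))   # True
print(d.check("Helo"))    # False

# get suggested corrections for a misspelt word
print(d.suggest("Helo"))  # e.g. ['Hello', 'Helot', 'Help', ...]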
Enchant also provides the enchant.tokenize module to tokenize text. Tokenizing means splitting the individual words out of the body of a text. At times, however, not all words should be tokenized: when spell checking, for example, it is customary to ignore email addresses and URLs. This can be achieved by modifying the tokenization process with filters.
Currently implemented filters are:
- EmailFilter
- URLFilter
- WikiWordFilter
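These filters can also be combined by passing more than one of them to get_tokenizer, in the same way a single filter is passed in the examples below. The following is a minimal sketch (the sample text is made up for illustration) that strips both email addresses and URLs from the token stream:

# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import EmailFilter, URLFilter

# sample text containing both an email address and a URL
text = "Contact abc@gmail.com or visit https://www.geeksforgeeks.org"

# get the tokenizer with both filters applied
tokenizer = get_tokenizer("en_US", [EmailFilter, URLFilter])

# only ordinary words survive; the email address and the URL are skipped
print([token for token in tokenizer(text)])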
Example 1: EmailFilter
# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import EmailFilter

# the text to be tokenized
text = "The email is abc@gmail.com"

# get the tokenizer
tokenizer = get_tokenizer("en_US")

# printing tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for words in tokenizer(text):
    token_list.append(words)
print(token_list)

# get the tokenizer with the EmailFilter applied
tokenizer_filter = get_tokenizer("en_US", [EmailFilter])

# printing tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for words in tokenizer_filter(text):
    token_list_filter.append(words)
print(token_list_filter)
Output:
Printing tokens without filtering:
[('The', 0), ('email', 4), ('is', 10), ('abc', 13), ('gmail', 17), ('com', 23)]

Printing tokens after filtering:
[('The', 0), ('email', 4), ('is', 10)]
Example 2: URLFilter
# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import URLFilter

# the text to be tokenized
text = "This is an URL: https://www.geeksforgeeks.org"

# get the tokenizer
tokenizer = get_tokenizer("en_US")

# printing tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for words in tokenizer(text):
    token_list.append(words)
print(token_list)

# get the tokenizer with the URLFilter applied
tokenizer_filter = get_tokenizer("en_US", [URLFilter])

# printing tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for words in tokenizer_filter(text):
    token_list_filter.append(words)
print(token_list_filter)
Output:
Printing tokens without filtering:
[('This', 0), ('is', 5), ('an', 8), ('URL', 11), ('https', 16), ('www', 24), ('geeksforgeeks', 28), ('org', 42)]

Printing tokens after filtering:
[('This', 0), ('is', 5), ('an', 8), ('URL', 11)]
Example 3: WikiWordFilter
A WikiWord is a word which consists of two or more words with initial capitals, run together.
# import the required modules
from enchant.tokenize import get_tokenizer
from enchant.tokenize import WikiWordFilter

# the text to be tokenized
text = "VersionFiveDotThree is an example of WikiWord"

# get the tokenizer
tokenizer = get_tokenizer("en_US")

# printing tokens without filtering
print("Printing tokens without filtering:")
token_list = []
for words in tokenizer(text):
    token_list.append(words)
print(token_list)

# get the tokenizer with the WikiWordFilter applied
tokenizer_filter = get_tokenizer("en_US", [WikiWordFilter])

# printing tokens after filtering
print("\nPrinting tokens after filtering:")
token_list_filter = []
for words in tokenizer_filter(text):
    token_list_filter.append(words)
print(token_list_filter)
Output:
Printing tokens without filtering:
[('VersionFiveDotThree', 0), ('is', 20), ('an', 23), ('example', 26), ('of', 34), ('WikiWord', 37)]

Printing tokens after filtering:
[('is', 20), ('an', 23), ('example', 26), ('of', 34)]
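Beyond the built-in filters, pyenchant lets you define your own filter by subclassing enchant.tokenize.Filter and overriding its _skip method, which should return True for tokens that are to be dropped. The sketch below is a hypothetical example (the AcronymFilter name and the sample text are made up for illustration) that skips all-uppercase tokens such as acronyms:

# import the required modules
from enchant.tokenize import get_tokenizer, Filter

class AcronymFilter(Filter):
    # hypothetical custom filter: skip all-uppercase tokens
    def _skip(self, word):
        return len(word) > 1 and word.isupper()

# sample text containing an acronym
text = "NASA launched the mission"

# get the tokenizer with the custom filter applied
tokenizer = get_tokenizer("en_US", [AcronymFilter])

# 'NASA' is skipped; the remaining words are tokenized as usual
print([token for token in tokenizer(text)])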