from enchant.tokenize import get_tokenizer
from enchant.tokenize import EmailFilter

# Sample sentence containing an email address.
text = "The email is abc@gmail.com"

# Tokenize without any filter: the email address gets split into
# its component word tokens ('abc', 'gmail', 'com').
tokenizer = get_tokenizer("en_US")
print("Printing tokens without filtering:")
token_list = [token for token in tokenizer(text)]
print(token_list)

# Tokenize with EmailFilter: tokens belonging to an email address
# are skipped entirely.
tokenizer_filter = get_tokenizer("en_US", [EmailFilter])
print("\nPrinting tokens after filtering:")
token_list_filter = [token for token in tokenizer_filter(text)]
print(token_list_filter)
|
Output :
Printing tokens without filtering:
[('The', 0), ('email', 4), ('is', 10), ('abc', 13), ('gmail', 17), ('com', 23)]
Printing tokens after filtering:
[('The', 0), ('email', 4), ('is', 10)]
Example 2 : URLFilter
from enchant.tokenize import get_tokenizer
from enchant.tokenize import URLFilter

# Fix: the original snippet used `text` without defining it, relying on a
# variable left over from the previous example. Define the sample sentence
# this example's expected output actually corresponds to.
text = "This is an URL https://www.geeksforgeeks.org"

# Tokenize without any filter: the URL is broken up into individual
# word tokens ('https', 'www', 'geeksforgeeks', 'org').
tokenizer = get_tokenizer("en_US")
print("Printing tokens without filtering:")
token_list = []
for words in tokenizer(text):
    token_list.append(words)
print(token_list)

# Tokenize with URLFilter: tokens that are part of a URL are skipped.
tokenizer_filter = get_tokenizer("en_US", [URLFilter])
print("\nPrinting tokens after filtering:")
token_list_filter = []
for words in tokenizer_filter(text):
    token_list_filter.append(words)
print(token_list_filter)
|
Output :
Printing tokens without filtering:
[('This', 0), ('is', 5), ('an', 8), ('URL', 11), ('https', 16), ('www', 24), ('geeksforgeeks', 28), ('org', 42)]
Printing tokens after filtering:
[('This', 0), ('is', 5), ('an', 8), ('URL', 11)]
Example 3 : WikiWordFilter
A WikiWord is a word which consists of two or more words with initial capitals, run together.
from enchant.tokenize import get_tokenizer
from enchant.tokenize import WikiWordFilter

# A WikiWord is two or more capitalized words run together,
# e.g. "VersionFiveDotThree".
text = "VersionFiveDotThree is an example of WikiWord"

# Unfiltered tokenization keeps the WikiWord tokens in the stream.
tokenizer = get_tokenizer("en_US")
print("Printing tokens without filtering:")
token_list = list(tokenizer(text))
print(token_list)

# With WikiWordFilter, WikiWord tokens are dropped from the stream.
tokenizer_filter = get_tokenizer("en_US", [WikiWordFilter])
print("\nPrinting tokens after filtering:")
token_list_filter = list(tokenizer_filter(text))
print(token_list_filter)
|
Output :
Printing tokens without filtering:
[('VersionFiveDotThree', 0), ('is', 20), ('an', 23), ('example', 26), ('of', 34), ('WikiWord', 37)]
Printing tokens after filtering:
[('is', 20), ('an', 23), ('example', 26), ('of', 34)]
Share your thoughts in the comments
Please Login to comment...