Python NLTK | nltk.tokenize.mwe()

Last Updated : 07 Jun, 2019

With the help of the MWETokenizer class from the nltk.tokenize.mwe module, we can re-tokenize a stream of tokens so that each predefined multi-word expression is merged into a single token, with its parts joined by an underscore by default. Remember that matching is case sensitive.

Syntax : MWETokenizer.tokenize(text), where text is a list of tokens
Return : The list of tokens, with each previously declared multi-word expression merged into a single token.

Example #1 :
In this example we use the MWETokenizer.tokenize() method to merge the multi-word expressions that were defined when the tokenizer was created. More expressions can be added later with the tokenizer.add_mwe() method.

# import the MWETokenizer class from nltk
from nltk.tokenize import MWETokenizer

# Create a tokenizer that knows two multi-word expressions
tk = MWETokenizer([('g', 'f', 'g'), ('geeks', 'for', 'geeks')])

# Create a string input
gfg = "geeks for geeks g f g"

# Split the string into tokens, then merge the known expressions
geek = tk.tokenize(gfg.split())

print(geek)

Output :

['geeks_for_geeks', 'g_f_g']
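
The underscore is only the default joiner. As a minimal sketch, assuming the optional separator argument of the MWETokenizer constructor, the same kind of input can be joined with a different string. The hyphen here is an arbitrary choice for illustration. Tokens that are not part of a registered expression are left untouched.

```python
from nltk.tokenize import MWETokenizer

# separator is an optional constructor argument; '_' is the default
tk = MWETokenizer([('geeks', 'for', 'geeks')], separator='-')

# ('g', 'f', 'g') is not registered here, so those tokens stay separate
tokens = tk.tokenize("geeks for geeks g f g".split())

print(tokens)  # ['geeks-for-geeks', 'g', 'f', 'g']
```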

Example #2 :
In this example we register one more expression with the add_mwe() method after the tokenizer has been created, and then tokenize a sentence that contains it.

# import the MWETokenizer class from nltk
from nltk.tokenize import MWETokenizer

# Create a tokenizer, then register one more multi-word expression
tk = MWETokenizer([('g', 'f', 'g'), ('geeks', 'for', 'geeks')])
tk.add_mwe(('who', 'are', 'you'))

# Create a string input
gfg = "who are you at geeks for geeks"

# Split the string into tokens, then merge the known expressions
geek = tk.tokenize(gfg.split())

print(geek)

Output :

['who_are_you', 'at', 'geeks_for_geeks']
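
The note about case sensitivity can be checked with a short sketch. Capitalised tokens are not merged, because they do not match the lowercase expression exactly. Lowercasing the tokens before calling tokenize() is one possible workaround, shown here as an assumption about what the reader may want rather than a feature of the tokenizer, and it discards the original casing.

```python
from nltk.tokenize import MWETokenizer

tk = MWETokenizer([('geeks', 'for', 'geeks')])

words = "Geeks for Geeks".split()

# Capitalised tokens do not match the lowercase expression
as_is = tk.tokenize(words)

# Lowercase the tokens first so that they match the registered expression
lowered = tk.tokenize([w.lower() for w in words])

print(as_is)    # ['Geeks', 'for', 'Geeks']
print(lowered)  # ['geeks_for_geeks']
```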
