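With the help of the NLTK MWETokenizer class (defined in the nltk.tokenize.mwe module), we can merge multi-word expressions in a list of tokens into single tokens. The MWETokenizer.tokenize() method takes a list of words and binds the words of each declared expression together, using an underscore as the separator by default. Remember that matching is case sensitive.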
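To illustrate the case-sensitivity point, here is a minimal sketch. The sample sentence and the variable names are only illustrative, and lowercasing the words before tokenizing is one possible workaround, not something the tokenizer does for you.

```python
from nltk.tokenize import MWETokenizer

# Declare one multi-word expression, all in lowercase
tk = MWETokenizer([('geeks', 'for', 'geeks')])

# Capitalized words do not match the lowercase expression
mixed_case = tk.tokenize("Geeks for Geeks".split())
print(mixed_case)

# Assumed workaround: lowercase every word before tokenizing
lowered = tk.tokenize([word.lower() for word in "Geeks for Geeks".split()])
print(lowered)
```

In the first call the words are returned unmerged, while the lowercased input is merged into a single token.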
Syntax :
MWETokenizer.tokenize(text), where text is a list of words
Return : Returns a list of tokens in which each declared multi-word expression is merged into a single token.
Example #1 :
In this example we use the MWETokenizer.tokenize() method to bind the tokens of the expressions that were declared when the tokenizer was created. More expressions can be added afterwards with the MWETokenizer.add_mwe() method.
# Import the MWETokenizer class from nltk
from nltk.tokenize import MWETokenizer
# Create a tokenizer with the multi-word expressions to merge
tk = MWETokenizer([('g', 'f', 'g'), ('geeks', 'for', 'geeks')])
# Create a string input
gfg = "geeks for geeks g f g"
# Split the string into words, then merge the declared expressions
geek = tk.tokenize(gfg.split())
print(geek)
Output :
['geeks_for_geeks', 'g_f_g']
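The underscore that joins the words is only the default. As a small sketch, the MWETokenizer constructor also accepts a separator argument. The '+' character and the sample sentence here are arbitrary choices for illustration.

```python
from nltk.tokenize import MWETokenizer

# Same kind of expression as above, but joined with '+' instead of '_'
tk_plus = MWETokenizer([('geeks', 'for', 'geeks')], separator='+')

# Words that are not part of a declared expression pass through unchanged
joined = tk_plus.tokenize("who are you at geeks for geeks".split())
print(joined)
```

Only the declared expression is merged, so the other words stay as separate tokens.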
Example #2 :
# Import the MWETokenizer class from nltk
from nltk.tokenize import MWETokenizer
# Create a tokenizer with the multi-word expressions to merge
tk = MWETokenizer([('g', 'f', 'g'), ('geeks', 'for', 'geeks')])
# Add one more multi-word expression after construction
tk.add_mwe(('who', 'are', 'you'))
# Create a string input
gfg = "who are you at geeks for geeks"
# Split the string into words, then merge the declared expressions
geek = tk.tokenize(gfg.split())
print(geek)
Output :
['who_are_you', 'at', 'geeks_for_geeks']