With the help of NLTK's MWETokenizer class (from the nltk.tokenize.mwe module), we can merge multi-word expressions in an already tokenized text into single tokens. The words of each registered expression are joined with an underscore by default. Remember that matching is case sensitive.
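Syntax : MWETokenizer.tokenize(text), where text is a list of tokens
Return : Return the list of tokens, with each multi-word expression that was registered beforehand merged into one token.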
Example #1 :
In this example we use the MWETokenizer.tokenize() method to merge the token sequences that were registered when the tokenizer was created. Since the method takes a list of tokens, the string is split first. More expressions can be registered later with the tokenizer.add_mwe() method.
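from nltk.tokenize import MWETokenizer
# Register the multi-word expressions as tuples of tokens
tk = MWETokenizer([('g', 'f', 'g'), ('geeks', 'for', 'geeks')])
gfg = "geeks for geeks g f g"
# tokenize() expects a list of tokens, so split the string first
geek = tk.tokenize(gfg.split())
print(geek)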
Output :
['geeks_for_geeks', 'g_f_g']
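The case-sensitivity note and the underscore default can be checked with a short sketch. It assumes the tokens are lowercased by hand before tokenizing (one possible workaround, not something the tokenizer does for you) and uses the separator argument of the MWETokenizer constructor to join with a hyphen instead. The variable names are only illustrative.

```python
from nltk.tokenize import MWETokenizer

tk = MWETokenizer([('geeks', 'for', 'geeks')])
tokens = "Geeks for Geeks".split()

# Capitalised tokens do not match the lowercase expression
as_is = tk.tokenize(tokens)

# Lowercasing the tokens first lets the expression match
lowered = tk.tokenize([t.lower() for t in tokens])

# A different separator can be passed when the tokenizer is created
tk_hyphen = MWETokenizer([('geeks', 'for', 'geeks')], separator='-')
hyphenated = tk_hyphen.tokenize("geeks for geeks".split())

print(as_is)
print(lowered)
print(hyphenated)
```

Here as_is keeps the three original tokens, while lowered and hyphenated each contain a single merged token.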
Example #2 :
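from nltk.tokenize import MWETokenizer
tk = MWETokenizer([('g', 'f', 'g'), ('geeks', 'for', 'geeks')])
# Register one more expression after the tokenizer has been created
tk.add_mwe(('who', 'are', 'you'))
gfg = "who are you at geeks for geeks"
# Tokens that are not part of any registered expression stay unchanged
geek = tk.tokenize(gfg.split())
print(geek)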
Output :
['who_are_you', 'at', 'geeks_for_geeks']
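To make the merging step in the examples above easier to picture, here is a simplified pure-Python sketch of the idea. It is not NLTK's implementation, which may differ in details. The function name merge_mwes and its parameters are hypothetical, and it assumes that at each position the longest matching expression should win.

```python
def merge_mwes(tokens, mwes, separator="_"):
    """Merge each registered expression found in tokens into one token."""
    # Try longer expressions first so the longest match is preferred
    ordered = sorted((tuple(m) for m in mwes), key=len, reverse=True)
    result = []
    i = 0
    while i < len(tokens):
        for mwe in ordered:
            # Exact comparison, so matching is case sensitive
            if mwe and tuple(tokens[i:i + len(mwe)]) == mwe:
                result.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts at this position, keep the token as it is
            result.append(tokens[i])
            i += 1
    return result


tokens = "who are you at geeks for geeks".split()
merged = merge_mwes(tokens, [("who", "are", "you"), ("geeks", "for", "geeks")])
print(merged)  # ['who_are_you', 'at', 'geeks_for_geeks']
```

The sketch scans the expression list at every position, which is fine for a handful of expressions but slow for large lists.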