Tokenization Using the spaCy Library
Last Updated :
16 Sep, 2022
Before moving to the explanation of tokenization, let's first discuss what spaCy is. spaCy is an open-source library for Natural Language Processing (NLP). It is an object-oriented library used to preprocess text and sentences, and to extract information from text using its modules and functions.
Tokenization is the process of splitting a text or a sentence into segments called tokens. It is the first step of text preprocessing, and its output serves as input for subsequent steps such as text classification and lemmatization.
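To see what the tokenizer actually does, here is a minimal sketch (the sample sentence is our own) comparing a naive whitespace split with spaCy's tokenizer. Note how spaCy separates punctuation and contractions into their own tokens:

```python
import spacy

# A blank pipeline provides only the tokenizer, no trained components
nlp = spacy.blank("en")

text = "Let's visit geeksforgeeks.org, it's great!"

# Naive whitespace split: punctuation stays glued to words
naive = text.split()

# spaCy's tokenizer: punctuation and contractions become separate tokens
tokens = [token.text for token in nlp(text)]

print(naive)
print(tokens)
```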
Process followed to convert text into tokens
Creating a blank language object gives us a tokenizer and an empty pipeline, to which we can later add components (modules) alongside the tokenizer.
Intermediate steps for tokenization
Below is the implementation:
Python
import spacy

# Create a blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")

doc = nlp("GeeksforGeeks is a one stop learning destination for geeks.")

for token in doc:
    print(token)
Output:
GeeksforGeeks
is
a
one
stop
learning
destination
for
geeks
.
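Even with a blank pipeline, each token exposes useful attributes such as its text, its character offset in the original string, and whether it is alphabetic or punctuation. A small illustrative sketch, reusing the sentence above:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("GeeksforGeeks is a one stop learning destination for geeks.")

# token.idx is the character offset; is_alpha / is_punct are boolean flags
for token in doc:
    print(token.text, token.idx, token.is_alpha, token.is_punct)
```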
We can also add functionality to tokens by loading other pipeline components using spacy.load().
Python3
# Load a pretrained English pipeline and inspect its components
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
Output:
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
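Instead of loading a full trained pipeline, components can also be added one at a time to a blank pipeline with nlp.add_pipe(). A minimal sketch using spaCy's rule-based sentencizer, which needs no trained model (the sample text is our own):

```python
import spacy

nlp = spacy.blank("en")
print(nlp.pipe_names)  # a blank pipeline starts with no components

# Add a rule-based sentence splitter to the pipeline
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

doc = nlp("GeeksforGeeks is great. It has many tutorials.")
print([sent.text for sent in doc.sents])
```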
Here is an example showing what additional functionality can be gained by adding modules to the pipeline.
Python
import spacy

# Load a small English pipeline with tagger, lemmatizer, etc.
nlp = spacy.load("en_core_web_sm")

doc = nlp("If you want to be an excellent programmer, "
          "be consistent to practice daily on GFG.")

for token in doc:
    # spacy.explain() expands the POS tag into a readable description
    print(token, "|", spacy.explain(token.pos_), "|", token.lemma_)
Output:
If | subordinating conjunction | if
you | pronoun | you
want | verb | want
to | particle | to
be | auxiliary | be
an | determiner | an
excellent | adjective | excellent
programmer | noun | programmer
, | punctuation | ,
be | auxiliary | be
consistent | adjective | consistent
to | particle | to
practice | verb | practice
daily | adverb | daily
on | adposition | on
GFG | proper noun | GFG
. | punctuation | .
In the above example, we used part-of-speech (POS) tagging and lemmatization (a process that reduces every token to its base form), which produced a POS tag and a lemma for every word. This functionality was not accessible before; it became available only after we loaded our NLP instance with "en_core_web_sm".