Tokenization Using the spaCy Library


Before moving on to tokenization, let’s first discuss what spaCy is. spaCy is a Python library for Natural Language Processing (NLP). It provides an object-oriented interface for preprocessing text and sentences and for extracting information from text through its modules and functions.

Tokenization is the process of splitting a text or a sentence into segments called tokens. It is the first step of text preprocessing, and its output serves as input for subsequent steps such as text classification, lemmatization, etc.

Process followed to convert text into tokens

Creating a blank language object gives us a tokenizer and an empty pipeline, to which we can add modules alongside the tokenizer.

Intermediate steps for tokenization

Below is the implementation:

Python

# First we need to import spacy
import spacy
  
# Creating blank language object then
# tokenizing words of the sentence
nlp = spacy.blank("en")
  
doc = nlp("GeeksforGeeks is a one stop "
          "learning destination for geeks.")
  
for token in doc:
    print(token)

                    

Output:

GeeksforGeeks
is
a
one
stop
learning
destination
for
geeks
.
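Even with a blank pipeline, each token already exposes lexical attributes that are computed by the tokenizer itself, with no trained components. A small sketch, reusing the sentence above:

```python
import spacy

# Blank pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")
doc = nlp("GeeksforGeeks is a one stop learning destination for geeks.")

# Each token carries lexical attributes such as its index in the
# Doc, whether it is alphabetic, and whether it is punctuation.
for token in doc:
    print(token.i, token.text, token.is_alpha, token.is_punct)
```

These attributes are useful for simple rule-based filtering before any model is loaded.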

We can also add functionality to the tokens by loading other modules into the pipeline using spacy.load().

Python3

import spacy

# The pre-trained English pipeline must be downloaded first:
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

print(nlp.pipe_names)

                    

Output:

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
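Components can also be attached to a blank pipeline one at a time with nlp.add_pipe(). A minimal sketch using spaCy's built-in rule-based "sentencizer" component (the example sentence is our own):

```python
import spacy

# Start from a blank English pipeline (tokenizer only) and add
# the built-in rule-based sentence splitter.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

# The sentencizer sets sentence boundaries, so doc.sents now works.
doc = nlp("GeeksforGeeks is a portal for geeks. It hosts articles on many topics.")
for sent in doc.sents:
    print(sent.text)
```

This is a lightweight alternative when you need sentence boundaries but not the full pre-trained pipeline.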

Here is an example showing what other functionality becomes available after adding modules to the pipeline.

Python

import spacy
  
# loading modules to the pipeline.
nlp = spacy.load("en_core_web_sm")
  
# Initialising doc with a sentence.
doc = nlp("If you want to be an excellent programmer, "
          "be consistent to practice daily on GFG.")
  
# Using properties of token i.e. Part of Speech and Lemmatization
for token in doc:
    print(token, " | ",
          spacy.explain(token.pos_),
          " | ", token.lemma_)

                    

Output:

If  |  subordinating conjunction  |  if
you  |  pronoun  |  you
want  |  verb  |  want
to  |  particle  |  to
be  |  auxiliary  |  be
an  |  determiner  |  an
excellent  |  adjective  |  excellent
programmer  |  noun  |  programmer
,  |  punctuation  |  ,
be  |  auxiliary  |  be
consistent  |  adjective  |  consistent
to  |  particle  |  to
practice  |  verb  |  practice
daily  |  adverb  |  daily
on  |  adposition  |  on
GFG  |  proper noun  |  GFG
.  |  punctuation  |  .

In the above example, we used part-of-speech (POS) tagging and lemmatization (a process that reduces every token to its base form) through the NLP pipeline components, printing the POS tag and lemma for every token. This functionality was not accessible before; it becomes available only after loading the NLP instance with the pre-trained pipeline ("en_core_web_sm").
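Tokenization also enables simple preprocessing that needs no trained pipeline at all. For instance, stop words and punctuation can be filtered with a blank pipeline, since spaCy's English language defaults already include a stop-word list. A small sketch:

```python
import spacy

# Blank pipeline: stop-word flags come from the English language
# defaults, so no pre-trained model is required.
nlp = spacy.blank("en")
doc = nlp("GeeksforGeeks is a one stop learning destination for geeks.")

# Keep only tokens that are neither stop words nor punctuation.
filtered = [token.text for token in doc
            if not token.is_stop and not token.is_punct]
print(filtered)
```

Filtering like this is a common first step before feeding tokens into text classification.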



Last Updated : 16 Sep, 2022