Tokenization Using the spaCy Library


Before moving on to tokenization, let’s first discuss what spaCy is. spaCy is a Python library for Natural Language Processing (NLP). It provides an object-oriented interface for preprocessing text and sentences and for extracting information from text through its modules and functions.

Tokenization is the process of splitting a text or a sentence into segments called tokens. It is the first step of text preprocessing, and its output serves as input for subsequent steps such as text classification, lemmatization, etc.

Process followed to convert text into tokens

Creating a blank language object gives us a tokenizer and an empty pipeline, to which we can add modules alongside the tokenizer.

Intermediate steps for tokenization

Below is the implementation:

Python

# First we need to import spacy
import spacy
  
# Creating blank language object then
# tokenizing words of the sentence
nlp = spacy.blank("en")
  
doc = nlp("GeeksforGeeks is a one stop "
          "learning destination for geeks.")
  
for token in doc:
    print(token)

                    

Output:

GeeksforGeeks
is
a
one
stop
learning
destination
for
geeks
.
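Even with a blank pipeline, each token already exposes lexical attributes that are computed by the tokenizer itself, with no trained components. A small sketch, reusing the sentence above:

```python
import spacy

# Blank pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")
doc = nlp("GeeksforGeeks is a one stop learning destination for geeks.")

# Each token carries lexical attributes such as its index in the
# Doc, whether it is alphabetic, and whether it is punctuation.
for token in doc:
    print(token.i, token.text, token.is_alpha, token.is_punct)
```

These attributes are useful for simple rule-based filtering before any model is loaded.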

We can also add functionality to the tokens by loading other modules into the pipeline using spacy.load().

Python3

import spacy

# The pre-trained English pipeline must be downloaded first:
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

print(nlp.pipe_names)

                    

Output:

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
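Components can also be attached to a blank pipeline one at a time with nlp.add_pipe(). A minimal sketch using spaCy's built-in rule-based "sentencizer" component (the example sentence is our own):

```python
import spacy

# Start from a blank English pipeline (tokenizer only) and add
# the built-in rule-based sentence splitter.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

# The sentencizer sets sentence boundaries, so doc.sents now works.
doc = nlp("GeeksforGeeks is a portal for geeks. It hosts articles on many topics.")
for sent in doc.sents:
    print(sent.text)
```

This is a lightweight alternative when you need sentence boundaries but not the full pre-trained pipeline.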

Here is an example showing what other functionality becomes available after adding modules to the pipeline.

Python

import spacy
  
# loading modules to the pipeline.
nlp = spacy.load("en_core_web_sm")
  
# Initialising doc with a sentence.
doc = nlp("If you want to be an excellent programmer, "
          "be consistent to practice daily on GFG.")
  
# Using properties of token i.e. Part of Speech and Lemmatization
for token in doc:
    print(token, " | ",
          spacy.explain(token.pos_),
          " | ", token.lemma_)

                    

Output:

If  |  subordinating conjunction  |  if
you  |  pronoun  |  you
want  |  verb  |  want
to  |  particle  |  to
be  |  auxiliary  |  be
an  |  determiner  |  an
excellent  |  adjective  |  excellent
programmer  |  noun  |  programmer
,  |  punctuation  |  ,
be  |  auxiliary  |  be
consistent  |  adjective  |  consistent
to  |  particle  |  to
practice  |  verb  |  practice
daily  |  adverb  |  daily
on  |  adposition  |  on
GFG  |  proper noun  |  GFG
.  |  punctuation  |  .

In the above example, we used part-of-speech (POS) tagging and lemmatization (a process that reduces every token to its base form) through the NLP pipeline components, printing the POS tag and lemma for every token. This functionality was not accessible before; it becomes available only after loading the NLP instance with the pre-trained pipeline ("en_core_web_sm").
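Tokenization also enables simple preprocessing that needs no trained pipeline at all. For instance, stop words and punctuation can be filtered with a blank pipeline, since spaCy's English language defaults already include a stop-word list. A small sketch:

```python
import spacy

# Blank pipeline: stop-word flags come from the English language
# defaults, so no pre-trained model is required.
nlp = spacy.blank("en")
doc = nlp("GeeksforGeeks is a one stop learning destination for geeks.")

# Keep only tokens that are neither stop words nor punctuation.
filtered = [token.text for token in doc
            if not token.is_stop and not token.is_punct]
print(filtered)
```

Filtering like this is a common first step before feeding tokens into text classification.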



Last Updated : 16 Sep, 2022