
TextaCy module in Python

Last Updated : 01 Mar, 2023

In this article, we introduce the textacy module in Python, which is used to perform a variety of NLP tasks on text. It is built on top of the spaCy library.

Some of the features of the TextaCy module are as follows:

  • It provides text cleaning and preprocessing: replacing or removing punctuation, extra whitespace, numbers, etc. from the text before processing it with spaCy.
  • It includes automatic language detection, and it can tokenize and vectorize documents and then train and interpret topic models.
  • Custom extensions can be added to extend the main functionality of spaCy for working with one or more documents.
  • It can load prepared datasets that contain both text content and metadata, such as Reddit comments, Congressional speeches, and historical books.
  • It can extract features such as n-grams, entities, acronyms, keyphrases, and SVO triples as structured data from processed documents.
  • Strings and sequences can be compared using a variety of similarity metrics.
  • It calculates text readability and lexical variety statistics, such as the Type-Token Ratio, Multilingual Flesch Reading Ease, and Flesch-Kincaid Grade Level.
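To make the last item concrete, here is a minimal standard-library sketch of the Type-Token Ratio: the number of distinct word forms divided by the total number of words. The function name and the simple regex tokenizer are assumptions made for illustration; textacy works on spaCy tokens, so its numbers may differ.

```python
import re


def type_token_ratio(text: str) -> float:
    """Distinct word forms divided by total words (case-insensitive)."""
    # Very rough tokenizer: runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sample = "the cat sat on the mat and the dog sat too"
# 8 distinct words out of 11 in total
print(round(type_token_ratio(sample), 2))
```

A value close to 1 means few words are repeated; a lower value means more repetition.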

Installation of the textacy module:

We can install textacy using pip.

pip install textacy

With conda, use the following command:

conda install -c conda-forge textacy

Examples of some of its features:

Here we will look at some notable features of the textacy module.

Remove Punctuation

Using the preprocessing subpackage of textacy, we can easily remove punctuation from our text.

Python3




from textacy import preprocessing
 
 
ex = """
Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
 
# Remove Punctuation
rm_punc = preprocessing.remove.punctuation(ex)
 
print(rm_punc)


The sample text is the opening of Shakespeare's Richard III. We first import the preprocessing subpackage of textacy and then call the punctuation function from its remove module. As the output shows, each punctuation mark is replaced by a space rather than deleted, so lour'd becomes lour d.
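The replace-with-a-space behaviour can be reproduced with a rough standard-library sketch. This is only an approximation for illustration: it covers ASCII punctuation from string.punctuation, while textacy's own definition of punctuation is likely broader, and the helper name is hypothetical.

```python
import string

# Map every ASCII punctuation character to a single space.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def strip_punctuation(text: str) -> str:
    """Replace each ASCII punctuation character with a space."""
    return text.translate(_PUNCT_TO_SPACE)


line = "I, that am rudely stamp'd, and want love's majesty"
print(strip_punctuation(line))
# prints: I  that am rudely stamp d  and want love s majesty
```

Because each mark becomes a space, a comma followed by a space leaves two spaces behind, which is why whitespace normalization is usually applied afterwards.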

Output:

Now is the winter of our discontent
Made glorious summer by this sun of York 
And all the clouds that lour d upon our house
In the deep bosom of the ocean buried 
Now are our brows bound with victorious wreaths 
Our bruised arms hung up for monuments 
Our stern alarums changed to merry meetings 
Our dreadful marches to delightful measures 
Grim visaged war hath smooth d his wrinkled front 
And now  instead of mounting barded steeds
To fright the souls of fearful adversaries
He capers nimbly in a lady s chamber
To the lascivious pleasing of a lute
But I  that am not shaped for sportive tricks
Nor made to court an amorous looking glass
I  that am rudely stamp d  and want love s majesty
To strut before a wanton ambling nymph
I  that am curtail d of this fair proportion

Remove unnecessary Whitespace

We can also remove unnecessary whitespace from our text. Runs of extra spaces are collapsed so that a single space separates the words.

Python3




from textacy import preprocessing
 
 
ex = """
Now is the winter of our      discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the         deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern        alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of       mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I,       that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut        before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
 
# Remove Whitespace
rm_wsp = preprocessing.normalize.whitespace(ex)
 
print(rm_wsp)


Here we used the whitespace function from the normalize module to collapse the extra whitespace.

Output:

In the output, all the excess whitespace has been removed, but the punctuation is still there. To remove that too, we can combine both operations.

Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
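For readers who want to see the idea without textacy installed, the following is a small standard-library sketch of whitespace normalization. It assumes the behaviour visible in the output above (runs of spaces become one space, line breaks are kept) and is not textacy's actual implementation; the function name is hypothetical.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse repeated line breaks and repeated spaces, then trim."""
    # Collapse runs of line breaks into a single newline.
    text = re.sub(r"(?:\r\n|\r|\n)+", "\n", text)
    # Collapse runs of any other whitespace into a single space.
    text = re.sub(r"[^\S\n]+", " ", text)
    return text.strip()


messy = "Now is the winter of our      discontent\n\n\nMade glorious   summer"
print(normalize_whitespace(messy))
```

Keeping the newline separate from other whitespace is what preserves the line structure of the verse.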

Removing Punctuation and Whitespace together

Python3




from textacy import preprocessing
 
 
ex = """
Now is the winter of our      discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the         deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern        alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of       mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I,       that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut        before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
 
# Remove Punctuation
rm_punc = preprocessing.remove.punctuation(ex)
 
# Normalize whitespace in the punctuation-free text
rm_all = preprocessing.normalize.whitespace(rm_punc)
 
print(rm_all)


Output:

Now is the winter of our discontent
Made glorious summer by this sun of York
And all the clouds that lour d upon our house
In the deep bosom of the ocean buried
Now are our brows bound with victorious wreaths
Our bruised arms hung up for monuments
Our stern alarums changed to merry meetings
Our dreadful marches to delightful measures
Grim visaged war hath smooth d his wrinkled front
And now instead of mounting barded steeds
To fright the souls of fearful adversaries
He capers nimbly in a lady s chamber
To the lascivious pleasing of a lute
But I that am not shaped for sportive tricks
Nor made to court an amorous looking glass
I that am rudely stamp d and want love s majesty
To strut before a wanton ambling nymph
I that am curtail d of this fair proportion
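Chaining cleaning steps by hand gets repetitive as more steps are added. One common pattern is to compose the steps into a single callable. The sketch below uses only the standard library; compose_steps and the two helper functions are hypothetical stand-ins for the textacy calls above. Recent textacy releases document a similar helper, preprocessing.make_pipeline, but check that your installed version provides it.

```python
import re
import string
from functools import reduce
from typing import Callable

_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def strip_punctuation(text: str) -> str:
    # Stand-in for a punctuation remover: ASCII punctuation -> space.
    return text.translate(_PUNCT_TO_SPACE)


def collapse_spaces(text: str) -> str:
    # Stand-in for whitespace normalization: runs of spaces -> one space.
    return re.sub(r"[^\S\n]+", " ", text).strip()


def compose_steps(*funcs: Callable[[str], str]) -> Callable[[str], str]:
    """Return a function that applies funcs to a string, left to right."""
    return lambda text: reduce(lambda acc, func: func(acc), funcs, text)


clean = compose_steps(strip_punctuation, collapse_spaces)
print(clean("But I,       that am not shaped for sportive tricks,"))
# prints: But I that am not shaped for sportive tricks
```

The order matters: punctuation is removed first, so the spaces it leaves behind are collapsed by the second step.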

Partition a text

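Sometimes the text we receive is raw, meaning unstructured and messy, so in the preprocessing stage we may need to clean it up and then partition it by some criterion. Here we clean the text and then use keyword_in_context from the extract module to pull out every occurrence of a keyword together with the text on either side of it.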

Python3




from textacy import preprocessing
from textacy import extract
 
 
ex = """
Now is the winter of our      discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the         deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern        alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of       mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I,       that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut        before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
 
# Remove Punctuation
rm_punc = preprocessing.remove.punctuation(ex)
 
# Normalize whitespace in the punctuation-free text
rm_all = preprocessing.normalize.whitespace(rm_punc)
 
# Extracting text
ext = list(extract.keyword_in_context(
    rm_all, 'I', window_width=20, pad_context=True))
 
print(ext)


Output:

The output is long because the keyword 'I' is matched case-insensitively and anywhere in the text, so every letter i inside a word counts as a match, not just the pronoun. Since the input had already been stripped of punctuation and extra whitespace, none appears in the context strings. The runs of blanks at the start and end of the list come from pad_context=True, which pads each context string to window_width characters. Line breaks, which normalization keeps, show up as \n.

[('                Now ', 'i', 's the winter of our '), 
('        Now is the w', 'i', 'nter of our disconte'), 
(' the winter of our d', 'i', 'scontent\nMade glorio'), 
('discontent\nMade glor', 'i', 'ous summer by this s'), 
('lorious summer by th', 'i', 's sun of York \nAnd a'), 
('ur d upon our house\n', 'I', 'n the deep bosom of '), 
('som of the ocean bur', 'i', 'ed \nNow are our brow'), 
('re our brows bound w', 'i', 'th victorious wreath'), 
('r brows bound with v', 'i', 'ctorious wreaths \nOu'), 
('ws bound with victor', 'i', 'ous wreaths \nOur bru'), 
('ous wreaths \nOur bru', 'i', 'sed arms hung up for'), 
('hanged to merry meet', 'i', 'ngs \nOur dreadful ma'), 
('adful marches to del', 'i', 'ghtful measures \nGri'), 
('ightful measures \nGr', 'i', 'm visaged war hath s'), 
('ful measures \nGrim v', 'i', 'saged war hath smoot'), 
(' war hath smooth d h', 'i', 's wrinkled front \nAn'), 
('hath smooth d his wr', 'i', 'nkled front \nAnd now'), 
('kled front \nAnd now ', 'i', 'nstead of mounting b'), 
('now instead of mount', 'i', 'ng barded steeds\nTo '), 
(' barded steeds\nTo fr', 'i', 'ght the souls of fea'), 
(' of fearful adversar', 'i', 'es \nHe capers nimbly'), 
('rsaries \nHe capers n', 'i', 'mbly in a lady s cha'), 
('s \nHe capers nimbly ', 'i', 'n a lady s chamber\nT'), 
(' chamber\nTo the lasc', 'i', 'vious pleasing of a '), 
('hamber\nTo the lasciv', 'i', 'ous pleasing of a lu'), 
('the lascivious pleas', 'i', 'ng of a lute \nBut I '), 
('sing of a lute \nBut ', 'I', ' that am not shaped '), 
('not shaped for sport', 'i', 've tricks \nNor made '), 
('aped for sportive tr', 'i', 'cks \nNor made to cou'), 
('ourt an amorous look', 'i', 'ng glass \nI that am '), 
('rous looking glass \n', 'I', ' that am rudely stam'), 
('before a wanton ambl', 'i', 'ng nymph \nI that am '), 
('nton ambling nymph \n', 'I', ' that am curtail d o'), 
('mph \nI that am curta', 'i', 'l d of this fair pro'), 
('t am curtail d of th', 'i', 's fair proportion   '), 
('curtail d of this fa', 'i', 'r proportion        '), 
('of this fair proport', 'i', 'on                  ')]

The section below shows the start of the result when punctuation and whitespace are not removed first. The full output is omitted because it is long and cluttered with punctuation and runs of spaces.

[('               \nNow ', 'i', 's the winter of our '), 
('       \nNow is the w', 'i', 'nter of our      dis'), 
('winter of our      d', 'i', 'scontent\nMade glorio'), 
('discontent\nMade glor', 'i', 'ous summer by this s'), 
('lorious summer by th', 'i', 's sun of York;\nAnd a'), 
("ur'd upon our house\n", 'I', 'n the         deep b'), 
('som of the ocean bur', 'i', 'ed.\nNow are our brow').......]
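The substring matches seen above can be reproduced, and avoided, with a small standard-library sketch of keyword-in-context. It assumes the keyword is treated as a regular expression and matched case-insensitively by default, which is what the output above suggests. The kwic function here is a hypothetical illustration, and whether the same pattern works unchanged with extract.keyword_in_context depends on the installed textacy version.

```python
import re
from typing import Iterator, Tuple


def kwic(text: str, pattern: str, *, ignore_case: bool = True,
         window: int = 20) -> Iterator[Tuple[str, str, str]]:
    """Yield (left context, match, right context) for each regex match."""
    flags = re.IGNORECASE if ignore_case else 0
    for m in re.finditer(pattern, text, flags=flags):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        yield left, m.group(), right


text = "But I that am not shaped for sportive tricks"

# Loose search: matches the pronoun and every "i" inside a word.
loose = list(kwic(text, "I"))

# Strict search: word boundaries plus case sensitivity keep only the pronoun.
strict = list(kwic(text, r"\bI\b", ignore_case=False))

print(len(loose), len(strict))  # prints: 3 1
print(strict[0])
```

Anchoring the keyword with \b and turning off case-insensitive matching is usually enough to restrict the results to whole words.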

Replace URLs from text with other text

We can find any URLs in our text and replace them with some other text:

Python3




from textacy import preprocessing
 
 
# Replace URLs
txt = "https://www.geeksforgeeks.org/ is the best place to learn anything"
rm_url = preprocessing.replace.urls(txt,"GeeksforGeeks")
 
print(rm_url)


Output:

 

Replace emails with other text

Python3




from textacy import preprocessing
 
# Replace Emails
mail = "Send me a mail in the following address - example@gmail.com"
rm_mail = preprocessing.replace.emails(mail,"UserMail")
 
print(rm_mail)


Output:

 

Replace phone number

Python3




from textacy import preprocessing
 
# Replace phone number
num = "Call me at 12345678910"
rm_num = preprocessing.replace.phone_numbers(num,"NUM")
 
print(rm_num)


Output:

 

If the text contains more than one phone number, each of them is replaced with NUM.

Python3




from textacy import preprocessing
 
# Replace phone number
num = "Call me at 12345678910 or 7896451235"
rm_num = preprocessing.replace.phone_numbers(num,"NUM")
 
print(rm_num)


Output:

 

Replace any number

Python3




from textacy import preprocessing
 
# Replace Number
n = "Any number like 12 or 86 , maybe 100 etc"
rm_n = preprocessing.replace.numbers(n,"Numbers")
 
print(rm_n)


Output:

 
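The replace functions in this section share one idea: find every span that matches a pattern and substitute a placeholder string. The sketch below shows that idea with the standard library only. The two regular expressions are deliberately simplified assumptions and far less thorough than the patterns a library like textacy would use; the function names and default placeholders are hypothetical.

```python
import re

# Simplified patterns, for illustration only.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def replace_urls(text: str, repl: str = "_URL_") -> str:
    """Replace every URL-like span with repl."""
    return URL_RE.sub(repl, text)


def replace_emails(text: str, repl: str = "_EMAIL_") -> str:
    """Replace every email-like span with repl."""
    return EMAIL_RE.sub(repl, text)


txt = "https://www.geeksforgeeks.org/ is the best place to learn anything"
mail = "Send me a mail in the following address - example@gmail.com"

print(replace_urls(txt, "GeeksforGeeks"))
print(replace_emails(mail, "UserMail"))
```

With this sketch, every match is replaced, so a text containing several URLs or addresses ends up with one placeholder per occurrence.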

Remove bracketed text along with the brackets

Python3




from textacy import preprocessing
 
txt = """Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde
& Informatica (CWI) in the Netherlands
as a successor to the ABC programming language, which was inspired by SETL,
capable of exception handling (from the start plus new capabilities in Python 3.11)"""
 
print(preprocessing.remove.brackets(txt))


Output:

Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde 
& Informatica  in the Netherlands
as a successor to the ABC programming language, which was inspired by SETL,
capable of exception handling

 

We can also pass a keyword argument called only with a list of the bracket types we want removed. It supports three values: square, curly, and round.

Python3




from textacy import preprocessing
 
txt = """Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde
& Informatica (CWI) in the Netherlands
as a successor to the [ABC programming language], which was inspired by SETL,
capable of exception handling {from the start plus new capabilities in Python 3.11}"""
 
print(preprocessing.remove.brackets(txt,only=["round","square"]))


Output:

Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde 
& Informatica  in the Netherlands
as a successor to the , which was inspired by SETL,
capable of exception handling {from the start plus new capabilities in Python 3.11}
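The selective removal above can be sketched with one regular expression per bracket type. This is a standard-library illustration under stated assumptions, not textacy's implementation: the function and dictionary names are hypothetical, and the patterns handle only non-nested brackets.

```python
import re
from typing import Iterable, Optional

# One pattern per bracket type; each matches a non-nested bracketed span.
BRACKET_PATTERNS = {
    "round": re.compile(r"\([^()]*\)"),
    "square": re.compile(r"\[[^\[\]]*\]"),
    "curly": re.compile(r"\{[^{}]*\}"),
}


def remove_brackets(text: str, only: Optional[Iterable[str]] = None) -> str:
    """Delete bracketed spans; restrict to the types listed in only, if given."""
    kinds = BRACKET_PATTERNS.keys() if only is None else only
    for kind in kinds:
        text = BRACKET_PATTERNS[kind].sub("", text)
    return text


sample = "Informatica (CWI) wrote [ABC] docs {draft}"
print(remove_brackets(sample, only=["round", "square"]))
# prints: Informatica  wrote  docs {draft}
```

As in the textacy output, the spaces around a removed span are left in place, so a whitespace normalization step afterwards tidies the result.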

 


