
TextaCy module in Python

In this article, we introduce TextaCy (published on PyPI as textacy), a Python module used to perform a variety of NLP tasks on text. It is built on top of the spaCy library.

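This article covers the following features of the module: removing punctuation, normalizing whitespace, extracting a keyword in context, replacing URLs, emails, phone numbers and other numbers with placeholder text, and removing bracketed text.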

Installation of TextaCy module:

We can install textacy using pip:



pip install textacy

With conda, use the following command instead:

conda install -c conda-forge textacy
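
After installing, it can help to confirm that the package is importable from the current environment. The following is a minimal sketch that uses only the standard library; is_installed is a hypothetical helper name chosen for this example and is not part of textacy.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if `package` can be found on the import path."""
    # find_spec returns None when a top-level package cannot be located.
    return importlib.util.find_spec(package) is not None


# textacy depends on spaCy, so both should be reported as available.
for name in ("textacy", "spacy"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either name is reported as missing, the pip or conda command above was probably run in a different environment from the one executing the script.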

Examples of some of its features:

Here we will look at some of the notable features of the textacy module.



Remove Punctuation

Using the preprocessing subpackage of textacy, we can easily remove punctuation from our text.




from textacy import preprocessing
 
 
ex = """
Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
 
# Remove Punctuation
rm_punc = preprocessing.remove.punctuation(ex)
 
print(rm_punc)

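The text used here is an excerpt from the opening soliloquy of Shakespeare's Richard III. We first import the preprocessing subpackage of textacy and then call the punctuation function from its remove module. Each punctuation mark is replaced with a space rather than deleted, which is why lour'd becomes lour d in the output.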

Output:

Now is the winter of our discontent
Made glorious summer by this sun of York 
And all the clouds that lour d upon our house
In the deep bosom of the ocean buried 
Now are our brows bound with victorious wreaths 
Our bruised arms hung up for monuments 
Our stern alarums changed to merry meetings 
Our dreadful marches to delightful measures 
Grim visaged war hath smooth d his wrinkled front 
And now  instead of mounting barded steeds
To fright the souls of fearful adversaries
He capers nimbly in a lady s chamber
To the lascivious pleasing of a lute
But I  that am not shaped for sportive tricks
Nor made to court an amorous looking glass
I  that am rudely stamp d  and want love s majesty
To strut before a wanton ambling nymph
I  that am curtail d of this fair proportion
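
To see why apostrophes and hyphens turn into spaces, here is a rough standard-library sketch of the same idea. It approximates the behaviour visible in the output above and is not textacy's implementation; strip_punctuation is a hypothetical name used only in this example.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Replace every Unicode punctuation character with a single space."""
    return "".join(
        # Unicode general categories starting with "P" are punctuation.
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )


line = "Grim-visaged war hath smooth'd his wrinkled front;"
print(repr(strip_punctuation(line)))
```

Because each mark is swapped for a space instead of being dropped, words stay separated (Grim visaged) but contractions are split (smooth d), and a trailing space is left where the semicolon was.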

Remove unnecessary Whitespace

We can also remove unnecessary whitespace from our text. Each run of consecutive spaces is reduced to a single space, so exactly one space separates adjacent words.




from textacy import preprocessing
 
 
ex = """
Now is the winter of our      discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the         deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern        alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of       mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I,       that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut        before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
 
# Remove Whitespace
rm_wsp = preprocessing.normalize.whitespace(ex)
 
print(rm_wsp)

Here we used the whitespace function from the normalize module to remove the extra whitespace.

Output:

In the output, all the excess whitespace has been removed, but the punctuation is still there. To remove that as well, we can combine both operations.

Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
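
The effect shown above can be approximated with two regular-expression substitutions. This is a simplified sketch under the assumption that only ordinary spaces, tabs and newlines occur in the text; collapse_whitespace is a hypothetical name and the function is not textacy's own code.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Collapse space/tab runs to one space and newline runs to one newline."""
    text = re.sub(r"[^\S\n]+", " ", text)  # any whitespace except a newline
    text = re.sub(r"\n+", "\n", text)      # consecutive line breaks
    return text.strip()                     # drop leading/trailing whitespace


sample = "Now is the winter of our      discontent\n\n\nMade glorious   summer"
print(collapse_whitespace(sample))
```

Line breaks are kept so that the verse structure survives; only the padding inside and between lines is reduced.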

Removing Punctuation and Whitespace together




from textacy import preprocessing
 
 
ex = """
Now is the winter of our      discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the         deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern        alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of       mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I,       that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut        before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
 
# Remove punctuation (each mark is replaced with a space)
rm_punc = preprocessing.remove.punctuation(ex)
 
# Normalize the whitespace left behind by the previous step
rm_all = preprocessing.normalize.whitespace(rm_punc)
 
print(rm_all)

Output:

Now is the winter of our discontent
Made glorious summer by this sun of York
And all the clouds that lour d upon our house
In the deep bosom of the ocean buried
Now are our brows bound with victorious wreaths
Our bruised arms hung up for monuments
Our stern alarums changed to merry meetings
Our dreadful marches to delightful measures
Grim visaged war hath smooth d his wrinkled front
And now instead of mounting barded steeds
To fright the souls of fearful adversaries
He capers nimbly in a lady s chamber
To the lascivious pleasing of a lute
But I that am not shaped for sportive tricks
Nor made to court an amorous looking glass
I that am rudely stamp d and want love s majesty
To strut before a wanton ambling nymph
I that am curtail d of this fair proportion
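
The order of the two steps matters: punctuation removal introduces new spaces, so whitespace normalization has to run last. The sketch below chains two stand-in functions with a small compose helper to make that ordering explicit. All three names (compose, strip_punctuation, collapse_whitespace) are hypothetical and use only the standard library. Recent textacy releases also document a preprocessing.make_pipeline helper for chaining preprocessing functions, which is worth checking against the installed version.

```python
import re
import unicodedata
from functools import reduce
from typing import Callable

TextFunc = Callable[[str], str]


def compose(*funcs: TextFunc) -> TextFunc:
    """Return a function that applies `funcs` to a string from left to right."""
    return lambda text: reduce(lambda acc, fn: fn(acc), funcs, text)


def strip_punctuation(text: str) -> str:
    """Stand-in: replace each Unicode punctuation character with a space."""
    return "".join(" " if unicodedata.category(c).startswith("P") else c for c in text)


def collapse_whitespace(text: str) -> str:
    """Stand-in: collapse runs of spaces and of newlines, then strip the ends."""
    return re.sub(r"\n+", "\n", re.sub(r"[^\S\n]+", " ", text)).strip()


# Punctuation first, whitespace second.
clean = compose(strip_punctuation, collapse_whitespace)
print(clean("But I,       that am not shaped for sportive tricks,"))
```

Reversing the order would leave a trailing space where the final comma used to be, because the space is only created after whitespace has already been normalized.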

Partition a text

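Sometimes the text we receive is raw, that is, unstructured and messy. Before analysis, in the preprocessing stage, we may need to clean it up and partition it around a keyword. The keyword_in_context function from the extract module does this: for every match it returns the text to the left of the keyword, the keyword itself, and the text to the right.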




from textacy import preprocessing
from textacy import extract
 
 
ex = """
Now is the winter of our      discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the         deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern        alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of       mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I,       that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut        before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
 
# Remove punctuation, then normalize the whitespace left behind
rm_punc = preprocessing.remove.punctuation(ex)
rm_all = preprocessing.normalize.whitespace(rm_punc)
 
# Extract every match of the keyword with 20 characters of context on each side
ext = list(extract.keyword_in_context(
    rm_all, 'I', window_width=20, pad_context=True))
 
print(ext)

Output:

The output is a list of (left context, match, right context) tuples. It looks noisy because the keyword 'I' is matched case-insensitively and anywhere in the text, so every letter i inside a word counts as a match. Because the input had already been stripped of punctuation and extra whitespace, none appears in the contexts. The blank padding at the start and end of the list comes from pad_context=True, which pads each context to window_width characters.

[('                Now ', 'i', 's the winter of our '), 
('        Now is the w', 'i', 'nter of our disconte'), 
(' the winter of our d', 'i', 'scontent\nMade glorio'), 
('discontent\nMade glor', 'i', 'ous summer by this s'), 
('lorious summer by th', 'i', 's sun of York \nAnd a'), 
('ur d upon our house\n', 'I', 'n the deep bosom of '), 
('som of the ocean bur', 'i', 'ed \nNow are our brow'), 
('re our brows bound w', 'i', 'th victorious wreath'), 
('r brows bound with v', 'i', 'ctorious wreaths \nOu'), 
('ws bound with victor', 'i', 'ous wreaths \nOur bru'), 
('ous wreaths \nOur bru', 'i', 'sed arms hung up for'), 
('hanged to merry meet', 'i', 'ngs \nOur dreadful ma'), 
('adful marches to del', 'i', 'ghtful measures \nGri'), 
('ightful measures \nGr', 'i', 'm visaged war hath s'), 
('ful measures \nGrim v', 'i', 'saged war hath smoot'), 
(' war hath smooth d h', 'i', 's wrinkled front \nAn'), 
('hath smooth d his wr', 'i', 'nkled front \nAnd now'), 
('kled front \nAnd now ', 'i', 'nstead of mounting b'), 
('now instead of mount', 'i', 'ng barded steeds\nTo '), 
(' barded steeds\nTo fr', 'i', 'ght the souls of fea'), 
(' of fearful adversar', 'i', 'es \nHe capers nimbly'), 
('rsaries \nHe capers n', 'i', 'mbly in a lady s cha'), 
('s \nHe capers nimbly ', 'i', 'n a lady s chamber\nT'), 
(' chamber\nTo the lasc', 'i', 'vious pleasing of a '), 
('hamber\nTo the lasciv', 'i', 'ous pleasing of a lu'), 
('the lascivious pleas', 'i', 'ng of a lute \nBut I '), 
('sing of a lute \nBut ', 'I', ' that am not shaped '), 
('not shaped for sport', 'i', 've tricks \nNor made '), 
('aped for sportive tr', 'i', 'cks \nNor made to cou'), 
('ourt an amorous look', 'i', 'ng glass \nI that am '), 
('rous looking glass \n', 'I', ' that am rudely stam'), 
('before a wanton ambl', 'i', 'ng nymph \nI that am '), 
('nton ambling nymph \n', 'I', ' that am curtail d o'), 
('mph \nI that am curta', 'i', 'l d of this fair pro'), 
('t am curtail d of th', 'i', 's fair proportion   '), 
('curtail d of this fa', 'i', 'r proportion        '), 
('of this fair proport', 'i', 'on                  ')]

The listing below shows the first few results when punctuation and whitespace are not removed beforehand. The full output is omitted because it is long and cluttered with punctuation and extra spaces.

[('               \nNow ', 'i', 's the winter of our '), 
('       \nNow is the w', 'i', 'nter of our      dis'), 
('winter of our      d', 'i', 'scontent\nMade glorio'), 
('discontent\nMade glor', 'i', 'ous summer by this s'), 
('lorious summer by th', 'i', 's sun of York;\nAnd a'), 
("ur'd upon our house\n", 'I', 'n the         deep b'), 
('som of the ocean bur', 'i', 'ed.\nNow are our brow').......]
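
To make the matching behaviour concrete, here is a small standard-library sketch of keyword-in-context extraction. It is an illustration only, not textacy's implementation; the kwic function and its whole_word flag are hypothetical names introduced for this example. With whole_word=True the keyword is wrapped in word boundaries, so only the standalone pronoun I is matched.

```python
import re
from typing import Iterator, Tuple


def kwic(text: str, keyword: str, width: int = 20,
         whole_word: bool = False) -> Iterator[Tuple[str, str, str]]:
    """Yield (left context, match, right context) for each keyword match."""
    pattern = re.escape(keyword)
    if whole_word:
        pattern = rf"\b{pattern}\b"  # do not match inside longer words
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width): m.start()]
        right = text[m.end(): m.end() + width]
        yield left, m.group(), right


text = "But I that am not shaped for sportive tricks"

# Substring matching also finds the i in "sportive" and "tricks".
print(len(list(kwic(text, "I"))))

# Whole-word matching finds only the pronoun.
print(list(kwic(text, "I", whole_word=True)))
```

textacy's documentation describes the keyword argument as a string or regular expression, so passing a pattern with word boundaries may give a similar restriction there; this should be verified against the installed version.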

Replace URLs from text with other text

We can replace any URLs in our text with some other text:




from textacy import preprocessing
 
 
# Replace URLs
txt = "https://www.geeksforgeeks.org/ is the best place to learn anything"
rm_url = preprocessing.replace.urls(txt,"GeeksforGeeks")
 
print(rm_url)

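Output:

The URL at the start of the string is replaced with GeeksforGeeks, and the rest of the sentence is left unchanged.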

Replace emails with other text




from textacy import preprocessing
 
# Replace Emails
mail = "Send me a mail in the following address - example@gmail.com"
rm_mail = preprocessing.replace.emails(mail,"UserMail")
 
print(rm_mail)

Output:

The email address at the end of the string is replaced with UserMail.

Replace phone number




from textacy import preprocessing
 
# Replace phone number
num = "Call me at 12345678910"
rm_num = preprocessing.replace.phone_numbers(num,"NUM")
 
print(rm_num)

Output:

The phone number is replaced with NUM. If the text contains more than one phone number, all of them are replaced with NUM:




from textacy import preprocessing
 
# Replace phone number
num = "Call me at 12345678910 or 7896451235"
rm_num = preprocessing.replace.phone_numbers(num,"NUM")
 
print(rm_num)


Replace any number




from textacy import preprocessing
 
# Replace Number
n = "Any number like 12 or 86 , maybe 100 etc"
rm_n = preprocessing.replace.numbers(n,"Numbers")
 
print(rm_n)

Output:

Each number in the text (12, 86 and 100) is replaced with Numbers.
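
All of the replace functions above follow the same pattern: find every match of a pattern and substitute a placeholder string. The sketch below shows that pattern for plain integers and decimals using the standard library. The regular expression is a deliberately simplified assumption and will not cover every number format that textacy handles; replace_numbers and its default placeholder are hypothetical choices for this example.

```python
import re

# Simplified pattern: whole numbers and simple decimals such as 3.14.
NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")


def replace_numbers(text: str, repl: str = "_NUMBER_") -> str:
    """Replace every match of the simplified number pattern with `repl`."""
    return NUMBER.sub(repl, text)


print(replace_numbers("Any number like 12 or 86 , maybe 100 etc", "Numbers"))
```

Replacing values with a fixed placeholder, rather than deleting them, keeps the sentence structure intact for later analysis while hiding the specific value.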

Remove bracketed text along with the brackets




from textacy import preprocessing
 
txt = """Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde
& Informatica (CWI) in the Netherlands
as a successor to the ABC programming language, which was inspired by SETL,
capable of exception handling (from the start plus new capabilities in Python 3.11)"""
 
print(preprocessing.remove.brackets(txt))

Output:

Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde 
& Informatica  in the Netherlands
as a successor to the ABC programming language, which was inspired by SETL,
capable of exception handling

 

We can also pass a keyword argument named only with a list of the bracket types to remove. It supports three values: square, curly and round.




from textacy import preprocessing
 
txt = """Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde
& Informatica (CWI) in the Netherlands
as a successor to the [ABC programming language], which was inspired by SETL,
capable of exception handling {from the start plus new capabilities in Python 3.11}"""
 
print(preprocessing.remove.brackets(txt,only=["round","square"]))

Output:

Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde 
& Informatica  in the Netherlands
as a successor to the , which was inspired by SETL,
capable of exception handling {from the start plus new capabilities in Python 3.11}
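
A rough standard-library sketch of selective bracket removal follows. It assumes brackets are not nested and is not textacy's implementation; BRACKET_PATTERNS and remove_brackets are hypothetical names that only mirror the square, curly and round values described above.

```python
import re
from typing import Iterable, Optional

# One pattern per bracket type; each matches a non-nested bracketed span.
BRACKET_PATTERNS = {
    "round": re.compile(r"\([^()]*\)"),
    "square": re.compile(r"\[[^\[\]]*\]"),
    "curly": re.compile(r"\{[^{}]*\}"),
}


def remove_brackets(text: str, only: Optional[Iterable[str]] = None) -> str:
    """Remove bracketed spans; `only` limits removal to the listed types."""
    kinds = list(only) if only is not None else list(BRACKET_PATTERNS)
    for kind in kinds:
        text = BRACKET_PATTERNS[kind].sub("", text)
    return text


sample = "Informatica (CWI) and the [ABC language] use {braces}"
print(remove_brackets(sample, only=["round", "square"]))
```

As in the output above, removing a bracketed span leaves the surrounding spaces in place, so a whitespace normalization step is a natural follow-up.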

 

