In the previous post, we saw the basic preprocessing steps when working with textual data. In this article, we will look at some more advanced text preprocessing techniques. We can use these techniques to gain more insights into the data that we have.
Let’s import the necessary libraries.
Part of Speech Tagging:
The part of speech explains how a word is used in a sentence. The same word can take on different roles and meanings depending on its context. Basic natural language processing models like bag-of-words ignore word order and context, so they fail to capture these relations between words. Hence, we use part of speech tagging to label each word with its part of speech tag based on its context in the data. The tags are also used to extract relationships between words.
Input: 'You just gave me a scare'
Output: [('You', 'PRP'), ('just', 'RB'), ('gave', 'VBD'), ('me', 'PRP'), ('a', 'DT'), ('scare', 'NN')]
In the given example, PRP stands for personal pronoun, RB for adverb, VBD for verb in the past tense, DT for determiner and NN for noun. We can look up the details of any part of speech tag in the Penn Treebank tagset. For example, looking up the tag NN gives its description followed by some sample words:
Output: NN: noun, common, singular or mass
common-carrier cabbage knuckle-duster Casino afghan shed thermostat
investment slide humour falloff slick wind hyena override subhumanity
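To keep the meaning of each tag close at hand, we can pair a tagged sentence with short tag descriptions. The snippet below is a minimal sketch: the lookup table is written out by hand and covers only the five tags discussed above, not the full Penn Treebank tagset, and the names TAG_DESCRIPTIONS and describe_tags are made up for this illustration.

```python
# Hand-written descriptions for the tags used in this article.
# This is only a small subset of the Penn Treebank tagset.
TAG_DESCRIPTIONS = {
    "PRP": "personal pronoun",
    "RB": "adverb",
    "VBD": "verb, past tense",
    "DT": "determiner",
    "NN": "noun, common, singular or mass",
}

# The tagged sentence from the example above.
tagged = [
    ("You", "PRP"), ("just", "RB"), ("gave", "VBD"),
    ("me", "PRP"), ("a", "DT"), ("scare", "NN"),
]


def describe_tags(tagged_words, descriptions=TAG_DESCRIPTIONS):
    """Return (word, tag, description) triples for a tagged sentence."""
    return [
        (word, tag, descriptions.get(tag, "unknown tag"))
        for word, tag in tagged_words
    ]


for word, tag, meaning in describe_tags(tagged):
    print(f"{word:<6}{tag:<5}{meaning}")
```

Tags that are missing from the table are reported as "unknown tag" instead of raising an error, which is convenient when the table is incomplete.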
Chunking:
Chunking is the process of extracting phrases from unstructured text and adding more structure to it. It is also known as shallow parsing. It is done on top of part of speech tagging. It groups words into “chunks”, mainly noun phrases. Chunking is done using regular expressions.
The chunk grammar is defined using a simple regular expression rule over the part of speech tags. In NLTK's chunk grammar notation, such a rule can be written as NP: {<DT>?<JJ>*<NN>}. This rule says that an NP (Noun Phrase) chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN).
Libraries like spaCy and TextBlob are better suited for chunking, because they provide noun phrase extraction out of the box.
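The noun phrase rule described above can be sketched in plain Python with the re module. This is an illustrative stand-in rather than the NLTK chunker itself: the sentence is tagged by hand (is/VBZ, flying/VBG and in/IN are assumed tags), and NP_RULE, find_noun_phrases and format_chunk are hypothetical names. The tags are joined into one string so that an ordinary regular expression can match over the tag sequence.

```python
import re

# Hand-tagged (word, tag) pairs; in practice a POS tagger would produce these.
tagged = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("bird", "NN"),
    ("is", "VBZ"), ("flying", "VBG"), ("in", "IN"),
    ("the", "DT"), ("sky", "NN"),
]

# An optional determiner, any number of adjectives, then a noun.
NP_RULE = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")


def find_noun_phrases(tagged_words):
    """Return the list of NP chunks, each a list of (word, tag) pairs."""
    # Encode the tag sequence as "<DT><JJ>..." so the regex runs over tags.
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_words)
    chunks = []
    for match in NP_RULE.finditer(tag_string):
        # Every tag starts with "<", so counting "<" gives token positions.
        start = tag_string.count("<", 0, match.start())
        end = start + match.group().count("<")
        chunks.append(tagged_words[start:end])
    return chunks


def format_chunk(chunk, label="NP"):
    """Render a chunk in the bracketed word/TAG style used in this article."""
    return f"({label} " + " ".join(f"{word}/{tag}" for word, tag in chunk) + ")"


for chunk in find_noun_phrases(tagged):
    print(format_chunk(chunk))
```

With NLTK, the same rule can be passed to nltk.RegexpParser as a grammar string; the parser then returns a tree of chunks instead of a plain list.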
Input: 'the little yellow bird is flying in the sky'
Output:
(NP the/DT little/JJ yellow/JJ bird/NN)
(NP the/DT sky/NN)
Named Entity Recognition:
Named Entity Recognition is used to extract information from unstructured text. It classifies the entities present in a text into categories such as person, organization, event and place. It gives us detailed knowledge about the text and the relationships between the different entities.
Input: 'Bill works for GeeksforGeeks so he went to Delhi for a meetup.'
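To make the idea of labelling entities concrete, here is a deliberately simple sketch that uses a hand-written lookup table (a gazetteer) instead of a trained model. The table entries, the labels PERSON, ORGANIZATION and PLACE, and the function name tag_entities are assumptions made for this illustration; a real named entity recognizer learns from context and can label names it has never seen.

```python
import re

# Hypothetical hand-written gazetteer: known names and their categories.
GAZETTEER = {
    "Bill": "PERSON",
    "GeeksforGeeks": "ORGANIZATION",
    "Delhi": "PLACE",
}


def tag_entities(text, gazetteer=GAZETTEER):
    """Return (entity, category) pairs for every token found in the gazetteer."""
    # Split the text into word tokens, dropping punctuation.
    tokens = re.findall(r"\w+", text)
    return [(token, gazetteer[token]) for token in tokens if token in gazetteer]


sentence = "Bill works for GeeksforGeeks so he went to Delhi for a meetup."
print(tag_entities(sentence))
```

A lookup table like this only finds names it already contains, which is why practical systems rely on models that use the surrounding words and part of speech tags as evidence.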