NLP | Training a tokenizer and filtering stopwords in a sentence

Why do we need to train a sentence tokenizer?
In NLTK, the default sentence tokenizer is general-purpose and works very well for most text. It may not work as well for text that uses nonstandard punctuation or has an unusual format. In such cases, training a sentence tokenizer on the text can give much more accurate sentence tokenization.

Consider the following text to understand the concept. This kind of text is very common in web text corpora.

Example of TEXT:
A guy: So, what are your plans for the party?
B girl: well! I am not going!
A guy: Oh, but u should enjoy.

To download text file, click here.



Code #1 : Training Tokenizer

# Loading libraries
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext

# Read the raw training text
text = webtext.raw('C:\\Geeksforgeeks\\data_for_training_tokenizer.txt')

# Passing raw text to the constructor trains the tokenizer on it
sent_tokenizer = PunktSentenceTokenizer(text)
sents_1 = sent_tokenizer.tokenize(text)

print(sents_1[0])
print("\n" + sents_1[678])

Output:

'White guy: So, do you have any plans for this evening?'

'Hobo: Got any spare change?'

 
Code #2: Default Sentence Tokenizer

from nltk.tokenize import sent_tokenize

# Tokenize the same text (loaded in Code #1) with NLTK's default tokenizer
sents_2 = sent_tokenize(text)

print(sents_2[0])
print("\n" + sents_2[678])

Output:

'White guy: So, do you have any plans for this evening?'

'Girl: But you already have a Big Mac...\r\nHobo: Oh, this is all theatrical.'

This difference in the second output is a good demonstration of why it can be useful to train your own sentence tokenizer, especially when your text isn’t in the typical paragraph-sentence structure.
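To locate where two tokenizers first disagree, rather than inspecting a fixed index such as 678, the two sentence lists can be compared position by position. The helper below is only a minimal sketch: `first_difference` is a hypothetical name, not part of NLTK, and the two sample lists are hand-written imitations of the outputs above, not results read from the corpus.

```python
from itertools import zip_longest


def first_difference(sents_a, sents_b):
    """Return the first index at which two sentence lists differ, or None."""
    for i, (a, b) in enumerate(zip_longest(sents_a, sents_b)):
        if a != b:
            return i
    return None


# Hand-written sample data for illustration only
trained = [
    "Girl: But you already have a Big Mac...",
    "Hobo: Oh, this is all theatrical.",
]
default = [
    "Girl: But you already have a Big Mac...\r\nHobo: Oh, this is all theatrical.",
]

idx = first_difference(trained, default)
print(idx)  # 0
```

With real data, the same call could be made as `first_difference(sents_1, sents_2)`.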

How does training work?
The PunktSentenceTokenizer class uses an unsupervised learning algorithm to learn what constitutes a sentence break. It is unsupervised because no labelled training data is needed, just raw text.
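As a rough illustration of training from raw text alone, the sketch below passes an in-memory string to the constructor instead of reading a corpus file. The training string and the test sentence are made up for this example, and a text this small is far too little for the tokenizer to learn reliable statistics, so the resulting splits should not be taken as representative.

```python
from nltk.tokenize import PunktSentenceTokenizer

# Tiny made-up training text in the dialogue style shown earlier.
# A real training corpus should be much larger.
train_text = (
    "A guy: So, what are your plans for the party?\n"
    "B girl: well! I am not going!\n"
    "A guy: Oh, but u should enjoy.\n"
)

# Passing raw text to the constructor trains the tokenizer; no labels are needed
tokenizer = PunktSentenceTokenizer(train_text)

# Tokenize a new, made-up piece of text with the trained tokenizer
sentences = tokenizer.tokenize("A guy: Are you sure? B girl: Yes. I am staying home.")
for sentence in sentences:
    print(sentence)
```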

Filtering stopwords in a tokenized sentence

Stopwords are common words that are present in the text but generally do not contribute to the meaning of a sentence. They hold almost no importance for the purposes of information retrieval and natural language processing. Examples are ‘the’ and ‘a’. Most search engines filter out stopwords from search queries and documents.
The NLTK library comes with a stopwords corpus (nltk_data/corpora/stopwords/) that contains word lists for many languages.

Code #3 : Stopwords with Python

# Loading library
from nltk.corpus import stopwords

# Using the English stopword list
english_stops = set(stopwords.words('english'))

# Sample tokenized sentence to filter
words = ["Let's", 'see', 'how', "it's", 'working']

print("Before stopwords removal: ", words)
print("\nAfter stopwords removal : ",
      [word for word in words if word not in english_stops])

Output:

Before stopwords removal:  ["Let's", 'see', 'how', "it's", 'working']

After stopwords removal :  ["Let's", 'see', 'working']
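The membership test above is case-sensitive, and NLTK's English stopword list is lowercase, so a capitalized token such as 'The' would pass through the filter. One possible workaround is to lowercase each token before the lookup. The sketch below assumes a small hand-written stopword set (`sample_stops`) and a hypothetical helper `remove_stopwords` so that it runs without the corpus; neither is part of NLTK.

```python
# Hypothetical, hand-written stopword set used only for this illustration;
# in practice it could be set(stopwords.words('english')) from NLTK
sample_stops = {"the", "a", "on", "is"}


def remove_stopwords(words, stops):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [word for word in words if word.lower() not in stops]


words = ["The", "cat", "is", "on", "A", "mat"]
filtered = remove_stopwords(words, sample_stops)
print(filtered)  # ['cat', 'mat']
```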

 
Code #4 : Complete list of languages used in NLTK stopwords.

# List the languages available in the stopwords corpus
print(stopwords.fileids())

Output:

['danish', 'dutch', 'english', 'finnish', 'french', 'german',
'hungarian', 'italian', 'norwegian', 'portuguese', 'russian',
'spanish', 'swedish', 'turkish']

