NLP | Filtering Insignificant Words
  • Last Updated : 26 Feb, 2019

Many of the words in a phrase carry little meaning on their own. Take the sentence ‘English is a subject’. Here, ‘English’ and ‘subject’ are the most significant words, while ‘is’ and ‘a’ contribute almost nothing. If we remove the insignificant words (‘is’, ‘a’), what remains, ‘English subject’, still conveys the gist. Using NLTK, we can remove insignificant words by looking at their part-of-speech tags. To do that, we first have to decide which part-of-speech tags are significant.

Code #1 : filter_insignificant() function to filter out the insignificant words

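def filter_insignificant(chunk, tag_suffixes=['DT', 'CC']):
    # Collect the (word, tag) pairs whose tag matches none of the suffixes
    good = []

    for word, tag in chunk:
        ok = True

        # Reject the word as soon as its tag ends with any listed suffix
        for suffix in tag_suffixes:
            if tag.endswith(suffix):
                ok = False
                break

        if ok:
            good.append((word, tag))

    return good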


filter_insignificant() iterates over the tagged words in the chunk and, for each one, checks whether its tag ends with any of the tag_suffixes. If it does, the tagged word is skipped. Otherwise, the tagged word is appended to a new list, good, which is returned at the end. By default, the suffixes are DT (determiners) and CC (coordinating conjunctions).
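The same check can be written more compactly, because Python's str.endswith() accepts a tuple of suffixes. The sketch below is only an illustration of that idea, not the version used in the rest of this article; the name filter_insignificant_compact is hypothetical.

```python
def filter_insignificant_compact(chunk, tag_suffixes=('DT', 'CC')):
    """Return the (word, tag) pairs whose tag ends with none of tag_suffixes."""
    # str.endswith accepts a tuple and returns True if any element matches
    suffixes = tuple(tag_suffixes)
    return [(word, tag) for word, tag in chunk if not tag.endswith(suffixes)]


tagged = [('the', 'DT'), ('terrible', 'JJ'), ('movie', 'NN')]
result = filter_insignificant_compact(tagged)
print(result)
```

Converting tag_suffixes to a tuple lets callers keep passing a list, as in the original function.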

Code #2 : Using filter_insignificant() on a phrase



# Assumes the function from Code #1 is saved in a local module transforms.py
from transforms import filter_insignificant

print("Significant words :")
print(filter_insignificant([('the', 'DT'),
                            ('terrible', 'JJ'), ('movie', 'NN')]))


Output :

Significant words : 
[('terrible', 'JJ'), ('movie', 'NN')]

We can also pass our own tag suffixes to filter_insignificant(). Suppose we decide that pronouns and possessive words such as you, your, their and theirs are no good, but DT and CC words are ok. The tag suffixes would then be PRP and PRP$:
 
Code #3 : Passing in our own tag suffixes using filter_insignificant()

# Assumes the function from Code #1 is saved in a local module transforms.py
from transforms import filter_insignificant

# choosing tag_suffixes
print("Significant words :")
print(filter_insignificant([('your', 'PRP$'),
                            ('book', 'NN'), ('is', 'VBZ'),
                            ('great', 'JJ')],
                           tag_suffixes=['PRP', 'PRP$']))


Output :

Significant words : 
[('book', 'NN'), ('is', 'VBZ'), ('great', 'JJ')]
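Both PRP and PRP$ have to be listed because the check is a suffix match: the tag PRP$ ends with '$', not with 'PRP'. The sketch below illustrates this on a hand-tagged sentence. The function is repeated so the example runs on its own, and the sentence and its tags are assumptions made for illustration.

```python
def filter_insignificant(chunk, tag_suffixes=['DT', 'CC']):
    good = []
    for word, tag in chunk:
        # Keep the word only if its tag ends with none of the suffixes
        if not any(tag.endswith(suffix) for suffix in tag_suffixes):
            good.append((word, tag))
    return good


# Hand-tagged example sentence (tags assumed, not produced by a tagger)
chunk = [('you', 'PRP'), ('love', 'VBP'), ('your', 'PRP$'), ('book', 'NN')]

# 'PRP' alone does not match 'PRP$', so 'your' survives
only_prp = filter_insignificant(chunk, tag_suffixes=['PRP'])

# Listing both suffixes removes 'you' and 'your'
both = filter_insignificant(chunk, tag_suffixes=['PRP', 'PRP$'])

# A bare '$' suffix removes every tag ending in '$', here only 'PRP$'
possessives = filter_insignificant(chunk, tag_suffixes=['$'])

print(only_prp)
print(both)
print(possessives)
```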
