NLP | Regex and Affix tagging

Regular expression matching can be used to tag words. For example, numbers can be matched with \d and assigned the tag CD (cardinal number), or known word patterns, such as the suffix "ing", can be matched.

Understanding the concept –

  • RegexpTagger is a subclass of SequentialBackoffTagger. It can be placed just before a DefaultTagger in a backoff chain, so that it tags words the n-gram tagger(s) missed.
  • At initialization, the patterns are saved in the RegexpTagger instance. When choose_tag() is called, it iterates over the patterns and returns the tag of the first expression that matches the current word using re.match().
  • So, if two expressions both match a word, the tag of the first one is returned without the second expression even being tried.
  • With a catch-all pattern such as (r'.*', 'NN') as the last entry, RegexpTagger can replace the DefaultTagger class.
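
The first-match behaviour described above can be tried out without NLTK. The following is a minimal sketch using only Python's re module; first_matching_tag is a hypothetical helper that imitates what choose_tag() is described as doing, not NLTK's own code.

```python
import re

def first_matching_tag(word, patterns):
    """Return the tag of the first pattern that matches word, else None."""
    for regexp, tag in patterns:
        # re.match() anchors the match at the start of the word
        if re.match(regexp, word):
            return tag
    return None

# Order matters: both patterns match "123", but the first one wins
ordered = [(r'^\d+$', 'CD'), (r'.*', 'NN')]
print(first_matching_tag('123', ordered))    # CD
print(first_matching_tag('table', ordered))  # NN, via the catch-all

# Without the catch-all, an unmatched word gets no tag
print(first_matching_tag('table', ordered[:1]))  # None
```

Reversing the list would tag "123" as NN, which is why a catch-all pattern belongs at the end.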

Code #1 : Defining tagging patterns with Python re syntax



patterns = [# cardinal numbers, e.g. 42
            (r'^\d+$', 'CD'),
            # gerunds, e.g. wondering
            (r'.*ing$', 'VBG'),
            # e.g. wonderment
            (r'.*ment$', 'NN'),
            # e.g. wonderful
            (r'.*ful$', 'JJ')]

The RegexpTagger class expects a list of 2-tuples:

-> first element in the tuple is a regular expression
-> second element is the tag

 
Code #2 : Using RegexpTagger

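# Loading libraries
# patterns is the list from Code #1, assumed here to be saved in a local tag_util.py
from tag_util import patterns
from nltk.tag import RegexpTagger
from nltk.corpus import treebank

# Sentences from index 3000 onwards are used as the test set
test_data = treebank.tagged_sents()[3000:]

tagger = RegexpTagger(patterns)
# Recent NLTK releases deprecate evaluate() in favour of accuracy()
print("Accuracy : ", tagger.evaluate(test_data))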

Output :

Accuracy : 0.037470321605870924
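
A low score is unsurprising for four patterns used on their own: any word that matches none of them is tagged None, and None never equals the gold tag. Below is a rough sketch of token-level accuracy using only the standard library. The helpers tag_word and token_accuracy and the hand-made sentence are assumptions for illustration; they are not NLTK code and the numbers are not treebank results.

```python
import re

patterns = [(r'^\d+$', 'CD'), (r'.*ing$', 'VBG'),
            (r'.*ment$', 'NN'), (r'.*ful$', 'JJ')]

def tag_word(word, patterns):
    """First matching pattern wins; None when nothing matches."""
    for regexp, tag in patterns:
        if re.match(regexp, word):
            return tag
    return None

def token_accuracy(gold_sent, patterns):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(tag_word(word, patterns) == tag for word, tag in gold_sent)
    return correct / len(gold_sent)

# Hand-made gold sentence, not taken from the treebank corpus
gold = [('the', 'DT'), ('report', 'NN'), ('is', 'VBZ'),
        ('rising', 'VBG'), ('in', 'IN'), ('value', 'NN')]

# Patterns alone: only "rising" is tagged correctly (1 of 6 tokens)
print(token_accuracy(gold, patterns))

# With a catch-all NN pattern appended: 3 of 6 tokens are correct
print(token_accuracy(gold, patterns + [(r'.*', 'NN')]))
```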

What is Affix tagging ?
AffixTagger is a subclass of ContextTagger. For the AffixTagger class, the context is either the prefix or the suffix of a word, so the class learns tags from fixed-length substrings at the beginning or end of a word.
By default, it uses three-character suffixes, and a word must be at least 5 characters long; None is returned as the tag for any word shorter than five characters.
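
To make the idea of an affix "context" concrete, here is a minimal standard-library sketch. The functions affix_context and train_affix_model are hypothetical helpers written for illustration, and the two-character minimum stem is an assumption chosen to reproduce the five-character rule above; the real AffixTagger training involves more than this simple most-frequent-tag count.

```python
from collections import Counter, defaultdict

def affix_context(word, affix_length=-3, min_stem_length=2):
    """Return a prefix (positive length) or suffix (negative length),
    or None when the word is too short."""
    if len(word) < abs(affix_length) + min_stem_length:
        return None
    return word[:affix_length] if affix_length > 0 else word[affix_length:]

def train_affix_model(tagged_sents, affix_length=-3):
    """Map each affix to the tag seen most often with it."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            affix = affix_context(word, affix_length)
            if affix is not None:
                counts[affix][tag] += 1
    return {affix: tags.most_common(1)[0][0] for affix, tags in counts.items()}

# Toy training data, invented for illustration
toy_sents = [[('running', 'VBG'), ('quickly', 'RB')],
             [('jumping', 'VBG'), ('slowly', 'RB'), ('ring', 'NN')]]
model = train_affix_model(toy_sents)

print(affix_context('walking'))             # ing
print(affix_context('ring'))                # None: fewer than 5 characters
print(model.get(affix_context('walking')))  # VBG
```

Passing a positive affix_length, for example affix_context('walking', affix_length=3), returns the prefix 'wal' instead.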

Code #3 : Understanding AffixTagger.

# Loading libraries
from nltk.corpus import treebank
from nltk.tag import AffixTagger

# Initializing training and testing sets
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

print("Train data : \n", train_data[1])

# Initializing the tagger with its default settings
tag = AffixTagger(train_data)

# Testing
print("\nAccuracy : ", tag.evaluate(test_data))

Output :

Train data :  
[('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), 
('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'),
('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')]

Accuracy : 0.27558817181092166

 

Code #4 : AffixTagger by specifying 3 character prefixes.

# Specifying 3-character prefixes
# (a positive affix_length selects prefixes;
# train_data and test_data come from Code #3)
prefix_tag = AffixTagger(train_data,
                         affix_length=3)

# Testing
accuracy = prefix_tag.evaluate(test_data)

print("Accuracy : ", accuracy)

Output :

Accuracy : 0.23587308439456076

 
Code #5 : AffixTagger by specifying 2-character suffixes

# Specifying 2-character suffixes
# (a negative affix_length selects suffixes;
# train_data and test_data come from Code #3)
suffix_tag = AffixTagger(train_data,
                         affix_length=-2)

# Testing
accuracy = suffix_tag.evaluate(test_data)

print("Accuracy : ", accuracy)

Output :

Accuracy : 0.31940427368875457

