NLP | Regex and Affix tagging

Regular expression matching can be used to tag words. For example, numbers can be matched with \d and assigned the tag CD (cardinal number), and known word patterns, such as the suffix "ing", can be matched in the same way.

Understanding the concept –

  • RegexpTagger is a subclass of SequentialBackoffTagger. In a backoff chain it can be placed just before a DefaultTagger, so that it tags words the n-gram tagger(s) missed.
  • At initialization, the patterns are saved in the RegexpTagger instance. When choose_tag() is called, it iterates over the patterns and returns the tag of the first expression that matches the current word using re.match().
  • So, if two expressions both match a word, the tag of the first one is returned without the second expression being tried.
  • With a catch-all pattern such as (r'.*', 'NN') at the end of the list, RegexpTagger can replace the DefaultTagger class.
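The first-match behaviour described above can be sketched in plain Python. This is only an illustration of the lookup logic using the standard re module, not NLTK's actual implementation; the function name choose_tag_sketch and the overlapping pattern list are hypothetical.

```python
import re

# Illustrative patterns: (regular expression, tag)
patterns = [
    (r'^\d+$', 'CD'),
    (r'.*ing$', 'VBG'),
    (r'.*ment$', 'NN'),
    (r'.*ful$', 'JJ'),
]

def choose_tag_sketch(word, patterns):
    """Return the tag of the first pattern that matches word, else None."""
    for regex, tag in patterns:
        if re.match(regex, word):
            return tag
    return None

print(choose_tag_sketch('42', patterns))         # CD
print(choose_tag_sketch('wondering', patterns))  # VBG
print(choose_tag_sketch('table', patterns))      # None (no pattern matches)

# Both patterns match 'wondering'; only the first one is ever used.
overlapping = [(r'.*ing$', 'VBG'), (r'.*ring$', 'NN')]
print(choose_tag_sketch('wondering', overlapping))  # VBG

# A trailing catch-all pattern plays the role of a default tag.
with_default = patterns + [(r'.*', 'NN')]
print(choose_tag_sketch('table', with_default))  # NN
```

Because the order of the list decides which tag wins, specific patterns belong before general ones, and a catch-all belongs last.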

Code #1 : Python regular expression module and re syntax




patterns = [# cardinal numbers, e.g. 42
            (r'^\d+$', 'CD'),
            # gerunds, e.g. wondering
            (r'.*ing$', 'VBG'),
            # e.g. wonderment
            (r'.*ment$', 'NN'),
            # e.g. wonderful
            (r'.*ful$', 'JJ')]



The RegexpTagger class expects a list of 2-tuples:

-> the first element of each tuple is a regular expression
-> the second element is the tag to assign when that expression matches

 
Code #2 : Using RegexpTagger


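# Loading libraries
# tag_util is assumed to be a local helper module that holds
# the patterns list from Code #1
from tag_util import patterns
from nltk.tag import RegexpTagger
from nltk.corpus import treebank

# Sentences from index 3000 onward are used as the test set
test_data = treebank.tagged_sents()[3000:]

tagger = RegexpTagger(patterns)
# Note: recent NLTK releases deprecate evaluate() in favor of accuracy()
print("Accuracy : ", tagger.evaluate(test_data))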



Output :

Accuracy : 0.037470321605870924
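One likely reason for the low score is that any word matching none of the patterns receives the tag None, which counts as an error. In a backoff chain, a later tagger fills those gaps. The sketch below mimics that idea in plain Python; the function names are hypothetical and this is not NLTK's SequentialBackoffTagger code.

```python
import re

def regex_tagger(word):
    """Tag digits and -ing words; return None for everything else."""
    if re.match(r'^\d+$', word):
        return 'CD'
    if re.match(r'.*ing$', word):
        return 'VBG'
    return None

def default_tagger(word):
    """Fallback that tags every word as a noun."""
    return 'NN'

def tag_with_backoff(word, taggers):
    """Try each tagger in order and return the first non-None tag."""
    for tagger in taggers:
        tag = tagger(word)
        if tag is not None:
            return tag
    return None

words = ['publishing', '61', 'group']

# Without a fallback, 'group' stays untagged.
alone = [(w, tag_with_backoff(w, [regex_tagger])) for w in words]
print(alone)    # [('publishing', 'VBG'), ('61', 'CD'), ('group', None)]

# With a fallback, every word gets a tag.
chained = [(w, tag_with_backoff(w, [regex_tagger, default_tagger])) for w in words]
print(chained)  # [('publishing', 'VBG'), ('61', 'CD'), ('group', 'NN')]
```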

What is Affix tagging ?
AffixTagger is a subclass of ContextTagger. For the AffixTagger class, the context is either the prefix or the suffix of a word, so the class learns tags based on fixed-length substrings taken from the beginning or end of a word.
By default it uses three-character suffixes (affix_length=-3) and requires a stem of at least two characters (min_stem_length=2). Words must therefore be at least five characters long, and None is returned as the tag for any shorter word.
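How the context is derived from a word can be sketched as follows. This is a simplified illustration of the rule described above, assuming the parameter names affix_length and min_stem_length; the function affix_context itself is hypothetical and only extracts the substring, it does not learn or assign tags.

```python
def affix_context(token, affix_length=-3, min_stem_length=2):
    """Return the prefix or suffix used as context, or None if the word is too short."""
    min_word_length = abs(affix_length) + min_stem_length
    if len(token) < min_word_length:
        return None
    if affix_length > 0:
        # Positive length: take characters from the start (prefix)
        return token[:affix_length]
    # Negative length: take characters from the end (suffix)
    return token[affix_length:]

print(affix_context('publishing'))                   # ing
print(affix_context('publishing', affix_length=3))   # pub
print(affix_context('publishing', affix_length=-2))  # ng
print(affix_context('group'))                        # oup (exactly five characters)
print(affix_context('the'))                          # None (too short)
```

The sign of affix_length selects prefix or suffix, which is the only difference between the prefix and suffix taggers.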

Code #3 : Understanding AffixTagger.


# Loading libraries
from nltk.corpus import treebank
from nltk.tag import AffixTagger

# Initializing training and testing sets
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

print("Train data : \n", train_data[1])

# Initializing tagger (defaults to three-character suffixes)
tag = AffixTagger(train_data)

# Testing
print("\nAccuracy : ", tag.evaluate(test_data))



Output :

Train data :  
[('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), 
('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'),
('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')]

Accuracy : 0.27558817181092166

 

Code #4 : AffixTagger by specifying 3 character prefixes.


# Specifying 3-character prefixes (a positive affix_length selects prefixes)
# train_data and test_data are the sets created in Code #3
prefix_tag = AffixTagger(train_data,
                         affix_length=3)

# Testing
accuracy = prefix_tag.evaluate(test_data)

print("Accuracy : ", accuracy)



Output :

Accuracy : 0.23587308439456076

 
Code #5 : AffixTagger by specifying 2-character suffixes


# Specifying 2-character suffixes (a negative affix_length selects suffixes)
suffix_tag = AffixTagger(train_data,
                         affix_length=-2)

# Testing
accuracy = suffix_tag.evaluate(test_data)

print("Accuracy : ", accuracy)



Output :

Accuracy : 0.31940427368875457





Improved By : shubham_singh