NLP | IOB tags

What are Chunks ?
Chunks are made up of words and the kinds of words are defined using the part-of-speech tags. One can even define a pattern or words that can’t be a part of chuck and such words are known as chinks.

What are IOB tags ?
It is a format for chunks. These tags are similar to part-of-speech tags but provide can denote the inside, utside, and the beginning of a chunk. Not just noun phrase but multiple different chunk phrase types are allowed here.

Example : It is an excerpt from the conll2000 corpus. Each word is with a part-of-speech tag followed by an IOB tag on its own line:



Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP

What it means ?
B-NP : beginning of a noun phrase
I-NP : descibes that the word is inside of the current noun phrase.
O : end of the sentence.
B-VP and I-VP : beginning and inside of a verb phrase.

Code #1 : How it works – chunking words with IOB tags.

filter_none

edit
close

play_arrow

link
brightness_4
code

# Loading the libraries
from nltk.corpus.reader import ConllChunkCorpusReader
  
# Initailizing
reader = ConllChunkCorpusReader(
        '.', r'.*\.iob', ('NP', 'VP', 'PP'))
  
reader.chunked_words()
  
reader.iob_words()

chevron_right


Output :

[Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]), Tree('VP', [('had', 'VBD'), 
('been', 'VBN')]), ...]

[('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ...]

Code #2 : How it works – chunking sentence with IOB tags.

filter_none

edit
close

play_arrow

link
brightness_4
code

# Loading the libraries
from nltk.corpus.reader import ConllChunkCorpusReader
  
# Initailizing
reader = ConllChunkCorpusReader(
        '.', r'.*\.iob', ('NP', 'VP', 'PP'))
  
reader.chunked_sents()
  
reader.iob_sents()

chevron_right


Output :

[Tree('S', [Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]),
Tree('VP', [('had', 'VBD'), ('been', 'VBN')]), 
Tree('NP', [('executive', 'JJ'), ('vice', 'NN'), ('president', 'NN')]),
Tree('PP', [('of', 'IN')]), Tree('NP', [('Balcor', 'NNP')]), ('.', '.')])]

[[('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ('had', 'VBD', 'B-VP'), 
('been', 'VBN', 'I-VP'), ('executive', 'JJ', 'B-NP'), ('vice', 'NN', 'I-NP'), 
('president', 'NN', 'I-NP'), ('of', 'IN', 'B-PP'), ('Balcor', 'NNP', 'B-NP'), 
('.', '.', 'O')]]

Let’s understand the code above :

  • For reading the corpus with IOB format, ConllChunkCorpusReader class is used.
  • No separation of paragraphs and each sentence is separated by a blank line, therefore para_* methods are not available.
  • Tuple or list specifying the types of chunks in the file like (‘NP’, ‘VP’, ‘PP’) sreves as the third argument to ConllChunkCorpusReader.
  • iob_words() and iob_sents() methods returns lists of three tuples of (word, pos, iob)

Code #3 : Tree Leaves – i.e. the tagged tokens

filter_none

edit
close

play_arrow

link
brightness_4
code

# Loading the libraries
from nltk.corpus.reader import ConllChunkCorpusReader
  
# Initailizing
reader = ConllChunkCorpusReader(
        '.', r'.*\.iob', ('NP', 'VP', 'PP'))
  
reader.chunked_words()[0].leaves()
  
reader.chunked_sents()[0].leaves()
  
reader.chunked_paras()[0][0].leaves()

chevron_right


Output :

[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]

[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'),
('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), ('300', 'CD'),
('jobs', 'NNS'), (', ', ', '), ('the', 'DT'), ('spokesman', 'NN'),
('said', 'VBD'), ('.', '.')]

[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'),
('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), ('300', 'CD'),
('jobs', 'NNS'), (', ', ', '), ('the', 'DT'), ('spokesman', 'NN'),
('said', 'VBD'), ('.', '.')]


My Personal Notes arrow_drop_up

Aspire to Inspire before I expire

If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to contribute@geeksforgeeks.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.

Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.



Improved By : Akanksha_Rai