NLP | IOB tags

What are Chunks?
Chunks are groups of words, and the kinds of words a chunk may contain are defined using part-of-speech tags. One can also define patterns of words that must not be part of a chunk; such words are known as chinks.

What are IOB tags?
IOB is a format for annotating chunks. The tags are similar to part-of-speech tags, but they denote whether a word is inside, outside, or at the beginning of a chunk. The format is not limited to noun phrases; multiple chunk phrase types are allowed.

Example : An excerpt from the conll2000 corpus. Each word is on its own line, followed by a part-of-speech tag and then an IOB tag:

Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
. . O
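
To make the three-column layout concrete, here is a minimal sketch in plain Python (no NLTK) that splits lines like those in the excerpt into (word, pos, iob) triples. The helper name parse_iob_lines is hypothetical and used only for illustration; it assumes whitespace-separated columns and skips blank lines.

```python
def parse_iob_lines(text):
    """Split IOB-formatted text into (word, pos, iob) triples."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines separate sentences in the corpus file
        word, pos, iob = line.split()
        triples.append((word, pos, iob))
    return triples


excerpt = """\
Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
"""

triples = parse_iob_lines(excerpt)
print(triples[:2])
# [('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP')]
```

This is the same triple shape that the NLTK reader exposes, so it can be handy for inspecting a file before loading it as a corpus.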

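What does it mean?
B-NP : the word begins a noun phrase.
I-NP : the word is inside the current noun phrase.
B-VP and I-VP : the beginning and the inside of a verb phrase.
B-PP : the beginning of a prepositional phrase.
O : the word is outside of any chunk (for example, sentence-final punctuation).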
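
The tag meanings above can be sketched as a small grouping routine in plain Python. This is an illustrative sketch, not NLTK's implementation: the function name iob_to_chunks is hypothetical, and it leniently treats an I- tag that does not continue a matching chunk as the start of a new one.

```python
def iob_to_chunks(triples):
    """Group (word, pos, iob) triples into (chunk_type, words) pairs."""
    chunks = []
    for word, _pos, iob in triples:
        if iob == "O":
            # Outside any chunk: keep the word on its own, labeled "O"
            chunks.append(("O", [word]))
            continue
        prefix, chunk_type = iob.split("-", 1)
        continues_chunk = (
            prefix == "I" and chunks and chunks[-1][0] == chunk_type
        )
        if continues_chunk:
            chunks[-1][1].append(word)          # I-XX extends the current chunk
        else:
            chunks.append((chunk_type, [word]))  # B-XX starts a new chunk
    return chunks


sample = [
    ("Mr.", "NNP", "B-NP"), ("Meador", "NNP", "I-NP"),
    ("had", "VBD", "B-VP"), ("been", "VBN", "I-VP"),
    ("executive", "JJ", "B-NP"), ("vice", "NN", "I-NP"),
    ("president", "NN", "I-NP"), ("of", "IN", "B-PP"),
    ("Balcor", "NNP", "B-NP"), (".", ".", "O"),
]

for chunk_type, words in iob_to_chunks(sample):
    print(chunk_type, words)
```

Note how the B- prefix is what keeps two adjacent noun phrases apart: without it, consecutive NP chunks could not be told from one long chunk.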

Code #1 : How it works – chunking words with IOB tags.

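# Loading the libraries
from nltk.corpus.reader import ConllChunkCorpusReader

# Initializing the reader: corpus root, file pattern, chunk types
reader = ConllChunkCorpusReader(
        '.', r'.*\.iob', ('NP', 'VP', 'PP'))

# Chunks as Tree objects
print(reader.chunked_words())

# Flat (word, pos, iob) triples
print(reader.iob_words())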

Output :

[Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]), Tree('VP', [('had', 'VBD'), 
('been', 'VBN')]), ...]

[('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ...]

Code #2 : How it works – chunking sentence with IOB tags.

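# Loading the libraries
from nltk.corpus.reader import ConllChunkCorpusReader

# Initializing the reader: corpus root, file pattern, chunk types
reader = ConllChunkCorpusReader(
        '.', r'.*\.iob', ('NP', 'VP', 'PP'))

# One Tree per sentence, with chunks as subtrees
print(reader.chunked_sents())

# One list of (word, pos, iob) triples per sentence
print(reader.iob_sents())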

Output :

[Tree('S', [Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]),
Tree('VP', [('had', 'VBD'), ('been', 'VBN')]), 
Tree('NP', [('executive', 'JJ'), ('vice', 'NN'), ('president', 'NN')]),
Tree('PP', [('of', 'IN')]), Tree('NP', [('Balcor', 'NNP')]), ('.', '.')])]

[[('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ('had', 'VBD', 'B-VP'), 
('been', 'VBN', 'I-VP'), ('executive', 'JJ', 'B-NP'), ('vice', 'NN', 'I-NP'), 
('president', 'NN', 'I-NP'), ('of', 'IN', 'B-PP'), ('Balcor', 'NNP', 'B-NP'), 
('.', '.', 'O')]]

Let’s understand the code above :

  • The ConllChunkCorpusReader class is used to read a corpus in IOB format.
  • The format has no paragraph separation, and sentences are separated by a blank line; therefore the para_* methods are not available.
  • The third argument to ConllChunkCorpusReader is a tuple or list specifying the types of chunks in the file, such as ('NP', 'VP', 'PP').
  • iob_words() returns a list of 3-tuples of the form (word, pos, iob), and iob_sents() returns a list of such lists, one per sentence.
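
The points above can be tried end to end with a small self-contained sketch. It writes the example sentence to a temporary directory and reads it back; the file name sample.iob and the temporary directory are assumptions made for this illustration, and the results are materialized inside the with block because the reader loads data lazily.

```python
import os
import tempfile

from nltk.corpus.reader import ConllChunkCorpusReader

# The example sentence in IOB format, one token per line
SAMPLE = """Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
. . O
"""

with tempfile.TemporaryDirectory() as corpus_dir:
    # Hypothetical file name; any name matching the pattern below works
    path = os.path.join(corpus_dir, "sample.iob")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(SAMPLE)

    reader = ConllChunkCorpusReader(
        corpus_dir, r".*\.iob", ("NP", "VP", "PP"))

    # Materialize the lazy views before the directory is removed
    iob_triples = list(reader.iob_words())
    first_chunk = reader.chunked_words()[0]

print(iob_triples[:2])
print(first_chunk.label(), first_chunk.leaves())
```

Using a temporary directory keeps the sketch from leaving files behind; in practice the corpus root would be wherever the .iob files live.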

Code #3 : Tree Leaves – i.e. the tagged tokens

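# Loading the libraries
# Note: ConllChunkCorpusReader has no paragraph methods, so chunked_paras()
# needs a reader that tracks paragraphs, such as ChunkedCorpusReader.
from nltk.corpus.reader import ChunkedCorpusReader

# Initializing (assumption: a bracketed-chunk corpus file matching
# '.*\.chunk' in the current directory; the output shown below appears
# to come from such a file rather than from the .iob example)
reader = ChunkedCorpusReader('.', r'.*\.chunk')

# Leaves of the first chunk
print(reader.chunked_words()[0].leaves())

# Leaves of the first sentence
print(reader.chunked_sents()[0].leaves())

# Leaves of the first sentence of the first paragraph
print(reader.chunked_paras()[0][0].leaves())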

Output :

[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]

[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'),
('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), ('300', 'CD'),
('jobs', 'NNS'), (', ', ', '), ('the', 'DT'), ('spokesman', 'NN'),
('said', 'VBD'), ('.', '.')]

[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'),
('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), ('300', 'CD'),
('jobs', 'NNS'), (', ', ', '), ('the', 'DT'), ('spokesman', 'NN'),
('said', 'VBD'), ('.', '.')]
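
To see what leaves() does without any corpus file, here is a minimal sketch that builds a small chunk tree by hand with nltk.tree.Tree. The tree is a shortened, hand-written version of the earlier example sentence, not reader output.

```python
from nltk.tree import Tree

# A hand-built sentence tree: chunks are subtrees, and the final
# punctuation sits outside any chunk as a plain (word, pos) tuple
sentence = Tree('S', [
    Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]),
    Tree('VP', [('had', 'VBD'), ('been', 'VBN')]),
    ('.', '.'),
])

# leaves() flattens the tree back into its tagged tokens
tagged_tokens = sentence.leaves()

# subtrees() with a filter picks out chunks of a single type
noun_phrases = [
    subtree.leaves()
    for subtree in sentence.subtrees(lambda t: t.label() == 'NP')
]

print(tagged_tokens)
print(noun_phrases)
```

The leaves are the same (word, pos) pairs a tagged corpus would give, which is why a chunked corpus can double as a tagged one.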

