
Conditional Random Fields (CRFs) for POS tagging in NLP

Part-of-speech (POS) tagging is a classic sequence labeling task in NLP. In this article, we will learn about one method that can be used for it: Conditional Random Fields (CRFs). But first, let us understand what POS tagging is.

What is POS tagging?

Part-of-speech (POS) tagging is the process of assigning grammatical categories, such as nouns, verbs, adjectives, etc., to each word in a sentence. POS tagging is a fundamental task in Natural Language Processing (NLP) and is used in various applications, such as machine translation, sentiment analysis, and text-to-speech synthesis.



Here’s an example of POS tagging for the sentence “She likes to read books”:

Word     POS Tag
She      PRON
likes    VERB
to       PART
read     VERB
books    NOUN


In this example, the word “She” is tagged as a pronoun, “likes” is tagged as a verb, “to” is tagged as a particle, “read” is tagged as a verb, and “books” is tagged as a noun. The POS tags provide information about the syntactic structure of the sentence, which can be used in downstream tasks, such as parsing or sentiment analysis.
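As a small illustration, the tagged sentence above can be written as a list of (word, tag) pairs, which is also the shape of the tagged sentences NLTK returns later in this article. This is only a sketch of the data structure; the variable names are made up for the example.

```python
# A tagged sentence as a list of (word, tag) pairs,
# using the tags from the example above.
tagged_sentence = [
    ("She", "PRON"),
    ("likes", "VERB"),
    ("to", "PART"),
    ("read", "VERB"),
    ("books", "NOUN"),
]

# Separate the words (model input) from the tags (labels to predict).
words, tags = zip(*tagged_sentence)
print(words)
print(tags)
```

A tagger has to produce the second tuple given only the first.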

Conditional Random Fields

A Conditional Random Field (CRF) is a type of probabilistic graphical model often used in Natural Language Processing (NLP) and computer vision tasks. It is a variant of a Markov Random Field (MRF), which is a type of undirected graphical model.



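Let X be the input sequence (the words and their features) and Y be the output sequence of labels. A linear-chain CRF models the conditional probability of Y given X, rather than their joint distribution:

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, X, t) \right)$$

where:

$T$ is the length of the sequence,

$f_k$ are feature functions that look at the previous label $y_{t-1}$, the current label $y_t$, and the input $X$ at position $t$,

$\lambda_k$ are the weights learned during training, one per feature function,

$Z(X)$ is the normalization term, the sum of the exponentiated scores over all possible label sequences, which makes the probabilities sum to 1.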
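To make the idea concrete, here is a minimal sketch of how a linear-chain CRF turns scores into a conditional probability P(Y | X): each candidate label sequence gets a score (a sum of weights), the score is exponentiated, and the result is divided by the sum over all candidate sequences. The weights below are hand-picked for illustration and are not learned from data, and the brute-force enumeration only works for tiny examples; real implementations use dynamic programming instead.

```python
import itertools
import math

LABELS = ["NOUN", "VERB"]

# Hypothetical, hand-picked weights (not learned from data).
emission = {("books", "NOUN"): 2.0, ("books", "VERB"): 0.5,
            ("read", "VERB"): 1.5, ("read", "NOUN"): 0.2}
transition = {("VERB", "NOUN"): 1.0, ("NOUN", "VERB"): 0.3,
              ("NOUN", "NOUN"): 0.1, ("VERB", "VERB"): 0.1}


def score(words, labels):
    """Unnormalized log-score: sum of emission and transition weights."""
    total = sum(emission.get((w, y), 0.0) for w, y in zip(words, labels))
    total += sum(transition.get(pair, 0.0) for pair in zip(labels, labels[1:]))
    return total


def conditional_probs(words):
    """P(Y | X) for every label sequence, normalized by the sum over all of them."""
    candidates = list(itertools.product(LABELS, repeat=len(words)))
    weights = [math.exp(score(words, y)) for y in candidates]
    z = sum(weights)  # the normalization term Z(X)
    return {y: w / z for y, w in zip(candidates, weights)}


probs = conditional_probs(["read", "books"])
best = max(probs, key=probs.get)
print(best, probs[best])
```

With these weights the highest-probability sequence for "read books" is (VERB, NOUN), because both the emission weights and the VERB to NOUN transition weight favor it.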

Here’s an example of using Conditional Random Fields (CRFs) for POS tagging in Python using the sklearn_crfsuite library. First, you’ll need to install the sklearn_crfsuite library using ‘pip’:

pip install sklearn-crfsuite

‘sklearn-crfsuite’ is a Python library that provides an interface to the CRFsuite implementation of Conditional Random Fields (CRFs), a popular machine learning algorithm for sequence labeling tasks such as Part-Of-Speech (POS) tagging and named entity recognition (NER). It wraps the python-crfsuite bindings and exposes a scikit-learn-compatible estimator API (fit, predict), so it fits into familiar scikit-learn workflows.

import nltk
import sklearn_crfsuite
from sklearn_crfsuite import metrics

                    

Then, you can load a dataset of tagged sentences. For example:

# Load the Penn Treebank corpus
nltk.download('treebank')
corpus = nltk.corpus.treebank.tagged_sents()
print(corpus)

                    

Output:

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
 ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB')......

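In this article we use the Penn Treebank sample that ships with NLTK, but you can use your own dataset as long as it has the same shape: a list of sentences, each a list of (word, tag) pairs.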

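Define the feature function

To convert a sentence into a sequence of features that can be used as input to a CRF model, you can define a feature function that extracts relevant information for each word in the sentence. The example feature function below extracts, for each word: the word itself, whether it is the first or last word, capitalization flags, prefixes and suffixes of up to three characters, the previous and next words, and whether the word contains a hyphen, is numeric, or has capital letters after its first character.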

# Define a function to extract features for each word in a sentence.
# `sentence` is a list of (word, tag) pairs; only the word is read here.
def word_features(sentence, i):
    word = sentence[i][0]
    features = {
        'word': word,
        'is_first': i == 0,                           # first word of the sentence
        'is_last': i == len(sentence) - 1,            # last word of the sentence
        'is_capitalized': word[0].upper() == word[0],
        'is_all_caps': word.upper() == word,          # word is in uppercase
        'is_all_lower': word.lower() == word,         # word is in lowercase
        # prefixes of the word
        'prefix-1': word[0],
        'prefix-2': word[:2],
        'prefix-3': word[:3],
        # suffixes of the word
        'suffix-1': word[-1],
        'suffix-2': word[-2:],
        'suffix-3': word[-3:],
        # previous word (empty string at the start of the sentence)
        'prev_word': '' if i == 0 else sentence[i-1][0],
        # next word (empty string at the end of the sentence)
        'next_word': '' if i == len(sentence)-1 else sentence[i+1][0],
        'has_hyphen': '-' in word,                    # word contains a hyphen
        'is_numeric': word.isdigit(),                 # word consists only of digits
        'capitals_inside': word[1:].lower() != word[1:]
    }
    return features

                    

Note that this is just an example feature function and the features you extract may vary depending on your specific use case. You can customize this function to extract any features that you think will be relevant to your sequence labeling task. The next step is to extract features for every sentence in the corpus and split the dataset into a train set and a test set.
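As one possible customization, a "word shape" feature summarizes the character pattern of a word, which can help with unseen tokens such as numbers and abbreviations. The helper below is a hypothetical addition, not part of the original feature function; its name and the X/x/d encoding are assumptions chosen for this sketch.

```python
def word_shape(word):
    """Map each character to X (uppercase), x (lowercase), d (digit), or itself."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)  # keep punctuation such as '.' or '-' as-is
    return "".join(shape)


for token in ["Vinken", "61", "U.S.", "well-known"]:
    print(token, word_shape(token))
```

To use it, you could add an entry such as 'shape': word_shape(word) to the features dictionary returned by word_features.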

# Extract features for each sentence in the corpus
X = []
y = []
for sentence in corpus:
    X_sentence = []
    y_sentence = []
    for i in range(len(sentence)):
        X_sentence.append(word_features(sentence, i))
        y_sentence.append(sentence[i][1])
    X.append(X_sentence)
    y.append(y_sentence)
 
 
# Split the data into training and testing sets
split = int(0.8 * len(X))
X_train = X[:split]
y_train = y[:split]
X_test = X[split:]
y_test = y[split:]
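The extraction loop and split above can also be written with two small helpers and a shuffled split. The sketch below is self-contained so that it runs on its own: toy_features is a hypothetical stand-in for the article's word_features, the toy corpus is invented, and shuffling before splitting is an assumption that you want the test set drawn from across the corpus rather than from its final 20%.

```python
import random


def toy_features(sentence, i):
    """Hypothetical stand-in for the article's word_features function."""
    word = sentence[i][0]
    return {"word": word.lower(), "is_first": i == 0}


def sent2features(sentence, feature_fn):
    """Feature dicts for every position in a tagged sentence."""
    return [feature_fn(sentence, i) for i in range(len(sentence))]


def sent2labels(sentence):
    """Tags of a tagged sentence, in order."""
    return [tag for _, tag in sentence]


# Invented toy corpus in the same shape as nltk's tagged_sents().
toy_corpus = [
    [("She", "PRON"), ("reads", "VERB")],
    [("Books", "NOUN"), ("help", "VERB")],
    [("He", "PRON"), ("writes", "VERB")],
    [("Dogs", "NOUN"), ("bark", "VERB")],
    [("They", "PRON"), ("run", "VERB")],
]

# Shuffle a copy with a fixed seed so the split is reproducible.
sentences = list(toy_corpus)
random.Random(42).shuffle(sentences)

split = int(0.8 * len(sentences))
train, test = sentences[:split], sentences[split:]

X_train = [sent2features(s, toy_features) for s in train]
y_train = [sent2labels(s) for s in train]
X_test = [sent2features(s, toy_features) for s in test]
y_test = [sent2labels(s) for s in test]

print(len(X_train), len(X_test))
```

With the real corpus you would pass word_features instead of toy_features and the Treebank sentences instead of toy_corpus.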

                    

Now, let’s train the CRF model.

# Train a CRF model on the training data
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)
 
# Make predictions on the test data and evaluate the performance
y_pred = crf.predict(X_test)
 
print(metrics.flat_accuracy_score(y_test, y_pred))

                    

Output:

0.9631718149608264
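The score above is a token-level accuracy: as far as I understand it, flat_accuracy_score flattens the per-sentence tag lists into one long list and reports the fraction of tokens whose predicted tag matches the gold tag. The function below is a hand-written sketch of that computation on invented data, not the library's own code.

```python
def flat_accuracy(y_true, y_pred):
    """Token-level accuracy over all sentences, flattened into one list."""
    true_flat = [tag for sent in y_true for tag in sent]
    pred_flat = [tag for sent in y_pred for tag in sent]
    if len(true_flat) != len(pred_flat):
        raise ValueError("gold and predicted sequences differ in length")
    correct = sum(t == p for t, p in zip(true_flat, pred_flat))
    return correct / len(true_flat)


# Invented example: 5 tokens in total, 4 of them tagged correctly.
gold = [["PRON", "VERB", "NOUN"], ["DT", "NN"]]
pred = [["PRON", "VERB", "VERB"], ["DT", "NN"]]
print(flat_accuracy(gold, pred))
```

Because every token counts equally, frequent tags dominate this number; a per-tag breakdown is worth checking when rare tags matter.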

‘sklearn_crfsuite.CRF()’ is a class in the sklearn-crfsuite Python library that represents a Conditional Random Fields (CRF) model. It is used to train and evaluate CRF models for sequence labeling tasks such as Part-Of-Speech (POS) tagging and named entity recognition (NER).

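The CRF() class constructor takes several parameters. The ones used above are:

algorithm: the training algorithm; 'lbfgs' selects gradient descent using the L-BFGS method.

c1: the coefficient for L1 regularization.

c2: the coefficient for L2 regularization.

max_iterations: the maximum number of iterations for the optimization algorithm.

all_possible_transitions: when True, the model also generates transition features for label pairs that never occur in the training data.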

Another way to train a CRF model is to use ‘pycrfsuite.Trainer()’ from the python-crfsuite library, the lower-level binding that sklearn-crfsuite wraps. Let’s see its implementation:

import pycrfsuite
 
# Train a CRF model using pycrfsuite
trainer = pycrfsuite.Trainer(verbose=False)
# Use distinct loop names so the global label list `y` is not overwritten
for x_seq, y_seq in zip(X_train, y_train):
    trainer.append(x_seq, y_seq)
trainer.set_params({
    'c1': 1.0,
    'c2': 1e-3,
    'max_iterations': 50,
    'feature.possible_transitions': True
})
trainer.train('pos.crfsuite')
 
# Tag a new sentence
tagger = pycrfsuite.Tagger()
tagger.open('pos.crfsuite')
words = 'Geeksforgeeks is a best platform for students.'.split()
# word_features reads sentence[i][0], so wrap each word in a 1-tuple;
# passing plain strings would make it see only the first character of each word
sentence = [(word,) for word in words]
features = [word_features(sentence, i) for i in range(len(sentence))]
tags = tagger.tag(features)
print(list(zip(words, tags)))

                    

The result is a list of (word, tag) pairs, one per token. The exact tags depend on the trained model.

The ‘pycrfsuite.Tagger()’ is used for applying the trained model for prediction.
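One practical detail when tagging raw text: splitting on whitespace leaves punctuation attached to words (for example 'students.'), whereas the Treebank training data treats punctuation marks as separate tokens. Below is a minimal regex tokenizer that separates them. It is a rough sketch under that assumption, and it does not reproduce the full Penn Treebank tokenization conventions, such as splitting contractions.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # A word may contain internal hyphens or apostrophes;
    # any other non-space, non-word character becomes its own token.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)


tokens = simple_tokenize('Geeksforgeeks is a best platform for students.')
print(tokens)
```

The resulting tokens can then be passed through the feature function in place of the whitespace-split words.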

Conclusion

CRFs have been shown to be effective for POS tagging in various languages, including English, Chinese, and Arabic. They are also used in other NLP tasks, such as named entity recognition and syntactic parsing.

