Conditional Random Fields (CRFs) for POS tagging in NLP

Part-of-speech (POS) tagging is a classic sequence labeling task in NLP and a long-standing test case for statistical language models. In this article, we will learn about one method that can be used for POS tagging: Conditional Random Fields (CRFs). But first, let us understand what POS tagging is.

What is POS tagging?

Part-of-speech (POS) tagging is the process of assigning grammatical categories, such as nouns, verbs, adjectives, etc., to each word in a sentence. POS tagging is a fundamental task in Natural Language Processing (NLP) and is used in various applications, such as machine translation, sentiment analysis, and text-to-speech synthesis.

Here’s an example of POS tagging for the sentence “She likes to read books”:

Word	POS Tag
She	PRON
likes	VERB
to	PART
read	VERB
books	NOUN

In this example, the word “She” is tagged as a pronoun, “likes” is tagged as a verb, “to” is tagged as a particle, “read” is tagged as a verb, and “books” is tagged as a noun. The POS tags provide information about the syntactic structure of the sentence, which can be used in downstream tasks, such as parsing or sentiment analysis.

Conditional Random Fields

A Conditional Random Field (CRF) is a type of probabilistic graphical model often used in Natural Language Processing (NLP) and computer vision tasks. It is a variant of a Markov Random Field (MRF), which is a type of undirected graphical model.

CRFs are used for structured prediction tasks, where the goal is to predict a structured output based on a set of input features. For example, in NLP, a common structured prediction task is Part-of-Speech (POS) tagging, where the goal is to assign a part-of-speech tag to each word in a sentence. CRFs can also be used for Named Entity Recognition (NER), chunking, and other tasks where the output is a structured sequence.

CRFs are trained using maximum likelihood estimation, which involves optimizing the parameters of the model to maximize the conditional probability of the correct output sequence given the input features. This optimization problem is typically solved using iterative algorithms like gradient descent or L-BFGS.

The formula for a Conditional Random Field (CRF) is similar to that of a Markov Random Field (MRF), except that the probability distribution over output sequences is conditioned on the input features.

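Let X be the input features and Y be the output sequence of length n. The conditional probability distribution of a linear-chain CRF is given by:

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(y_{i-1}, y_i, x_i) \right)$$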

where:

Z(X) is the normalization factor that ensures the distribution sums to 1 over all possible output sequences.
λ_k are the learned model parameters.
f_k(y_{i-1}, y_i, x_i) are the feature functions that take as input the current output state y_i, the previous output state y_{i-1}, and the input features x_i.
These functions can be binary or real-valued, and capture dependencies between the input features and the output sequence.
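To make the roles of Z(X), λ_k, and f_k concrete, here is a minimal sketch of a toy linear-chain CRF with two tags. The weights, the tag set, and the helper names `score` and `probability` are all made up for illustration and are not learned from data. Z(X) is computed by enumerating every possible tag sequence, which is only feasible for tiny examples; real implementations use dynamic programming instead.

```python
import itertools
import math

TAGS = ["NOUN", "VERB"]

# Hypothetical weights (lambda_k) for a toy model. Each key names one binary
# feature function f_k(y_prev, y_cur, x_i) that fires when it matches.
WEIGHTS = {
    ("emit", "dogs", "NOUN"): 2.0,   # word "dogs" tagged NOUN
    ("emit", "bark", "VERB"): 1.5,   # word "bark" tagged VERB
    ("emit", "bark", "NOUN"): 0.5,   # word "bark" tagged NOUN
    ("trans", "NOUN", "VERB"): 1.0,  # NOUN followed by VERB
    ("trans", "NOUN", "NOUN"): 0.2,  # NOUN followed by NOUN
}

def score(words, tags):
    """Sum of lambda_k * f_k over all positions (the exponent in the formula)."""
    total = 0.0
    prev = "START"  # placeholder state before the first word
    for word, tag in zip(words, tags):
        total += WEIGHTS.get(("emit", word, tag), 0.0)
        total += WEIGHTS.get(("trans", prev, tag), 0.0)
        prev = tag
    return total

def probability(words, tags):
    """P(Y | X) = exp(score(X, Y)) / Z(X), with Z(X) found by enumeration."""
    z = sum(math.exp(score(words, list(candidate)))
            for candidate in itertools.product(TAGS, repeat=len(words)))
    return math.exp(score(words, tags)) / z

words = ["dogs", "bark"]
for candidate in itertools.product(TAGS, repeat=len(words)):
    print(candidate, round(probability(words, list(candidate)), 3))
```

Because every candidate sequence is divided by the same Z(X), the four printed probabilities sum to 1, and the sequence with the highest score also has the highest probability.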

Here’s an example of using Conditional Random Fields (CRFs) for POS tagging in Python using the sklearn_crfsuite library. First, you’ll need to install the sklearn_crfsuite library using ‘pip’:

pip install sklearn-crfsuite

‘sklearn-crfsuite’ is a Python library that provides an interface to the CRFsuite implementation of Conditional Random Fields (CRFs), a popular machine learning algorithm for sequence labeling tasks such as Part-Of-Speech (POS) tagging and named entity recognition (NER). The library is a thin wrapper around python-crfsuite that exposes a scikit-learn-style interface (fit, predict), so it can be used alongside scikit-learn, a popular machine-learning library for Python.

Python3

import nltk
import sklearn_crfsuite
from sklearn_crfsuite import metrics

Then, you can load a dataset of tagged sentences. For example:

Python3

# Load the Penn Treebank corpus
nltk.download('treebank')
corpus = nltk.corpus.treebank.tagged_sents()
print(corpus)

Output:

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
 ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB')......

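This article uses the Treebank corpus bundled with NLTK, which is a small sample of the Penn Treebank. You can substitute your own dataset, as long as each sentence is a list of (word, tag) pairs.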

Define the feature function

To convert a sentence into a sequence of features that can be used as input to a CRF model, you can define a feature function that extracts relevant information from each word in the sentence. Here’s an example feature function that extracts the following features for each word in the sentence:

The word itself.
Whether the word is the first word in the sentence.
Whether the word is the last word in the sentence.
Whether the word is capitalized, all uppercase, or all lowercase.
The first one, two, and three characters of the word (prefixes).
The last one, two, and three characters of the word (suffixes).
The previous word in the sentence.
The next word in the sentence.
Whether the word contains a hyphen.
Whether the word consists only of digits.
Whether the word has capital letters after its first character.

Python3

# Define a function to extract features for each word in a sentence
def word_features(sentence, i):
    # sentence is a list of (word, tag) pairs; i is the index of the current word
    word = sentence[i][0]
    features = {
        'word': word,
        'is_first': i == 0,                    # first word in the sentence
        'is_last': i == len(sentence) - 1,     # last word in the sentence
        # True when the first character is unchanged by upper(),
        # so digits and punctuation also count
        'is_capitalized': word[0].upper() == word[0],
        'is_all_caps': word.upper() == word,   # word has no lowercase letters
        'is_all_lower': word.lower() == word,  # word has no uppercase letters
        # prefixes of the word
        'prefix-1': word[0],
        'prefix-2': word[:2],
        'prefix-3': word[:3],
        # suffixes of the word
        'suffix-1': word[-1],
        'suffix-2': word[-2:],
        'suffix-3': word[-3:],
        # previous word (empty string at the start of the sentence)
        'prev_word': '' if i == 0 else sentence[i-1][0],
        # next word (empty string at the end of the sentence)
        'next_word': '' if i == len(sentence)-1 else sentence[i+1][0],
        'has_hyphen': '-' in word,             # word contains a hyphen
        'is_numeric': word.isdigit(),          # word consists only of digits
        # an uppercase letter appears after the first character
        'capitals_inside': word[1:].lower() != word[1:]
    }
    return features

Note that this is just an example feature function, and the features you extract may vary depending on your specific use case. You can customize this function to extract any features that you think will be relevant to your sequence labeling task. The next step is to extract features for every sentence in the corpus and split the dataset into a training set and a test set.
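As one possible customization, the sketch below adds a coarse "word shape" feature and a length feature on top of an existing feature dictionary. The helpers `word_shape` and `extended_features` are hypothetical names introduced here for illustration; they are not part of sklearn-crfsuite or NLTK, and whether these features help on your data is something to check empirically.

```python
def word_shape(word):
    """Map a token to a coarse shape, e.g. 'Vinken' -> 'Xx' and '61' -> 'd'."""
    shape = []
    for ch in word:
        if ch.isupper():
            code = 'X'
        elif ch.islower():
            code = 'x'
        elif ch.isdigit():
            code = 'd'
        else:
            code = ch  # keep punctuation such as '-' or '.' as-is
        # Collapse runs of the same class so words of any length share a shape
        if not shape or shape[-1] != code:
            shape.append(code)
    return ''.join(shape)

def extended_features(base_features, word):
    """Return a copy of an existing feature dict with extra features added."""
    features = dict(base_features)
    features['shape'] = word_shape(word)
    features['length'] = len(word)
    return features

# Example: extend a minimal feature dict for a single token
print(extended_features({'word': 'mid-1990s'}, 'mid-1990s'))
```

Returning a copy rather than mutating the input keeps the original feature function usable on its own, so the two feature sets can be compared on the same data.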

Python3

# Extract features for each sentence in the corpus
X = []
y = []
for sentence in corpus:
    X_sentence = []
    y_sentence = []
    for i in range(len(sentence)):
        X_sentence.append(word_features(sentence, i))
        y_sentence.append(sentence[i][1])
    X.append(X_sentence)
    y.append(y_sentence)

# Split the data into training and testing sets
split = int(0.8 * len(X))
X_train = X[:split]
y_train = y[:split]
X_test = X[split:]
y_test = y[split:]
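The split above takes the first 80% of the sentences for training and the rest for testing, in corpus order. If you would rather draw the test sentences at random, a seeded shuffle is one option. The following is a minimal sketch using only the standard library; `shuffled_split` is a hypothetical helper, and the toy data stands in for the real X and y lists.

```python
import random

def shuffled_split(X, y, train_fraction=0.8, seed=42):
    """Shuffle sentence indices with a fixed seed, then split X and y together."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    cut = int(train_fraction * len(indices))
    train_idx, test_idx = indices[:cut], indices[cut:]
    return ([X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx], [y[i] for i in test_idx])

# Toy stand-in data: ten one-word "sentences" with matching labels
X_demo = [[{'word': str(i)}] for i in range(10)]
y_demo = [[str(i)] for i in range(10)]
X_tr, y_tr, X_te, y_te = shuffled_split(X_demo, y_demo)
print(len(X_tr), len(X_te))  # 8 2
```

Shuffling indices rather than the lists themselves keeps each feature sequence aligned with its tag sequence.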

Now, let’s train the CRF model.

Python3

# Train a CRF model on the training data
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

# Make predictions on the test data and evaluate the performance
y_pred = crf.predict(X_test)

print(metrics.flat_accuracy_score(y_test, y_pred))

Output:

0.9631718149608264
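The score above is a token-level accuracy: the per-sentence tag lists are flattened into one long list, and the fraction of matching tags is reported. The sketch below reimplements that idea in plain Python to show what is being measured, and adds a per-tag breakdown. The function names and the toy tag lists are illustrative assumptions, not the library's own code.

```python
from collections import Counter

def flat_accuracy(y_true, y_pred):
    """Fraction of tokens, across all sentences, whose predicted tag is correct."""
    correct = total = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        for t, p in zip(true_tags, pred_tags):
            correct += int(t == p)
            total += 1
    return correct / total if total else 0.0

def per_tag_recall(y_true, y_pred):
    """For each gold tag, the fraction of its tokens that were predicted correctly."""
    seen, hit = Counter(), Counter()
    for true_tags, pred_tags in zip(y_true, y_pred):
        for t, p in zip(true_tags, pred_tags):
            seen[t] += 1
            if t == p:
                hit[t] += 1
    return {tag: hit[tag] / seen[tag] for tag in seen}

# Toy gold and predicted tags for two sentences (one error: VBZ predicted as NN)
gold = [["DT", "NN", "VBZ"], ["NNP", "VBD"]]
pred = [["DT", "NN", "NN"], ["NNP", "VBD"]]
print(flat_accuracy(gold, pred))   # 0.8
print(per_tag_recall(gold, pred))
```

A single accuracy number can hide weak performance on rare tags, which is why a per-tag view is often worth checking alongside it.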

‘sklearn_crfsuite.CRF()’ is a class in the sklearn-crfsuite Python library that represents a Conditional Random Fields (CRF) model. It is used to train and evaluate CRF models for sequence labeling tasks such as Part-Of-Speech (POS) tagging and named entity recognition (NER).

The CRF() class constructor takes several parameters:

algorithm: The optimization algorithm to use for training the CRF model. Possible values are ‘lbfgs’, ‘l2sgd’, ‘ap’, ‘pa’, and ‘arow’. The default is ‘lbfgs’.
c1: The coefficient for L1 regularization. The default is 0.
c2: The coefficient for L2 regularization. The default is 1.0.
max_iterations: The maximum number of iterations to run the optimization algorithm. The default depends on the algorithm: unlimited for ‘lbfgs’, 1000 for ‘l2sgd’, and 100 for ‘ap’, ‘pa’, and ‘arow’.
all_possible_transitions: Whether to also generate transition features for tag pairs that never occur in the training data. The default is False.
verbose: Whether to output progress messages during training. The default is False.

Another way to train a CRF model is to use ‘pycrfsuite.Trainer()’, which is part of the python-crfsuite library. The trained model is saved to a file, and ‘pycrfsuite.Tagger()’ then loads that file and applies the model to new sentences. Let’s see how this works:

Python3

import pycrfsuite

# Train a CRF model using pycrfsuite
trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)
trainer.set_params({
    'c1': 1.0,
    'c2': 1e-3,
    'max_iterations': 50,
    'feature.possible_transitions': True
})
trainer.train('pos.crfsuite')

# Tag a new sentence
tagger = pycrfsuite.Tagger()
tagger.open('pos.crfsuite')
# Treebank treats punctuation as separate tokens, so the final period is split off
words = 'Geeksforgeeks is a best platform for students .'.split()
# word_features expects (word, tag) pairs, so pair each word with a placeholder tag
sentence = [(word, '') for word in words]
features = [word_features(sentence, i) for i in range(len(sentence))]
tags = tagger.tag(features)
print(list(zip(words, tags)))

This prints a list of (word, tag) pairs, one for each token in the sentence. The exact tags depend on the trained model. Note that ‘word_features’ reads the word from the first element of each item, so passing it a plain list of strings would compute features from single characters instead of whole words.
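Applying a trained linear-chain CRF to a sentence means finding the tag sequence with the highest total score, which is normally done with the Viterbi algorithm rather than by trying every sequence. The sketch below shows the idea on a toy model. The `viterbi` function and the emission and transition scores are hypothetical stand-ins for learned weights, not the CRFsuite implementation.

```python
def viterbi(words, tags, emit, trans):
    """Return the highest-scoring tag sequence under additive scores."""
    if not words:
        return []
    # best[i][t] = best score of any tag sequence for words[:i+1] ending in t
    best = [{t: emit.get((words[0], t), 0.0) for t in tags}]
    back = []  # back[i][t] = tag chosen at position i for tag t at position i+1
    for word in words[1:]:
        scores, pointers = {}, {}
        for cur in tags:
            # Pick the previous tag that gives the best score when followed by cur
            prev = max(tags, key=lambda p: best[-1][p] + trans.get((p, cur), 0.0))
            scores[cur] = (best[-1][prev] + trans.get((prev, cur), 0.0)
                           + emit.get((word, cur), 0.0))
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Start from the best final tag and follow the back-pointers
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Hypothetical scores standing in for learned weights
demo_tags = ["DT", "NN", "VB"]
demo_emit = {("the", "DT"): 2.0, ("dog", "NN"): 1.5, ("dog", "VB"): 0.2,
             ("barks", "VB"): 1.0, ("barks", "NN"): 0.8}
demo_trans = {("DT", "NN"): 1.0, ("NN", "VB"): 1.0, ("NN", "NN"): 0.3}
print(viterbi(["the", "dog", "barks"], demo_tags, demo_emit, demo_trans))
# ['DT', 'NN', 'VB']
```

The work grows with the sentence length times the square of the number of tags, instead of exponentially with sentence length as exhaustive enumeration would.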

Conclusion

CRFs have been shown to be effective for POS tagging in various languages, including English, Chinese, and Arabic. They are also used in other NLP tasks, such as named entity recognition and syntactic parsing.
