Keyword Extraction Methods in NLP

Keyword extraction is a vital task in Natural Language Processing (NLP): identifying the most relevant words or phrases in a text, which gives a quick summary of its content. This article covers the basics of keyword extraction, its significance in NLP, and implementations in Python using PyTextRank, KeyBERT, rake-nltk (built on NLTK), YAKE, and spaCy.

Significance of Keyword Extraction in NLP

Keyword extraction is a technique used to identify and extract the most relevant words or phrases from a piece of text. In NLP it underpins tasks such as document summarization, search indexing, topic analysis, and content recommendation, because it condenses a document into the few terms that describe it best.

Libraries Required for Keyword Extraction

The implementations below use spaCy, PyTextRank, KeyBERT, rake-nltk (which builds on NLTK), and YAKE; each section installs the relevant package with pip before using it.

Keyword Extraction using TextRank

TextRank is a graph-based ranking algorithm used for keyword extraction in natural language processing (NLP) and information retrieval. Adapted from PageRank, it builds a graph of candidate words in a document and ranks them by how strongly they are connected, so it captures the structure of the document rather than relying on raw word counts alone.

Implementation of TextRank Using Python

  1. Install PyTextRank: !pip3 install pytextrank installs the PyTextRank package.
  2. Setup spaCy and PyTextRank: Load spaCy's English model and add PyTextRank to the pipeline with nlp = spacy.load("en_core_web_sm") and nlp.add_pipe("textrank").
  3. Process Text: Process the text using doc = nlp("TextRank is a keyword extraction algorithm...") to apply TextRank.
  4. Extract Keywords: Iterate over doc._.phrases[:10] to print the top-ranked phrases, which are the extracted keywords.
# Installation
!pip3 install pytextrank

import spacy
import pytextrank

# example text
text = "TextRank is a keyword extraction algorithm based on PageRank and has been widely used in natural language processing tasks."

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank")
doc = nlp(text)

# examine the top-ranked phrases in the document
for phrase in doc._.phrases[:10]:
    print(phrase.text)

Output:

PageRank
a keyword extraction algorithm
TextRank
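
Each entry in doc._.phrases also exposes a rank score and an occurrence count, which helps when you want to threshold or weight the phrases rather than just list them. A minimal sketch, reusing the same pipeline as above (rank and count are the attribute names documented by PyTextRank):

import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")
doc = nlp("TextRank is a keyword extraction algorithm based on PageRank and has been widely used in natural language processing tasks.")

# each phrase carries its TextRank score and how often it occurs
for phrase in doc._.phrases[:5]:
    print(phrase.text, "rank:", round(phrase.rank, 4), "count:", phrase.count)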

Keyword Extraction using KeyBERT

KeyBERT is a simple, intuitive keyword extraction method built on BERT embeddings. It embeds the document and its candidate words or phrases with the same model, then uses cosine similarity to select the candidates whose embeddings are closest to the document's embedding, i.e. the terms most representative of the text as a whole.

Implementation of KeyBERT Using Python

# Installation
!pip install keybert
from keybert import KeyBERT

# Initialize the KeyBERT model
model = KeyBERT('distilbert-base-nli-mean-tokens')

# Example text
text = """
         Transformers provides thousands of pre-trained models to perform tasks on texts such as classification, 
         information extraction, question answering, summarization, translation, text generation, etc. 
         Each architecture is designed with a specific task in mind.
       """

# Extract keywords
keywords = model.extract_keywords(text)

# Print the keywords
print("Keywords:")
for keyword in keywords:
    print(keyword)

Output:

Keywords:
('transformers', 0.3629)
('trained', 0.2314)
('thousands', 0.2114)
('architecture', 0.1905)
('perform', 0.1793)
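
extract_keywords also takes parameters that control the candidate set, such as keyphrase_ngram_range for multi-word phrases, stop_words, and top_n; the values below are illustrative rather than the only sensible choices.

from keybert import KeyBERT

model = KeyBERT('distilbert-base-nli-mean-tokens')
text = "Transformers provides thousands of pre-trained models to perform tasks on texts."

# up to 5 keyphrases of one or two words, with English stopwords removed
keywords = model.extract_keywords(text,
                                  keyphrase_ngram_range=(1, 2),
                                  stop_words='english',
                                  top_n=5)
print(keywords)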

Keyword Extraction using RAKE

RAKE (Rapid Automatic Keyword Extraction) automatically extracts keywords and important phrases from text. In Python it is available through the rake_nltk module, which builds on the Natural Language Toolkit (NLTK), a well-known natural language processing package.

  1. The text is first cleaned of superfluous characters such as punctuation, and stopwords (commonly used words like "the," "is," etc.) are identified, since they delimit candidate phrases rather than belong to them.
  2. Next, the text is split into candidate keywords: contiguous runs of words between stopwords, punctuation, and whitespace boundaries.
  3. Each candidate is scored from how frequently its words appear in the text and how often they co-occur with other words; a common formulation scores each word as degree/frequency and sums these scores over the words of the phrase (see the sketch after this list).
  4. Lastly, the candidates are ranked by score, and the top-ranked keywords form the final output.
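
To make the scoring step concrete, here is a minimal, self-contained sketch of RAKE-style degree/frequency scoring, assuming the candidate phrases have already been split out; rake_nltk performs all of this internally.

from collections import defaultdict

# candidate phrases as word lists (normally produced by splitting
# the text on stopwords and punctuation)
phrases = [["rapid", "automatic", "keyword", "extraction"],
           ["keyword", "extraction", "algorithm"],
           ["text", "document"]]

freq = defaultdict(int)    # how often each word appears across candidates
degree = defaultdict(int)  # word frequency plus co-occurrences

for phrase in phrases:
    for word in phrase:
        freq[word] += 1
        # degree counts the word itself plus its neighbours in the phrase
        degree[word] += len(phrase)

# a phrase's score is the sum of degree/frequency over its words
for phrase in phrases:
    score = sum(degree[w] / freq[w] for w in phrase)
    print(" ".join(phrase), "->", score)

For these example phrases the resulting scores (15.0, 10.0, 4.0) agree with the rake_nltk output shown below.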

Implementation of RAKE Using Python

# installation
!pip3 install rake-nltk

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from rake_nltk import Rake

# Create a Rake instance
r = Rake()

# Text from which keywords will be extracted
text = "RAKE (Rapid Automatic Keyword Extraction) is a keyword extraction algorithm that automatically identifies relevant keywords and phrases in a text document."

# Extract keywords from the text
r.extract_keywords_from_text(text)

# Get the ranked keywords
keywords = r.get_ranked_phrases_with_scores()

# Print the extracted keywords and their scores
for score, kw in keywords:
    print("Keyword:", kw, "Score:", score)

Output:

Keyword: automatically identifies relevant keywords Score: 16.0
Keyword: rapid automatic keyword extraction Score: 15.0
Keyword: keyword extraction algorithm Score: 10.0
Keyword: text document Score: 4.0
Keyword: rake Score: 1.0
Keyword: phrases Score: 1.0

Keyword Extraction using YAKE

YAKE (Yet Another Keyword Extractor) is an unsupervised keyword extraction algorithm in the same spirit as RAKE, but with a different scoring approach. It extracts keywords from text documents based on their statistical significance and contextual relevance, is designed to be language-independent, and works well across different domains and languages.

  1. Text Preprocessing: Similar to RAKE, YAKE starts by removing unnecessary punctuation, stopwords, and other parts of the text.
  2. Candidate Selection: YAKE identifies potential keywords or key phrases from the preprocessed text. YAKE may also consider multi-word expressions and collocations as potential candidates.
  3. Term Scoring: Each candidate keyword or key phrase is scored from local statistical features of the document, such as term frequency, casing, position in the text, relatedness to the surrounding context, and how widely the term is spread across sentences. Unlike TF-IDF, no external corpus is needed, and in YAKE a lower score indicates a more relevant term.
  4. Keyword Extraction: YAKE then ranks the candidates by score and returns the best-scoring ones. A deduplication step in post-processing removes near-duplicate or redundant terms to improve the overall quality of the extracted keywords.

Implementation of YAKE Using Python

  1. Importing the KeywordExtractor class: From the yake module, you import the KeywordExtractor class, which is used for extracting keywords from text.
  2. Creating a KeywordExtractor instance: You create an instance of the KeywordExtractor class, which will be used to extract keywords.
  3. Defining the text: You define a sample text from which keywords will be extracted.
  4. Extracting keywords: You use the extract_keywords method of the kw_extractor instance to extract keywords from the text. The extracted keywords are stored in the keywords variable.
  5. Printing the keywords: You iterate over the keywords list and print each keyword along with its score.
#installation 
!pip install yake
from yake import KeywordExtractor

# Create a KeywordExtractor instance
kw_extractor = KeywordExtractor()

# Text from which keywords will be extracted
text = "YAKE (Yet Another Keyword Extractor) is a Python library for extracting keywords from text."

# Extract keywords from the text
keywords = kw_extractor.extract_keywords(text)

# Print the extracted keywords and their scores
for kw, score in keywords:
    print("Keyword:", kw, "Score:", score)

Output:

Keyword: Keyword Extractor Score: 0.010798580847666514
Keyword: Python library Score: 0.017391962598404163
Keyword: extracting keywords Score: 0.03240529839631463
Keyword: YAKE Score: 0.034026762452505785
Keyword: library for extracting Score: 0.03498702377830618
Keyword: keywords from text Score: 0.046998648139851405
Keyword: Extractor Score: 0.06257809066078279
Keyword: Python Score: 0.0929767246050301
Keyword: text Score: 0.11246769819744629
Keyword: Keyword Score: 0.17071817227943928
Keyword: keywords Score: 0.17071817227943928
Keyword: library Score: 0.1838594885424691
Keyword: extracting Score: 0.1838594885424691
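
KeywordExtractor also accepts constructor parameters, including lan (language), n (maximum n-gram length), dedupLim (deduplication threshold), and top (number of results); the values below are chosen purely for illustration.

from yake import KeywordExtractor

# restrict candidates to bigrams at most and return only the 5 best;
# remember that lower YAKE scores mean more relevant terms
kw_extractor = KeywordExtractor(lan="en", n=2, dedupLim=0.9, top=5)

text = "YAKE (Yet Another Keyword Extractor) is a Python library for extracting keywords from text."
for kw, score in kw_extractor.extract_keywords(text):
    print("Keyword:", kw, "Score:", score)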

Keyword Extraction using spaCy

spaCy is an open-source library for advanced natural language processing (NLP) in Python. It's designed to be fast, efficient, and user-friendly, making it suitable for building applications that process and understand large volumes of text.

Implementation of spaCy Using Python

  1. Importing libraries: import spacy imports the spaCy library, from collections import Counter imports the Counter class from the collections module, and from string import punctuation imports the punctuation constant from the string module.
  2. Loading spaCy model: nlp = spacy.load("en_core_web_sm") loads the English language model en_core_web_sm provided by spaCy.
  3. Defining get_hotwords function: This function takes a text as input, tokenizes it using spaCy, and filters out tokens that are stop words or punctuation. It then selects tokens that are proper nouns, adjectives, or nouns and returns them as a list.
  4. Processing text: The input text is passed to get_hotwords, and the returned list of candidate keywords is stored in the output variable.
  5. Finding most common hotwords: Counter(output).most_common(10) counts how often each candidate occurs and returns the 10 most frequent hotwords. Counting the full list, rather than a deduplicated set, is what makes the frequencies meaningful.
  6. Printing the most common hotwords: The for loop iterates over the most common hotwords (most_common_list) and prints each hotword along with its count.
#installation
!pip3 install spacy
import spacy
from collections import Counter
from string import punctuation
nlp = spacy.load("en_core_web_sm")
def get_hotwords(text):
    result = []
    pos_tag = ['PROPN', 'ADJ', 'NOUN'] 
    doc = nlp(text.lower()) 
    for token in doc:
        if(token.text in nlp.Defaults.stop_words or token.text in punctuation):
            continue
        if(token.pos_ in pos_tag):
            result.append(token.text)
    return result
text = """
spaCy is an open-source natural language processing library for Python.
"""
output = get_hotwords(text)
most_common_list = Counter(output).most_common(10)
for item in most_common_list:
    print(item[0])

Output:

library
processing
open
spacy
python
source
natural
language
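
As an alternative to filtering individual tokens by part of speech, spaCy's built-in noun_chunks iterator yields base noun phrases, which make natural multi-word keyword candidates. A minimal sketch, assuming the same en_core_web_sm model:

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
doc = nlp("spaCy is an open-source natural language processing library for Python.")

# base noun phrases serve as multi-word keyword candidates
chunks = [chunk.text.lower() for chunk in doc.noun_chunks]
for phrase, count in Counter(chunks).most_common(10):
    print(phrase, count)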

Conclusion

Keyword extraction remains a crucial NLP task: it surfaces the most relevant words or phrases in a text and thereby summarizes its content. This article walked through graph-based (TextRank), embedding-based (KeyBERT), and statistical (RAKE, YAKE) approaches, along with a simple part-of-speech filter in spaCy, each with a runnable Python example.