Keyword extraction is a vital task in Natural Language Processing (NLP) for identifying the most relevant words or phrases in a text, providing insight into its content. This article explores the basics of keyword extraction, its significance in NLP, and several implementation methods using Python libraries such as PyTextRank, KeyBERT, rake_nltk, YAKE, and spaCy.
Significance of Keyword Extraction in NLP
Keyword extraction is a technique used to identify and extract the most relevant words or phrases from a piece of text. The significance of keyword extraction in natural language processing (NLP) is discussed below:
- Information Retrieval: Keywords function as queries to retrieve pertinent items from extensive text collections or databases.
- Document Summarization: Extracted keywords can be used to create succinct summaries of documents that capture their essential content.
- Text Categorization and Classification: Keywords help determine the primary subjects or categories of documents, which aids in classifying them.
- Search Engine Optimization (SEO): Keywords are essential for enhancing the visibility and ranking of site content in search engine results pages.
- Topic Modeling: Keyword extraction is an essential stage in topic modeling, as it helps identify the fundamental themes or topics present in a collection of documents.
Libraries Required for Keyword Extraction
- NLTK: provides building blocks for text processing, such as tokenization, stopword lists, and stemming, on which keyword extraction pipelines (e.g., TF-IDF or RAKE based) rely.
- YAKE library: offers a Python version of the YAKE algorithm, which is used for unsupervised keyword extraction.
- RAKE: not a library in itself, but easily implemented with Python's string manipulation tools and basic text processing techniques; the rake_nltk package provides a ready-made implementation.
- KeyBERT: built on a transformer model (BERT); it uses BERT embeddings to identify the keywords and phrases most similar to the document as a whole.
Keyword Extraction using TextRank
TextRank is an algorithm used for keyword extraction in natural language processing (NLP) and information retrieval. It is a graph-based ranking algorithm, adapted from PageRank, that improves on traditional frequency-based methods by considering the structure and content of a document.
- Text Preprocessing: The algorithm starts by preprocessing the text, which involves tasks like tokenization (breaking the text into words or phrases), removing stop words (common words like "the," "and," "is" that don't carry significant meaning), and possibly stemming or lemmatization (reducing words to their root form).
- Building a Graph: TextRank constructs a graph representation of the text, where nodes represent words or phrases and edges represent relationships between them. These relationships can be based on co-occurrence within sentences or paragraphs, semantic similarity, or other linguistic features.
- Scoring Nodes: Each node (word or phrase) in the graph is assigned a score based on various factors. TextRank typically considers the node's degree (how many other nodes it is connected to), its centrality within the graph, its position within the document, and possibly its semantic relevance to the overall content.
- Ranking Keywords: After scoring all nodes in the graph, TextRank ranks them by score. The highest-scoring nodes are considered the most important keywords or phrases in the document.
- Keyword Extraction: Finally, TextRank selects the top-ranked nodes as the extracted keywords for the document. These keywords represent the main themes or topics discussed in the text.
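The steps above can be sketched in plain Python: build a co-occurrence graph over a sliding window and score nodes with the PageRank update rule. This is a simplified illustration (naive whitespace tokenizer, hand-made stopword set, no POS filtering or phrase chunking), not PyTextRank's actual implementation.

```python
from collections import defaultdict

def textrank_keywords(text, stop_words, window=2, damping=0.85, iterations=50):
    # 1. Preprocess: lowercase, keep alphabetic tokens, drop stop words
    tokens = [t for t in text.lower().split() if t.isalpha() and t not in stop_words]
    # 2. Build an undirected co-occurrence graph over a sliding window
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                neighbors[word].add(tokens[j])
                neighbors[tokens[j]].add(word)
    # 3. Score nodes with the PageRank update rule
    score = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        for w in neighbors:
            score[w] = (1 - damping) + damping * sum(
                score[n] / len(neighbors[n]) for n in neighbors[w])
    # 4. Rank: the highest-scoring words are the keywords
    return sorted(score, key=score.get, reverse=True)

stops = {"is", "a", "and", "the", "by", "their", "in"}
text = ("TextRank is a graph based ranking algorithm and the algorithm "
        "ranks words by their connections in the graph")
print(textrank_keywords(text, stops)[:5])
```

Well-connected words such as "graph" and "algorithm" rise to the top because score flows to them from many neighbors, which is the same intuition behind PyTextRank's phrase ranking.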
Implementation of TextRank Using Python
- Install PyTextRank: !pip3 install pytextrank installs the PyTextRank package.
- Set up spaCy and PyTextRank: load spaCy's English model with nlp = spacy.load("en_core_web_sm") and add PyTextRank to the pipeline with nlp.add_pipe("textrank").
- Process Text: run doc = nlp("TextRank is a keyword extraction algorithm...") to apply TextRank to the text.
- Extract Keywords: iterate over doc._.phrases[:10] to print the top-ranked phrases, which are the extracted keywords.
# Installation
!pip3 install pytextrank
import spacy
import pytextrank
# example text
text = "TextRank is a keyword extraction algorithm based on PageRank and has been widely used in natural language processing tasks."
# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")
# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank")
doc = nlp(text)
# examine the top-ranked phrases in the document
for phrase in doc._.phrases[:10]:
    print(phrase.text)
Output:
PageRank
a keyword extraction algorithm
TextRank
Keyword Extraction using KeyBERT
KeyBERT is a simple and intuitive keyword extraction method that uses BERT embeddings to find the keywords and keyphrases most related to a given document. It embeds the document and candidate sub-phrases with BERT and uses cosine similarity to identify the candidates most similar to the document as a whole.
- Input Text: KeyBERT is given a piece of text, such as a document, article, or paragraph, from which keywords are to be extracted.
- BERT Embeddings: KeyBERT utilizes BERT to convert words or phrases in the input text into high-dimensional vectors called embeddings. BERT's contextual embeddings capture the meaning of words based on the surrounding context.
- Keyword Extraction: Once the text is embedded, KeyBERT applies techniques such as cosine similarity or other clustering methods to identify the most important words or phrases in the text. These keywords represent the most salient terms that capture the main topics or themes of the input text.
- Output: KeyBERT returns a list of extracted keywords, ranked by their relevance or importance in the original text.
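The core of the pipeline above can be illustrated without a real model: embed the document and each candidate, then rank candidates by cosine similarity. The 4-dimensional vectors below are made-up stand-ins for BERT embeddings, so only the ranking logic is meaningful.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical "embeddings" standing in for real BERT vectors
doc_vec = [0.8, 0.1, 0.6, 0.2]                # the whole document
candidates = {
    "transformers": [0.7, 0.2, 0.5, 0.1],     # close to the document vector
    "pizza recipe": [0.1, 0.9, 0.0, 0.8],     # off-topic candidate
    "pre-trained models": [0.6, 0.1, 0.7, 0.3],
}

# Rank candidates by similarity to the document vector
ranked = sorted(candidates, key=lambda c: cosine(doc_vec, candidates[c]), reverse=True)
print(ranked)  # candidates most similar to the document come first
```

With real BERT embeddings, KeyBERT performs exactly this comparison over n-gram candidates drawn from the input text.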
Implementation of KeyBERT Model Using Python
- The code initializes the KeyBERT model with the 'distilbert-base-nli-mean-tokens' model, a DistilBERT model fine-tuned for natural language inference tasks.
- A multi-line string text is defined, containing the input text from which keywords will be extracted.
- The model.extract_keywords(text) method extracts keywords from the input text.
- The extracted keywords are then printed to the console.
#Installation
!pip install keybert
from keybert import KeyBERT
# Initialize the KeyBERT model
model = KeyBERT('distilbert-base-nli-mean-tokens')
# Example text
text = """
Transformers provides thousands of pre-trained models to perform tasks on texts such as classification,
information extraction, question answering, summarization, translation, text generation, etc.
Each architecture is designed with a specific task in mind.
"""
# Extract keywords
keywords = model.extract_keywords(text)
# Print the keywords
print("Keywords:")
for keyword in keywords:
    print(keyword)
Output:
Keywords:
('transformers', 0.3629)
('trained', 0.2314)
('thousands', 0.2114)
('architecture', 0.1905)
('perform', 0.1793)
Keyword Extraction using RAKE
RAKE (Rapid Automatic Keyword Extraction) is used to automatically extract keywords and important phrases from text documents. This approach is implemented in Python using the rake_nltk module, which builds on the Natural Language Toolkit (NLTK), a well-known natural language processing package.
- The text is cleaned up to eliminate any superfluous characters, such as punctuation and stopwords (often used words like "the," "is," etc.), before keywords are extracted.
- Next, the text is divided into individual words and phrases (candidate keywords). Usually, to accomplish this, the text is divided according to punctuation and whitespace.
- Each possible keyword is given a score based on how frequently it appears in the text and how frequently it occurs alongside other words. Words that occur frequently but are not stopwords, and that occur close to one another, receive higher ratings from RAKE.
- Lastly, the candidate keywords are ranked by their scores, and the top-ranked candidates form the final output.
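The scoring scheme described above can be sketched in a few lines of plain Python. This is a simplified illustration with a hand-made stopword list, not the rake_nltk implementation, but it follows the same degree/frequency scoring idea.

```python
import re
from collections import defaultdict

STOPWORDS = {"is", "a", "that", "and", "in", "the", "of"}

def rake(text):
    # 1. Split into candidate phrases at stopwords and non-letter characters
    words = re.findall(r"[a-zA-Z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    # 2. Score each word as degree / frequency
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)  # degree counts co-occurring words, incl. itself
    word_score = {w: degree[w] / freq[w] for w in freq}
    # 3. A phrase's score is the sum of its word scores
    return sorted(((sum(word_score[w] for w in p), " ".join(p)) for p in phrases),
                  reverse=True)

results = rake("RAKE is a keyword extraction algorithm that ranks phrases in a text document")
for score, phrase in results:
    print(score, phrase)
```

Longer phrases of non-stopwords accumulate more degree, which is why multi-word phrases like "keyword extraction algorithm" outscore single words, just as in the rake_nltk output below.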
Implementation of rake_nltk Using Python
# installation
!pip3 install rake-nltk
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from rake_nltk import Rake
# Create a Rake instance
r = Rake()
# Text from which keywords will be extracted
text = "RAKE (Rapid Automatic Keyword Extraction) is a keyword extraction algorithm that automatically identifies relevant keywords and phrases in a text document."
# Extract keywords from the text
r.extract_keywords_from_text(text)
# Get the ranked keywords
keywords = r.get_ranked_phrases_with_scores()
# Print the extracted keywords and their scores
for score, kw in keywords:
    print("Keyword:", kw, "Score:", score)
Output:
Keyword: automatically identifies relevant keywords Score: 16.0
Keyword: rapid automatic keyword extraction Score: 15.0
Keyword: keyword extraction algorithm Score: 10.0
Keyword: text document Score: 4.0
Keyword: rake Score: 1.0
Keyword: phrases Score: 1.0
Keyword Extraction using YAKE
YAKE (Yet Another Keyword Extractor) is another keyword extraction algorithm similar to RAKE, but with some enhancements and differences in its approach (Unsupervised Approach). YAKE focuses on extracting keywords from text documents based on their statistical significance and contextual relevance. It is designed to be language-independent and works well across different domains and languages.
- Text Preprocessing: Similar to RAKE, YAKE starts by removing unnecessary punctuation, stopwords, and other parts of the text.
- Candidate Selection: YAKE identifies potential keywords or key phrases from the preprocessed text. YAKE may also consider multi-word expressions and collocations as potential candidates.
- Term Scoring: Each candidate keyword or key phrase is assigned a score based on its statistical significance within the document. YAKE combines local statistical features such as term frequency, position of first occurrence, casing, and co-occurrence with neighboring words; lower scores indicate more relevant terms.
- Keyword Extraction: YAKE then ranks the candidates by score (in YAKE, lower-scoring candidates are more relevant) and selects the best-scoring candidates as output. Post-processing techniques may also be applied to eliminate redundant or irrelevant terms and improve the overall quality of the extracted keywords.
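As a rough illustration of this kind of statistical scoring, the toy function below combines just two of the features mentioned above, term frequency and first-occurrence position, into a lower-is-better score. This is a deliberately simplified stand-in, not YAKE's actual formula, which also weighs casing, sentence spread, and co-occurrence context.

```python
def toy_scores(text, stop_words):
    # Naive tokenization: strip simple punctuation, lowercase, drop stopwords
    words = [w.strip(".,()").lower() for w in text.split()]
    words = [w for w in words if w and w not in stop_words]
    stats = {}
    for pos, w in enumerate(words):
        if w not in stats:
            stats[w] = {"first_pos": pos, "freq": 0}
        stats[w]["freq"] += 1
    max_freq = max(s["freq"] for s in stats.values())
    # Earlier first occurrence and higher frequency -> lower (better) score
    return {w: (s["first_pos"] / len(words)) / (s["freq"] / max_freq)
            for w, s in stats.items()}

stops = {"is", "a"}
scores = toy_scores("YAKE is a keyword extractor. YAKE scores keyword candidates statistically.", stops)
for w in sorted(scores, key=scores.get):
    print(w, round(scores[w], 3))
```

Frequent, early-appearing terms like "yake" and "keyword" end up with the lowest (best) scores, mirroring the lower-is-better convention visible in the real YAKE output below.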
Implementation of YAKE model Using Python
- Importing the KeywordExtractor class: from the yake module, import the KeywordExtractor class, which is used for extracting keywords from text.
- Creating a KeywordExtractor instance: create an instance of the KeywordExtractor class, which will be used to extract keywords.
- Defining the text: define a sample text from which keywords will be extracted.
- Extracting keywords: use the extract_keywords method of the kw_extractor instance to extract keywords from the text. The extracted keywords are stored in the keywords variable.
- Printing the keywords: iterate over the keywords list and print each keyword along with its score (lower scores mean more relevant keywords).
#installation
!pip install yake
from yake import KeywordExtractor
# Create a KeywordExtractor instance
kw_extractor = KeywordExtractor()
# Text from which keywords will be extracted
text = "YAKE (Yet Another Keyword Extractor) is a Python library for extracting keywords from text."
# Extract keywords from the text
keywords = kw_extractor.extract_keywords(text)
# Print the extracted keywords and their scores
for kw in keywords:
    print("Keyword:", kw[0], "Score:", kw[1])
Output:
Keyword: Keyword Extractor Score: 0.010798580847666514
Keyword: Python library Score: 0.017391962598404163
Keyword: extracting keywords Score: 0.03240529839631463
Keyword: YAKE Score: 0.034026762452505785
Keyword: library for extracting Score: 0.03498702377830618
Keyword: keywords from text Score: 0.046998648139851405
Keyword: Extractor Score: 0.06257809066078279
Keyword: Python Score: 0.0929767246050301
Keyword: text Score: 0.11246769819744629
Keyword: Keyword Score: 0.17071817227943928
Keyword: keywords Score: 0.17071817227943928
Keyword: library Score: 0.1838594885424691
Keyword: extracting Score: 0.1838594885424691
Keyword Extraction using spaCy
spaCy is an open-source library for advanced natural language processing (NLP) in Python. It's designed to be fast, efficient, and user-friendly, making it suitable for building applications that process and understand large volumes of text.
- Tokenization: Split the input text into tokens using spaCy's tokenization capabilities.
- POS Tagging: Identify the Part-Of-Speech (POS) tag for each token.
- Filtering: Remove stop words and punctuation from the token list.
- Selection: Keep tokens with POS tags such as "PROPN", "ADJ", or "NOUN" as potential keywords.
- Frequency Analysis: Determine how often each potential keyword occurs.
- Top Keywords: Select the N most frequent keywords from the list.
- Print Results: Display the selected keywords.
Implementation of spaCy Model Using Python
- Importing libraries: import spacy imports the spaCy library, from collections import Counter imports the Counter class from the collections module, and from string import punctuation imports the punctuation constant from the string module.
- Loading the spaCy model: nlp = spacy.load("en_core_web_sm") loads the English language model en_core_web_sm provided by spaCy.
- Defining the get_hotwords function: this function takes a text as input, tokenizes it using spaCy, and filters out tokens that are stop words or punctuation. It then selects tokens that are proper nouns, adjectives, or nouns and returns them as a list.
- Processing text: the input text is processed with get_hotwords, and the result is stored in the output variable as a set to remove duplicates.
- Finding the most common hotwords: Counter(output).most_common(10) builds a Counter from the output set and selects up to 10 hotwords. (Because duplicates were already removed, every count is 1 here; to rank by true frequency, build the Counter from the list before converting it to a set.)
- Printing the hotwords: the for loop iterates over most_common_list and prints each hotword.
#installation
!pip3 install spacy
import spacy
from collections import Counter
from string import punctuation
nlp = spacy.load("en_core_web_sm")
def get_hotwords(text):
    result = []
    pos_tag = ['PROPN', 'ADJ', 'NOUN']
    doc = nlp(text.lower())
    for token in doc:
        if token.text in nlp.Defaults.stop_words or token.text in punctuation:
            continue
        if token.pos_ in pos_tag:
            result.append(token.text)
    return result
text = """
spaCy is an open-source natural language processing library for Python.
"""
output = set(get_hotwords(text))
most_common_list = Counter(output).most_common(10)
for item in most_common_list:
    print(item[0])
Output:
library
processing
open
spacy
python
source
natural
language
Conclusion
Keyword extraction is a crucial task in natural language processing (NLP) that identifies the most relevant words or phrases in a text, providing insight into its content. This article covered the basics of keyword extraction, its significance in NLP, and several implementation methods using Python libraries such as PyTextRank, KeyBERT, rake_nltk, YAKE, and spaCy.