Python | Extractive Text Summarization using Gensim

Summarization condenses a large body of text down to its most important information, which makes it useful across many text-processing applications. With the explosion of information on the web, Python offers some handy tools to help summarize a text. Approaches fall into two major categories: extractive, which selects existing sentences from the text, and abstractive, which generates new sentences. In this article, we shall look at a working example of extractive summarization.

Algorithm :
The gensim library implements an algorithm called “TextRank”, which is based on the PageRank algorithm for ranking search results. Its steps are:

  1. Pre-process the given text. This includes stop-word removal, punctuation removal and stemming.
  2. Build a graph in which the sentences are the vertices.
  3. Connect the vertices with edges weighted by the similarity between the two sentences they join.
  4. Run the PageRank algorithm on this weighted graph.
  5. Pick the highest-scoring vertices and append them to the summary.
  6. The number of vertices picked is determined by the requested ratio or word count.
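To make the steps above concrete, here is a minimal pure-Python sketch of the same idea. It is not gensim's implementation: the tiny stop-word list, the naive sentence splitter, the word-overlap similarity measure, and the function names (`split_sentences`, `tokenize`, `similarity`, `pagerank`, `textrank_summary`) are all assumptions chosen for illustration, and stemming is left out.

```python
import math
import re

# A tiny illustrative stop-word list (an assumption; real pipelines use a larger one).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "on", "the", "to", "was"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Step 1: lowercase, drop punctuation and stop words (no stemming here)."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return {w for w in words if w not in STOP_WORDS}


def similarity(a, b):
    """Steps 2-3: edge weight = shared words, normalized by sentence lengths."""
    if len(a) < 2 or len(b) < 2:
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def pagerank(weights, damping=0.85, iterations=50):
    """Step 4: weighted PageRank computed by power iteration."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def textrank_summary(text, ratio=0.2):
    """Steps 5-6: keep the top-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(weights)
    keep = max(1, round(n * ratio))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]
```

Calling `textrank_summary(text, ratio=0.5)` on a short text returns roughly half of its sentences, favouring those that share the most vocabulary with the rest. A sentence with no words in common with any other ends up with the lowest score.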

Code : Summarizes a Wikipedia article based on (a) ratio and (b) word count.



# Note: the gensim.summarization module was removed in gensim 4.0,
# so this code requires gensim < 4.0.
from gensim.summarization.summarizer import summarize
import wikipedia

# Get wiki content.
wikisearch = wikipedia.page("Amitabh Bachchan")
wikicontent = wikisearch.content

# Save the wiki content to a file
# (for reference).
with open("wikicontent.txt", "w", encoding="utf-8") as f:
    f.write(wikicontent)

# Summary (5% of the original content).
summ_per = summarize(wikicontent, ratio=0.05)
print("Percent summary")
print(summ_per)

# Summary (200 words)
summ_words = summarize(wikicontent, word_count=200)
print("Word count summary")
print(summ_words)



Output :

Percent summary
Amitabh Bachchan (pronounced [?m??ta?b? ?b?t???n]; born Inquilaab Srivastava;
11 October 1942) is an Indian film actor, film producer, television host, 
occasional playback singer and former politician. He first gained popularity
in the early 1970s for films such as Zanjeer, Deewaar and Sholay, and was
dubbed India's "angry young man" for his on-screen roles in Bollywood.
.
.
.
Apart from National Film Awards, Filmfare Awards and other competitive awards
which Bachchan won for his performances throughout the years, he has been 
awarded several honours for his achievements in the Indian film industry.
Word count summary
Beyond the Indian subcontinent, he also has a large overseas following 
in markets including Africa (such as South Africa), the Middle East 
(especially Egypt), United Kingdom, Russia and parts of the United 
States. Bachchan has won numerous accolades in his career, including 
four National Film Awards as Best Actor and many awards at 
international film festivals and award ceremonies.
.
.
.
After a three year stint in politics from 1984 to 1987, Bachchan 
returned to films in 1988, playing the title role in Shahenshah, 
which was a box office success.
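To check how closely a summary matches the requested ratio or word budget, a small helper like the one below can be useful. This is only a sketch: `summary_stats` is a hypothetical name, and its sentence splitter is a naive regular expression rather than the one gensim uses, so the figures it reports are approximate.

```python
import re


def summary_stats(original, summary):
    """Compare a summary with its source text (hypothetical helper)."""

    def count_sentences(text):
        # Naive split after '.', '!' or '?' followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        return len([p for p in parts if p])

    original_sentences = count_sentences(original)
    ratio = (
        count_sentences(summary) / original_sentences
        if original_sentences
        else 0.0
    )
    return {
        "sentence_ratio": ratio,               # share of sentences kept
        "summary_words": len(summary.split()),  # whitespace-separated words
    }
```

For example, `summary_stats(wikicontent, summ_per)` should report a sentence ratio near 0.05, and `summary_stats(wikicontent, summ_words)` a word count near 200, since the summarizer works with whole sentences and cannot hit either target exactly.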

