Extract feed details from RSS in Python

Last Updated : 24 Jan, 2021

In the article, we will be seeing how to extract feed and post details using RSS feed for a Hashnode blog. Although we are going to use it for blogs on Hashnode it can be used for other feeds as well.

RSS means Rich Site Summary and uses standard web formats to publish information that changes frequently like blog posts, news, audio, video, etc. RSS documents often know as feed which consists of text, and metadata, like time and author’s name.

Installing feed parser:

We will be using the Feedparser python library for parsing the RSS feed of the blog. It is quite a popular library for parsing blog feeds.

pip install feedparser

Let’s understand this stepwise:

Step 1: Getting RSS feed

Use the feedparser.parse() function for creating a feed object which contains parsed blog. It takes the URL of the blog feed.

Python3

# url of blog feed 
feed_url = "https://vaibhavkumar.hashnode.dev/rss.xml"
  
blog_feed = feedparser.parse(feed_url) 

Step 2: Getting details from the blog.

Python3

# returns title of the blog site 
blog_feed.feed.title  
  
# returns the link of the blog 
# and number of entries(blogs) in the site. 
blog_feed.feed.link 
len(blog_feed.entries) 
  
# Details of individual blog can 
# be accessed by using attribute name 
print(blog_feed.entries[0].title) 
print(blog_feed.entries[0].link) 
print(blog_feed.entries[0].author) 
print(blog_feed.entries[0].published) 
  
# Getting lists of tags and authors. 
tags = [tag.term for tag in blog_feed.entries[0].tags] 
authors= [author.name for author in blog_feed.entries[0].authors]

Below is the full implementation: Now use the above code to write a function that takes the link of RSS feed and returns the details.

Python3

def get_posts_details(rss=None): 
    
    """ 
    Take link of rss feed as argument 
    """
    if rss is not None: 
        
          # import the library only when url for feed is passed 
        import feedparser 
          
        # parsing blog feed 
        blog_feed = blog_feed = feedparser.parse(rss) 
          
        # getting lists of blog entries via .entries 
        posts = blog_feed.entries 
          
        # dictionary for holding posts details 
        posts_details = {"Blog title" : blog_feed.feed.title, 
                        "Blog link" : blog_feed.feed.link} 
          
        post_list = [] 
          
        # iterating over individual posts 
        for post in posts: 
            temp = dict() 
              
            # if any post doesn't have information then throw error. 
            try: 
                temp["title"] = post.title 
                temp["link"] = post.link 
                temp["author"] = post.author 
                temp["time_published"] = post.published 
                temp["tags"] = [tag.term for tag in post.tags] 
                temp["authors"] = [author.name for author in post.authors] 
                temp["summary"] = post.summary 
            except: 
                pass
              
            post_list.append(temp) 
          
        # storing lists of posts in the dictionary 
        posts_details["posts"] = post_list  
          
        return posts_details # returning the details which is dictionary 
    else: 
        return None
  
if __name__ == "__main__": 
  import json 
  
  feed_url = "https://vaibhavkumar.hashnode.dev/rss.xml"
  
  data = get_posts_details(rss = feed_url) # return blogs data as a dictionary 
    
  if data: 
    # printing as a json string with indentation level = 2 
    print(json.dumps(data, indent=2))  
  else: 
    print("None")