How to create a bar chart and save in pptx using Python?
  • Last Updated : 12 Nov, 2020

The World Wide Web holds vast amounts of data that keep growing in both quantity and quality. Python lets us collect data of interest from the web through APIs. An API is a very useful tool for data scientists, web developers, and even any casual person who wants to find and extract information programmatically.

API vs Web Scraping

Most websites provide APIs to share data in a structured format; however, they typically restrict what data is available and may also limit how frequently it can be accessed. Additionally, the website's developers might change, remove, or restrict the backend API at any time.

On the other hand, some websites do not provide an API at all. In short, we cannot always rely on APIs to access the online data we want, and may need to fall back on web scraping techniques.

Python version

When it comes to working with APIs effectively, Python is usually the programming language of choice. It is an easy-to-use language with a very rich ecosystem of tools for many tasks. If you program in other languages, you will find Python easy to pick up, and you may never go back.

The Python Software Foundation announced that Python 2 would be phased out of development and support in 2020. For this reason, we will use Python 3 and Jupyter Notebook throughout this post. To be specific, my Python version is:



Python3


from platform import python_version

print(python_version())

Output

3.6.10

Structure of Target website

Before attempting to access the content of a website through an API or a web crawler, we should always develop an understanding of the structure of our target website. The sitemap and robots.txt of a website give us some vital information, as do external tools such as Google Search and WHOIS.

Validating robots.txt file

Most websites define a robots.txt file to inform users about restrictions on accessing their website. These restrictions are guidelines only, but it is highly recommended to respect them. You should always check the contents of robots.txt to understand the structure of the website and minimize the chance of being blocked.

The robots.txt file is a valuable resource to validate before taking a decision to write a web crawler program or to use an API.
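Checking a path against a robots.txt file can even be automated. The snippet below is a minimal sketch using Python's standard-library urllib.robotparser with a small inline sample file (not GitHub's real robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt given as a list of lines (an illustrative sample here)
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# can_fetch() answers whether a given user agent may crawl a URL
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```

For a live site you would instead call `rp.set_url("https://github.com/robots.txt")` followed by `rp.read()`.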

Understanding the problem

In this post, we will gather the JavaScript repositories with the highest stars from GitHub, famously known as the developers' Facebook. So let me first check out its robots.txt file.

The following content (first few lines only) is from the robots.txt file of the website – https://github.com/robots.txt.

From the file, it is clear that GitHub wants its contents accessed through an API. One way of solving our problem is to type our search criteria into the GitHub search box and press enter; however, that is a manual activity.



Helpfully, GitHub exposes this search capability as an API we can consume from our own applications. GitHub's Search API gives us access to the built-in search function, including logical and scoping operators such as "or" and "user".

Before we jump into the code, there is something you should know about public repositories, private repositories, and access restrictions. Public repositories are usually open to the public with no restrictions while private repositories are restricted only to the owners and to the collaborators they choose.

Step 1: Validating with cURL.

Now let's quickly validate access to GitHub before putting effort into writing any code. cURL, a simple command-line HTTP tool, is a perfect fit for this. cURL comes pre-installed on most Linux machines; if not, you can easily install it with – yum install curl

For windows, get a copy from “https://curl.haxx.se/download.html”.

Now run the command as shown below:
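The exact command from the original screenshot is not reproduced here, but a request along the following lines (a sketch against GitHub's public repository-search endpoint, the one used throughout this post) returns the headers discussed next:

```shell
# HEAD request to GitHub's repository-search endpoint;
# -I asks cURL to print only the response headers
curl -I "https://api.github.com/search/repositories?q=language:javascript&sort=stars"
```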

cURL gives us a lot of information:

  1. HTTP/1.1 200 OK – When the request URL and associated parameters are correct, GitHub responds with a 200 (Success) status code.
  2. X-RateLimit-Limit – The maximum number of requests you're permitted to make per hour.
  3. X-RateLimit-Remaining – The number of requests remaining in the current rate limit window.
  4. X-RateLimit-Reset – The time at which the current rate limit window resets, in UTC epoch seconds.
  5. repository_search_url – This is the one we will be using in this post to query the repositories.
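Since X-RateLimit-Reset is reported in UTC epoch seconds, converting it to a readable timestamp takes a single standard-library call (the epoch value below is illustrative, not from a real response):

```python
from datetime import datetime, timezone

# Illustrative value of the X-RateLimit-Reset header (UTC epoch seconds)
reset_epoch = 1605168000

# Convert epoch seconds to an aware UTC datetime
reset_time = datetime.fromtimestamp(reset_epoch, tz=timezone.utc)
print(reset_time.isoformat())  # 2020-11-12T08:00:00+00:00
```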

Step 2: Authentication

Usually, there are a couple of ways to authenticate when making a request to the GitHub API – using a username and password (HTTP Basic) or using OAuth tokens. The authentication details will not be covered in this post.



Since GitHub allows us to access public content without any authentication, we will stick to searching public repositories. In other words, our program will not require authentication, so it will search public repositories only.

Step 3: Github Response with Python

Python3


# 1 - imports
import requests

# 2 - set the siteurl
site_url = 'https://api.github.com/search/repositories?q=language:javascript&sort=stars'

# 3 - set the headers
headers = {'Accept': 'application/vnd.github.v3+json'}

# 4 - call the url with headers and save the response
response = requests.get(site_url, headers=headers)

# 5 - Get the response status
print(f"Response from {site_url} is {response.status_code} ")



Output:

We started by importing requests (if it's missing, install it with pip install requests) and then assigned the variable site_url the URL of our interest – here, a search for JavaScript repositories sorted (descending) by maximum stars.

GitHub is currently on the third version of its API, so we defined headers for the API call that explicitly ask for the 3rd version. Feel free to check out the latest version here – https://docs.github.com/en/free-pro-team@latest/developers/overview/about-githubs-apis.

We then call get() and pass it site_url and the headers; the returned object is assigned to the response variable. The response from GitHub is JSON. The response object has an attribute status_code, which tells us whether the response was successful (200) or not.

Step 4: Converting JSON response to Python dictionary

Python3


response_json = response.json()
print(f"keys in the Json file : {response_json.keys()}")
print(f"Total javascript repositories in GitHub : {response_json['total_count']}" )



Output:

As mentioned earlier, the response is JSON. It has three keys, of which we can ignore "incomplete_results" for such a small query. The program output displays the total number of repositories GitHub returned for our search via response_json['total_count'].

Step 5: Looking at our first repository

Python3


repositories = response_json['items']
first_repo = repositories[0]

print(f"Output \n  *** Repository information keys total - {len(first_repo)} - values are -\n")
for key in sorted(first_repo.keys()):
    print(key)

print(f" *** Repository name - {first_repo['name']}, Owner - {first_repo['owner']['login']}, total watchers - {first_repo['watchers_count']} ")



Output:

The above code is self-explanatory: we display all the keys inside the dictionary and then print information about our first repository.

Step 6: Loop for more…

We have looked at one repository; for more, we obviously need a loop.

Python3


for repo_info in repositories:
    print(f"\n *** Repository Name: {repo_info['name']}")
    print(f" *** Repository Owner: {repo_info['owner']['login']}")
    print(f" *** Repository Description: {repo_info['description']}")



Output:

Step 7: Visualization with Plotly

Time to visualize the data we now have to show the popularity of JavaScript projects on GitHub. Digesting information visually is always helpful.

Before using Plotly, you need to install the package. Run this command in the terminal:

pip install plotly

Code:

Python3


# imports
import requests
from plotly.graph_objs import Bar
from plotly import offline

# siteurl and headers
site_url = 'https://api.github.com/search/repositories?q=language:javascript&sort=stars'
headers = {'Accept': 'application/vnd.github.v3+json'}

# response and parsing the response
response = requests.get(site_url, headers=headers)
response_json = response.json()

repositories = response_json['items']

# loop the repositories
repo_names, repo_stars = [], []
for repo_info in repositories:
    repo_names.append(repo_info['name'])
    repo_stars.append(repo_info['stargazers_count'])

# graph plotting
data_plots = [{'type': 'bar', 'x': repo_names, 'y': repo_stars}]
layout = {'title': "GitHub's Most Popular JavaScript Projects",
          'xaxis': {'title': 'Repository'},
          'yaxis': {'title': 'Stars'}}

# saving graph to Most_Popular_JavaScript_Repos.png
fig = {'data': data_plots, 'layout': layout}
offline.plot(fig, image='png', image_filename='Most_Popular_JavaScript_Repos')



When executed, the above code saves the bar chart to a PNG file – Most_Popular_JavaScript_Repos.png – in the current directory.

Step 8: Creating a Presentation… Introduction…

Microsoft products, especially spreadsheets and PowerPoint presentations, are ruling the office world. So we are going to create a PowerPoint presentation with the visualization graph we just created.

For installing python-pptx run this code into the terminal:

pip install python-pptx

We will begin by creating our first slide with the title – "Popular JavaScript Repositories in GitHub".

Python3


from pptx import Presentation 
  
# create an object ppt
ppt = Presentation()
  
# add a new slide
slide = ppt.slides.add_slide(ppt.slide_layouts[0]) 
  
# Set the Text to 
slide.shapes.title.text = "Popular JavaScript Repositories in GitHub" 
  
# save the powerpoint
ppt.save('Javascript_report.pptx')



Output:

We first imported Presentation from pptx, then created a ppt object using the Presentation class. A new slide is added with the add_slide() method, and the title text is set via slide.shapes.title.

Step 9: Saving the chart to pptx.

Now that the basics of creating a PowerPoint are covered in the steps above, let's dive into the final piece of code to create the report.

Python3


from pptx import Presentation
from pptx.util import Inches
from datetime import date

# create an object
ppt = Presentation()
first_slide = ppt.slides.add_slide(ppt.slide_layouts[0])

# title (includes the date)
title = "Popular JavaScript Repositories in GitHub - " + str(date.today())

# set the title on the first slide
first_slide.shapes[0].text_frame.paragraphs[0].text = title

# slide 2 - set the image
img = 'Most_Popular_JavaScript_Repos.png'
second_slide = ppt.slide_layouts[1]
slide2 = ppt.slides.add_slide(second_slide)

# tweak the image attributes if you are not OK with the height and width
pic = slide2.shapes.add_picture(img, left=Inches(2), top=Inches(1), height=Inches(5))

# save the powerpoint presentation
ppt.save('Javascript_report.pptx')



Output:

Finally, we will put all the above steps discussed in a single program.

Python3


import requests
from plotly.graph_objs import Bar
from plotly import offline
from pptx import Presentation
from pptx.util import Inches
from datetime import date

def github_api():
    # siteurl and headers
    site_url = 'https://api.github.com/search/repositories?q=language:javascript&sort=stars'
    headers = {'Accept': 'application/vnd.github.v3+json'}

    # response and parsing the response
    response = requests.get(site_url, headers=headers)
    response_json = response.json()

    repositories = response_json['items']

    # loop the repositories
    repo_names, repo_stars = [], []
    for repo_info in repositories:
        repo_names.append(repo_info['name'])
        repo_stars.append(repo_info['stargazers_count'])

    # graph plotting
    data_plots = [{'type': 'bar', 'x': repo_names, 'y': repo_stars}]
    layout = {'title': "GitHub's Most Popular JavaScript Projects",
              'xaxis': {'title': 'Repository'},
              'yaxis': {'title': 'Stars'}}

    # saving graph to Most_Popular_JavaScript_Repos.png
    fig = {'data': data_plots, 'layout': layout}
    offline.plot(fig, image='png', image_filename='Most_Popular_JavaScript_Repos')

def create_pptx_report():
    # create an object
    ppt = Presentation()
    first_slide = ppt.slides.add_slide(ppt.slide_layouts[0])

    # title (includes the date)
    title = "Popular JavaScript Repositories in GitHub - " + str(date.today())

    # set the title on the first slide
    first_slide.shapes[0].text_frame.paragraphs[0].text = title

    # slide 2 - set the image
    img = 'Most_Popular_JavaScript_Repos.png'
    second_slide = ppt.slide_layouts[1]
    slide2 = ppt.slides.add_slide(second_slide)

    # tweak the image attributes if you are not OK with the height and width
    pic = slide2.shapes.add_picture(img, left=Inches(2), top=Inches(1), height=Inches(5))

    # save the powerpoint presentation
    ppt.save('Javascript_report.pptx')

if __name__ == '__main__':
    github_api()
    create_pptx_report()


