The World Wide Web holds a large amount of data, and it keeps growing in both quantity and quality. With Python and a web API, we can collect data of interest from the web. APIs are a very useful tool for data scientists, web developers, and anyone else who wants to find and extract information programmatically.
API vs Web Scraping
Most websites provide APIs to share data in a structured format. However, they typically restrict the data that is available and may also limit how frequently it can be accessed.
On the other hand, some websites do not provide an API to share their data at all. Even where an API exists, the website's development team can change, remove, or restrict it at any time. In short, we cannot always rely on APIs to access the online data we want, and in those cases we may need to fall back on web scraping techniques.
When it comes to working with APIs effectively, Python is usually the programming language of choice. It is an easy-to-use language with a very rich ecosystem of tools for many tasks. If you program in other languages, you will find it easy to pick up Python, and you may never go back.
The Python Software Foundation announced that Python 2 would be phased out of development and support in 2020. For this reason, we will use Python 3 and Jupyter Notebook throughout the post. To be more specific, my Python version is:
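A minimal way to check which interpreter version a notebook or script is running on, using only the standard library. The output depends on your installation, so none is shown here.

```python
import sys

# sys.version_info is a named tuple: (major, minor, micro, releaselevel, serial)
major, minor, micro = sys.version_info[:3]
version_string = "{}.{}.{}".format(major, minor, micro)
print("Python version:", version_string)

# Stop early if the code is accidentally run under Python 2.
if major < 3:
    raise RuntimeError("This post assumes Python 3")
```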
Structure of Target website
Before attempting to access the content of a website through an API or web crawling, we should always develop an understanding of the structure of our target website. The sitemap and robots.txt of a website give us some vital information, in addition to external tools such as Google Search and WHOIS.
Validating robots.txt file
Most websites define a robots.txt file to tell users about the restrictions that apply when accessing the site. These restrictions are guidelines only, but respecting them is highly recommended. You should always check and respect the contents of robots.txt to understand the structure of the website and minimize the chance of being blocked.
The robots.txt file is a valuable resource to check before deciding whether to write a web crawler or to use an API.
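As a rough sketch of how such a check could be automated, the standard library's urllib.robotparser can answer "may this user agent fetch this URL?". The robots.txt content, the user agent name my-crawler, and the example.com URLs below are all made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-crawler" is a made-up user agent name.
public_ok = parser.can_fetch("my-crawler", "https://example.com/public/page")
private_ok = parser.can_fetch("my-crawler", "https://example.com/private/data")
print("public page allowed:", public_ok)
print("private page allowed:", private_ok)
```

To check a live site instead, the same parser offers set_url() followed by read(), which downloads the file before can_fetch() is called.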
Understanding the problem
The following content (first few lines only) is from the robots.txt file of the website – https://github.com/robots.txt.
From the file, it is clear that GitHub wants us to access its content through an API. One way of solving our problem is to type our search criteria into the GitHub search box and press Enter; however, that is a manual activity.
Helpfully, GitHub exposes this search capability as an API we can consume from our own applications. GitHub’s Search API gives us access to the built-in search function. This includes the use of logical and scoping operators, like “or” and “user”.
Before we jump into the code, there is something you should know about public repositories, private repositories, and access restrictions. Public repositories are usually open to the public with no restrictions while private repositories are restricted only to the owners and to the collaborators they choose.
Step 1: Validating with cURL.
Now let’s quickly validate our access to GitHub before putting effort into writing code against the API. For that, cURL, a simple command-line HTTP tool, is a perfect fit. cURL is usually installed on most Linux machines; if it is not, you can easily install it using: yum install curl
For Windows, get a copy from https://curl.haxx.se/download.html.
Now run the command as shown below:
cURL has given us a lot of information:
- HTTP/1.1 200 OK – The status code. When your request URL and associated parameters are correct, GitHub responds with a 200 status (Success).
- X-RateLimit-Limit – The maximum number of requests you’re permitted to make per hour.
- X-RateLimit-Remaining – The number of requests remaining in the current rate limit window.
- X-RateLimit-Reset – The time at which the current rate limit window resets, in UTC epoch seconds.
- “repository_search_url”: This is the one we will be using in this post to query the repositories.
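The rate-limit headers listed above can also be read from Python. The sketch below parses them from a plain dictionary; the header values are invented for illustration, and parse_rate_limit is a hypothetical helper name, not part of any library.

```python
from datetime import datetime, timezone


def parse_rate_limit(headers):
    """Extract the rate-limit fields from a mapping of response headers."""
    limit = int(headers["X-RateLimit-Limit"])
    remaining = int(headers["X-RateLimit-Remaining"])
    # The reset time is given in UTC epoch seconds.
    reset_at = datetime.fromtimestamp(
        int(headers["X-RateLimit-Reset"]), tz=timezone.utc
    )
    return {"limit": limit, "remaining": remaining, "reset_at": reset_at}


# Hypothetical header values, for illustration only.
sample_headers = {
    "X-RateLimit-Limit": "60",
    "X-RateLimit-Remaining": "59",
    "X-RateLimit-Reset": "1600000000",
}

info = parse_rate_limit(sample_headers)
print(info["remaining"], "of", info["limit"], "requests left")
print("Window resets at", info["reset_at"].isoformat())
```

A plain dictionary is case-sensitive, so the keys must match exactly. The headers mapping on a requests response object is case-insensitive and can be passed to the same function.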
Step 2: Authentication
Usually, there are a couple of ways to authenticate when making a request to the GitHub API: using a username and password (HTTP Basic) or using OAuth tokens. The authentication details will not be covered in this post.
Since GitHub allows us to access public content without any authentication, our program will make unauthenticated API requests, which means we will be searching public repositories only.
Step 3: GitHub Response with Python
GitHub is currently on the third version of its API, so we define headers for the API call that explicitly ask for version 3 of the API. Feel free to check out the latest version here – https://docs.github.com/en/free-pro-team@latest/developers/overview/about-githubs-apis.
Then we call get(), passing it the site_url and the headers, and assign the returned object to the response variable. The GitHub API returns its response as JSON. The response object has a status_code attribute, which tells us whether the request was successful (200) or not.
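A minimal sketch of this step, assuming the third-party requests package and a search for Python repositories sorted by stars. The query language:python, the sort parameters, the 10-second timeout, and the helper names build_search_request and fetch are assumptions made for illustration; the names site_url, headers, and response follow the text above.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_request(query, sort="stars", order="desc"):
    """Return (site_url, headers) for a repository search."""
    params = urlencode({"q": query, "sort": sort, "order": order})
    site_url = "{}?{}".format(SEARCH_URL, params)
    # Explicitly ask for version 3 of the API.
    headers = {"Accept": "application/vnd.github.v3+json"}
    return site_url, headers


def fetch(site_url, headers):
    """Send the request and report the HTTP status code."""
    # Imported here so the URL helper above works without the package installed.
    import requests  # third-party: pip install requests

    response = requests.get(site_url, headers=headers, timeout=10)
    print("Status code:", response.status_code)
    return response


# Assumed search criteria: repositories written in Python.
site_url, headers = build_search_request("language:python")
print(site_url)
```

Calling fetch(site_url, headers) performs the actual network request; it is left out of the top-level code so the sketch can be read and run offline.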
Step 4: Converting JSON response to Python dictionary
As mentioned earlier, the response is JSON. Our JSON has three keys, of which we can ignore “incomplete_results” for such a small request. The program prints the total number of GitHub repositories returned for our search, taken from response_json['total_count'].
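To illustrate the conversion without a network call, the sketch below parses a heavily trimmed, made-up response body with the standard json module. The repository names and counts are invented; only the three top-level keys mirror the real response.

```python
import json

# Hypothetical, heavily trimmed response body; real responses carry many more fields.
sample_body = """
{
  "total_count": 2,
  "incomplete_results": false,
  "items": [
    {"name": "example-repo-a", "stargazers_count": 120},
    {"name": "example-repo-b", "stargazers_count": 45}
  ]
}
"""

# json.loads turns the JSON text into ordinary Python dictionaries and lists.
response_json = json.loads(sample_body)

print("Top-level keys:", list(response_json.keys()))
print("Total repositories:", response_json["total_count"])
```

When the response comes from the requests package, calling response.json() performs the same conversion.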
Step 5: Looking at our first repository
The above code is self-explanatory. What we are doing is displaying all the keys inside the dictionary and then displaying information on our first repository.
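A rough sketch of what inspecting the first repository might look like. The dictionary below is a made-up, trimmed stand-in for one search result; the field names exist in GitHub's repository objects, but every value is invented.

```python
# Hypothetical, trimmed search result; a real one carries many more keys.
response_json = {
    "total_count": 1,
    "incomplete_results": False,
    "items": [
        {
            "name": "example-repo",
            "owner": {"login": "example-user"},
            "stargazers_count": 120,
            "html_url": "https://github.com/example-user/example-repo",
            "description": "An invented repository used for illustration",
        }
    ],
}

# "items" holds one dictionary per repository.
repositories = response_json["items"]
first_repo = repositories[0]

# Show every key available on a repository.
print("Number of keys:", len(first_repo))
for key in sorted(first_repo):
    print(key)

# Show a few selected fields of the first repository.
print("Name:", first_repo["name"])
print("Owner:", first_repo["owner"]["login"])
print("Stars:", first_repo["stargazers_count"])
print("URL:", first_repo["html_url"])
print("Description:", first_repo["description"])
```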
Step 6: Loop for more…
We have looked at one repository; to see more, we need to loop over the list of results.
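One possible shape for that loop, again on invented data. The helper name summarize and the sample repositories are hypothetical.

```python
def summarize(repositories):
    """Return one (name, owner, stars) tuple per repository dictionary."""
    rows = []
    for repo in repositories:
        rows.append(
            (repo["name"], repo["owner"]["login"], repo["stargazers_count"])
        )
    return rows


# Hypothetical, trimmed repository dictionaries.
sample_repositories = [
    {"name": "example-repo-a", "owner": {"login": "alice"}, "stargazers_count": 120},
    {"name": "example-repo-b", "owner": {"login": "bob"}, "stargazers_count": 45},
]

rows = summarize(sample_repositories)
for name, owner, stars in rows:
    print("{} by {}: {} stars".format(name, owner, stars))
```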
Step 7: Visualization with Plotly
Before using Plotly, you need to install the package. To install it, run this command in the terminal:
pip install plotly
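A minimal sketch of a bar chart of stars per repository. The figure is described as a plain dictionary in Plotly's figure format and only handed to Plotly when it is saved. The chart title, axis labels, file name, sample data, and the helper names build_bar_figure and save_chart are assumptions for illustration.

```python
def build_bar_figure(names, stars):
    """Describe a bar chart as a plain dictionary in Plotly's figure format."""
    return {
        "data": [{"type": "bar", "x": list(names), "y": list(stars)}],
        "layout": {
            "title": {"text": "Most-starred repositories"},
            "xaxis": {"title": {"text": "Repository"}},
            "yaxis": {"title": {"text": "Stars"}},
        },
    }


def save_chart(figure_spec, path="repos.html"):
    """Render the figure dictionary with Plotly and write it to an HTML file."""
    # Imported here so the spec builder above works without Plotly installed.
    import plotly.graph_objects as go  # third-party: pip install plotly

    fig = go.Figure(figure_spec)
    fig.write_html(path)
    return path


# Hypothetical data, for illustration only.
figure_spec = build_bar_figure(["example-repo-a", "example-repo-b"], [120, 45])
print(figure_spec["data"][0]["type"], "chart with", len(figure_spec["data"][0]["x"]), "bars")
```

Calling save_chart(figure_spec) writes an interactive HTML file that can be opened in a browser.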
Step 8: Creating a Presentation… Introduction…
Microsoft products, especially spreadsheets and PowerPoint presentations, are used everywhere. So we are going to create a PowerPoint presentation with the visualization we just created.
To install python-pptx, run this command in the terminal:
pip install python-pptx
We first import Presentation from pptx and then create a presentation object using the Presentation class of the pptx module. A new slide is added with the add_slide() method, and the text is added through slide.shapes.
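The steps just described could be sketched as follows. The slide texts, the file name report.pptx, the sample data, and the helper names plan_slides and write_presentation are assumptions; the layout indexes refer to python-pptx's default template.

```python
def plan_slides(repositories):
    """Return (title, body) pairs: one title slide plus one slide per repository."""
    slides = [("GitHub repository report", "Generated with python-pptx")]
    for repo in repositories:
        body = "{} stars".format(repo["stargazers_count"])
        slides.append((repo["name"], body))
    return slides


def write_presentation(slides, path="report.pptx"):
    """Create one slide per (title, body) pair and save the file."""
    # Imported here so plan_slides works without python-pptx installed.
    from pptx import Presentation  # third-party: pip install python-pptx

    prs = Presentation()
    for index, (title, body) in enumerate(slides):
        # Default template: layout 0 is "Title Slide", layout 1 is "Title and Content".
        layout = prs.slide_layouts[0 if index == 0 else 1]
        slide = prs.slides.add_slide(layout)
        slide.shapes.title.text = title
        # Placeholder 1 is the subtitle or the content body, depending on the layout.
        slide.placeholders[1].text = body
    prs.save(path)
    return path


# Hypothetical data, for illustration only.
slides = plan_slides([{"name": "example-repo-a", "stargazers_count": 120}])
print(slides)
```

Calling write_presentation(slides) produces the .pptx file. Keeping the slide text separate from the rendering makes the content easy to inspect before any file is written.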
Step 9: Saving the chart to pptx.
The basics of creating a PowerPoint presentation are covered in the steps above. Now let’s dive into the final piece of code to create a report.
Finally, we will put all the steps discussed above into a single program.
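One way the chart might be placed on a slide, assuming it has already been exported as an image file (for example with Plotly's write_image, which needs an extra image-export package such as kaleido). The image path, picture width, top margin, and the helper names centered_box and add_chart_slide are assumptions for illustration.

```python
EMU_PER_INCH = 914400  # python-pptx measures lengths in English Metric Units


def centered_box(slide_width, picture_width, top_margin):
    """Return (left, top) in EMU that centers a picture horizontally."""
    left = (slide_width - picture_width) // 2
    return left, top_margin


def add_chart_slide(prs, image_path, title):
    """Add a slide holding a title and a horizontally centered picture."""
    # Imported here so centered_box works without python-pptx installed.
    from pptx.util import Inches  # third-party: pip install python-pptx

    # Default template: layout 5 is "Title Only".
    slide = prs.slides.add_slide(prs.slide_layouts[5])
    slide.shapes.title.text = title
    width = Inches(8)
    left, top = centered_box(prs.slide_width, width, Inches(1.5))
    # Only the width is given, so the height scales to keep the aspect ratio.
    slide.shapes.add_picture(image_path, left, top, width=width)
    return slide


# A 10-inch-wide slide with an 8-inch-wide picture leaves 1 inch on each side.
left, top = centered_box(10 * EMU_PER_INCH, 8 * EMU_PER_INCH, EMU_PER_INCH)
print(left / EMU_PER_INCH, "inch from the left edge")
```

With a Presentation object prs in hand, add_chart_slide(prs, "chart.png", "Stars per repository") followed by prs.save("report.pptx") would write the report; chart.png here stands for whatever image file was exported.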