Fetching text from Wikipedia’s Infobox in Python
An infobox is a template used to collect and present a subset of information about its subject. It can be described as a structured document containing a set of attribute–value pairs, and in Wikipedia it represents a summary of information about the subject of an article.
A Wikipedia infobox is therefore a fixed-format table, usually placed in the top right-hand corner of an article, that summarizes the page and sometimes improves navigation to other related articles.
Web scraping is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in table (spreadsheet) format.
There are several ways to extract information from the web. Using an API is one of the best ways to extract data from a website. Almost all large websites, such as YouTube, Facebook, Google, Twitter, and StackOverflow, provide APIs to access their data in a more structured manner. If you can get what you need through an API, that is almost always the preferred approach over web scraping.
Sometimes you need to scrape the content of a Wikipedia page while developing a project or for use elsewhere. In this article, I'll show how to extract the contents of a Wikipedia infobox.
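As an illustration of the API route, here is a minimal sketch that builds a request URL for Wikipedia's own MediaWiki Action API and reads the wikitext out of the JSON reply. The helper names `lead_wikitext_url` and `wikitext_from_response` are made up for this example, and the sketch assumes the `action=parse` module with `formatversion=2`, where the wikitext comes back as a plain string. It only builds the URL and unpacks a reply; it does not download anything.

```python
from urllib.parse import urlencode

# Endpoint of the MediaWiki Action API for English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def lead_wikitext_url(title):
    """Build a URL asking for the wikitext of an article's lead section.

    Infobox templates normally sit in the lead section (section 0).
    """
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "section": 0,
        "format": "json",
        "formatversion": 2,
    }
    return API_ENDPOINT + "?" + urlencode(params)


def wikitext_from_response(payload):
    """Pull the wikitext string out of a decoded JSON reply (formatversion=2)."""
    return payload["parse"]["wikitext"]
```

The returned wikitext still contains the raw infobox template, so it needs further parsing. The rest of this article takes the scraping route and works on the rendered HTML instead.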
Basically, we can use two Python modules for scraping data:
urllib2: A Python module for fetching URLs. It offers a very simple interface, in the form of the urlopen function, and can fetch URLs using a variety of different protocols. Note that urllib2 exists only in Python 2; in Python 3 its functionality lives in urllib.request and urllib.error. For more detail, refer to the documentation page.
BeautifulSoup: An incredible tool for pulling information out of a web page. You can use it to extract tables, lists, and paragraphs, and you can also apply filters to narrow down what you extract. See the BeautifulSoup documentation page.
BeautifulSoup does not fetch the web page for us, so we can use urllib2 together with the BeautifulSoup library.
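Here is a rough sketch of how these two modules can work together. It is written for Python 3, so it uses `urllib.request` in place of urllib2. The function names `fetch_html` and `infobox_rows` and the User-Agent string are placeholders invented for this example, and it assumes the infobox is a table whose class list includes `infobox`.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page with urllib and return its HTML as text."""
    # Placeholder User-Agent; replace it with one that identifies your script.
    request = Request(url, headers={"User-Agent": "infobox-example/0.1"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


def infobox_rows(page_html):
    """Return {header: value} for the rows of the first infobox table."""
    soup = BeautifulSoup(page_html, "html.parser")
    # class_="infobox" also matches multi-class values such as "infobox vcard".
    table = soup.find("table", class_="infobox")
    rows = {}
    if table is None:
        return rows
    for tr in table.find_all("tr"):
        th, td = tr.find("th"), tr.find("td")
        # Keep only rows that have both a header cell and a value cell.
        if th is not None and td is not None:
            rows[th.get_text(" ", strip=True)] = td.get_text(" ", strip=True)
    return rows
```

With a page downloaded through `fetch_html`, `infobox_rows(html).get("Motto")` would give the text of the Motto row if the infobox has one, and `None` otherwise.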
Now I am going to show you another easy way of scraping.
The steps are as follows:
The modules we will be using are:
- 1) lxml: lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language. (Refer to the lxml documentation to know more about the module.)
- 2) requests: Requests is an Apache2-licensed HTTP library written in Python. It lets you send HTTP/1.1 requests from Python and add content such as headers, form data, multipart files, and parameters using simple Python objects such as dictionaries. It also lets you access the response data in the same way.
For more information on it, see the Requests documentation.
I have used Python 2.7 here.
Make sure these modules are installed on your machine.
If they are not, you can install them from the console or command prompt using pip, for example with pip install lxml requests.
For the linked Wikipedia page, the program displays the 'Motto' section of that page's infobox (as shown in the screenshot).
Finally, running the program prints the text of the infobox's Motto row.
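The requests and lxml approach can be sketched roughly as follows. This is an illustrative version written for Python 3, not necessarily the exact program used in this article. The names `fetch_html` and `infobox_value` and the User-Agent string are placeholders, and the XPath assumes the infobox is a table whose class attribute contains `infobox` and whose rows pair a `th` header with a `td` value.

```python
import requests
from lxml import html


def fetch_html(url):
    """Download a page with requests and return its HTML text."""
    # Placeholder User-Agent; replace it with one that identifies your script.
    headers = {"User-Agent": "infobox-example/0.1"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors
    return response.text


def infobox_value(page_html, label):
    """Return the text of the infobox row whose header equals `label`, or None."""
    tree = html.fromstring(page_html)
    # `//tr` finds rows even when the table wraps them in <tbody>.
    # `$label` is an XPath variable filled in from the keyword argument.
    cells = tree.xpath(
        '//table[contains(@class, "infobox")]'
        '//tr[normalize-space(th) = $label]/td',
        label=label,
    )
    if not cells:
        return None
    # text_content() gathers text from nested tags such as <i> or <a>.
    return cells[0].text_content().strip()
```

To try it, pass the URL of an article whose infobox has a Motto row: `print(infobox_value(fetch_html(url), "Motto"))`. If the row is missing, the function returns `None` instead of raising an error.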