Python | Parse a website with regex and urllib

Let’s discuss the concept of parsing using Python. Python has many modules, but for basic parsing we only need two from the standard library: urllib and re (regular expressions). Using these two libraries we can fetch a web page and extract data from it.

Note that "parsing" a website here means fetching its whole source code: given a URL, the request returns the page as a bulk of raw HTML that is hard to read on its own. Let’s see a demonstration with an explanation of each step to understand parsing better.
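As a minimal sketch of the fetch step, the snippet below retrieves a document and prints its source. It uses a `data:` URL as a stand-in page so it runs without network access; with a real site you would pass its `https://...` address instead (the snippet and its URL are illustrative, not from the original article).

```python
import urllib.request

# A data: URL is served by urllib itself, so this sketch needs no network;
# in practice, replace the argument with a real 'https://...' address.
page = urllib.request.urlopen('data:text/html,<html><p>Hello</p></html>')
source = page.read()           # raw bytes of the document source
print(source.decode('utf-8'))  # the HTML as ordinary text
```

With a real page, `source` is the bulk of HTML mentioned above.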

Code #1: Libraries needed

# importing libraries
import urllib.request
import urllib.parse
import re


Code #2:

# placeholder search URL: substitute the endpoint of the site you want to query
url = 'https://example.com/search/'

values = {'s':'python programming',
          'submit':'search'}


Here we define a URL and the values we want to search for. The values are stored as a dictionary; in this key-value pair, 'python programming' is the term that will be searched for on the defined URL.
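To see what this dictionary becomes on the wire, `urllib.parse.urlencode()` can be run on it directly: it produces the query string that gets submitted to the search URL, with spaces encoded as `+`.

```python
import urllib.parse

# The same dictionary as above; urlencode() turns it into a query string.
values = {'s': 'python programming',
          'submit': 'search'}

query = urllib.parse.urlencode(values)
print(query)  # s=python+programming&submit=search
```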



Code #3:

data = urllib.parse.urlencode(values)    # build the query string
data = data.encode('utf-8')              # convert it to bytes
req = urllib.request.Request(url, data)  # build a request for the URL
resp = urllib.request.urlopen(req)       # send it and open the response

respData = resp.read()                   # read the whole response body


In the first line we URL-encode the values defined earlier; in the second line we convert that string into bytes, the form the request machinery requires. The third line builds a Request object for the defined URL with that data attached, and urlopen() then sends the request and opens the resulting web document (the HTML).
In the last line, read() reads the whole document and assigns it to the variable named respData.
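These steps can be verified without touching the network, because the Request object records everything before it is sent. The sketch below (with a hypothetical placeholder URL) only constructs the request; note that attaching data turns it into a POST.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint: nothing is sent here, the request is only built.
url = 'https://example.com/search/'

values = {'s': 'python programming', 'submit': 'search'}
data = urllib.parse.urlencode(values)    # str: the query string
data = data.encode('utf-8')              # bytes, as the request requires
req = urllib.request.Request(url, data)

print(req.get_method())  # attaching data makes this a POST request
print(req.data)          # the encoded payload carried by the request
```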

Code #4:

# decode the response bytes, then capture the text of each <p>...</p>
paragraphs = re.findall(r'<p>(.*?)</p>', respData.decode('utf-8'))

for eachP in paragraphs:
    print(eachP)


In order to extract the relevant data we apply a regular expression. The second argument of re.findall() must be a string, so the response bytes are converted first; decoding with .decode('utf-8') is preferable to wrapping the bytes in str(), which would keep the b'...' prefix and escape sequences in the text. To print the matches we use a simple loop with print().
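The pattern can be exercised on a small hand-written HTML snippet standing in for respData (the snippet is illustrative, not from a real page):

```python
import re

# A tiny HTML snippet standing in for the fetched response body.
sample = b'<html><p>First paragraph</p><p>Second paragraph</p></html>'

# The non-greedy (.*?) makes each match stop at the first closing </p>
# instead of swallowing everything up to the last one.
paragraphs = re.findall(r'<p>(.*?)</p>', sample.decode('utf-8'))
print(paragraphs)  # ['First paragraph', 'Second paragraph']
```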
 
Below are a few examples:

Example #1:

import urllib.request
import urllib.parse
import re

# placeholder search URL: substitute the endpoint of the site to query
url = 'https://example.com/search/'

values = {'s':'python programming',
          'submit':'search'}

data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
req = urllib.request.Request(url, data)
resp = urllib.request.urlopen(req)
respData = resp.read()

paragraphs = re.findall(r'<p>(.*?)</p>', respData.decode('utf-8'))

for eachP in paragraphs:
    print(eachP)


Output:

 

Example #2:

import urllib.request
import urllib.parse
import re

# placeholder search URL: substitute the endpoint of the site to query
url = 'https://example.com/search/'

values = {'s':'pandas',
          'submit':'search'}

data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
req = urllib.request.Request(url, data)
resp = urllib.request.urlopen(req)
respData = resp.read()

paragraphs = re.findall(r'<p>(.*?)</p>', respData.decode('utf-8'))

for eachP in paragraphs:
    print(eachP)


Output:

 


