How To Extract Data From Common File Formats in Python?

Last Updated : 13 Jan, 2021

If you have worked with datasets before, you have probably worked mostly with .csv (Comma Separated Values) files. They are a great starting point for applying Data Science techniques and algorithms. Sooner or later, though, many of us will join Data Science firms or take up real-world Data Science projects, and in those projects the data is rarely available as a neat .csv file. Instead, we have to extract it from different sources such as images, PDF files, Word documents, etc. This article is a starting point for tackling those situations.

Below we will see how to extract relevant information from multiple such sources.

1. Multiple Sheet Excel Files

Note that pd.read_csv() cannot parse an Excel workbook, so use pd.read_excel() instead. Called with only a file name (pd.read_excel('File.xlsx')), it is enough for a single-sheet file. For a multiple-sheet file, such as the one in the image below with 3 sheets (Sheet1, Sheet2, Sheet3), that call returns only the first sheet.


Example: We will read the second sheet of this Excel file.

Python3

# import Pandas library 
import pandas as pd 
  
# Read our file. Here sheet_name=1 
# means we are reading the 2nd sheet or Sheet2 
df = pd.read_excel('Sample1.xlsx', sheet_name = 1) 
df.head()

Output:

Now let’s read selected columns of the same sheet:

Python3

# Read only columns A, B, C out of
# the four columns A, B, C, D in Sheet2
df = pd.read_excel('Sample1.xlsx',
                   sheet_name = 1, usecols = 'A:C')
df.head()

Output:

Now let’s read all the sheets together:

Sheet1 contains columns A, B, C; Sheet2 contains A, B, C, D and Sheet3 contains B, D. We will see a simple example below on how to read all the 3 sheets together and merge them into common columns.

Python3

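# sheet_name = None reads every sheet and returns
# a dict of the form {sheet name: DataFrame}
df = pd.read_excel('Sample1.xlsx', sheet_name = None)

# Stack the sheets row-wise. Columns missing
# from a sheet are filled with NaN
df2 = pd.concat(list(df.values()),
                axis = 0)

# In a Jupyter notebook, display(df2) also works
print(df2)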

Output:
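To see what this merge does without needing the workbook, here is a minimal sketch. The three in-memory frames are made-up stand-ins for the dict that pd.read_excel(..., sheet_name = None) returns, and the extra sheet column is an addition for illustration only, not part of the example above.

```python
import pandas as pd

# Hypothetical stand-in for the dict of sheets returned by
# pd.read_excel('Sample1.xlsx', sheet_name=None); the values are made up.
sheets = {
    "Sheet1": pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]}),
    "Sheet2": pd.DataFrame({"A": [7], "B": [8], "C": [9], "D": [10]}),
    "Sheet3": pd.DataFrame({"B": [11], "D": [12]}),
}

# Tag every row with the sheet it came from, then stack the sheets.
# Columns that a sheet does not have are filled with NaN.
merged = pd.concat(
    [frame.assign(sheet=name) for name, frame in sheets.items()],
    ignore_index=True,
)

print(merged)
```

Keeping the sheet name in a column makes it possible to trace each row back to its source after the merge.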

2. Extract Text From Images

Now we will discuss how to extract text from images.

To give our Python program character recognition capabilities, we will use the pytesseract OCR library. It can be installed into the Python environment by running the following command in the command interpreter of the OS:

pip install pytesseract

pytesseract is only a wrapper, so the Tesseract OCR engine itself must also be installed. On Windows this means installing the tesseract.exe binary. The installer prompts for an installation path; note it down, because the code below needs it. Typical locations are C:\Program Files\Tesseract-OCR\tesseract.exe and C:\Program Files (x86)\Tesseract-OCR\tesseract.exe.

Image for demonstration:

Python3

# We import necessary libraries.  
# The PIL Library is used to read the images 
from PIL import Image 
import pytesseract 
  
# Read the image 
image = Image.open(r'pic.png') 
  
# Point pytesseract to the Tesseract executable.
# Replace the path below with the location of the
# tesseract.exe file on your system
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Perform the information extraction from the image
print(pytesseract.image_to_string(image))

Output:

GeeksforGeeks
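Since the location of the Tesseract executable differs between machines, hard-coding it can be fragile. The sketch below is one possible way to look it up. The helper name find_tesseract and the candidate list are assumptions made for illustration, and it uses only the standard library.

```python
import os
import shutil

# Typical Windows install locations. These are assumptions;
# adjust them to match your own installation.
DEFAULT_CANDIDATES = (
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
)


def find_tesseract(candidates=DEFAULT_CANDIDATES):
    """Return a path to the Tesseract executable, or None if not found."""
    # Prefer an executable that is already on the PATH
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to the first candidate that exists
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


print(find_tesseract())
```

If the helper returns a path, it can be assigned to pytesseract.pytesseract.tesseract_cmd in place of the hard-coded string.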

3. Extracting text from Doc File

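Here we will extract text from a Word (.docx) file using the python-docx library, which is imported as docx. Note that it reads the .docx format only, not the legacy .doc format.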

For installation:

pip install python-docx

Image for demonstration: Aniket_Doc.docx

Example 1: First we’ll extract the title:

Python3

# Importing our library and reading the doc file 
import docx 
doc = docx.Document('csv/g.docx') 
  
# Printing the title 
print(doc.paragraphs[0].text) 

Output:

My Name Aniket

Example 2: Next, we’ll extract all the paragraph text (excluding the table).

Python3

# Getting the text of every paragraph in the doc file
l = [p.text for p in doc.paragraphs]

# There might be many useless empty
# strings present, so remove them
l = [i for i in l if len(i) != 0]
print(l)

Output:

['My Name Aniket', ' Hello I am Aniket', 'I am giving tutorial on how to extract text from MS Doc.', 'Please go through it carefully.']

Example 3: Now we’ll extract the table:

Python3

# Since there is only one table in
# our doc file we are using index 0. For multiple
# tables you can use a suitable for loop
table = doc.tables[0]

# Initializing an empty list to hold the rows
list2 = []

# Looping through each row of the table
for row in table.rows:

    # Extracting the text of every cell in the row
    list2.append([cell.paragraphs[0].text for cell in row.cells])

print(list2)

Output:

[['A', 'B', 'C'], ['12', 'aNIKET', '@@@'], ['3', 'SOM', '+12&']]
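The nested list above still mixes the header row with the data rows. As a small sketch of a possible next step, the first row can be treated as the header and every other row turned into a dictionary. The variable names header, rows and records are chosen here for illustration.

```python
# Nested list as printed by the table example above
list2 = [['A', 'B', 'C'], ['12', 'aNIKET', '@@@'], ['3', 'SOM', '+12&']]

# Treat the first row as the header and the rest as data rows
header, *rows = list2

# Pair each cell with its column name
records = [dict(zip(header, row)) for row in rows]
print(records)
# [{'A': '12', 'B': 'aNIKET', 'C': '@@@'}, {'A': '3', 'B': 'SOM', 'C': '+12&'}]
```

If pandas is available, pd.DataFrame(rows, columns=header) builds a DataFrame from the same two pieces.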

4. Extracting Data From PDF File

The task is to extract data (text and images) from a PDF in Python. We will extract the text first, then extract the images and save them, using the PyMuPDF library. First, install PyMuPDF along with Pillow:

pip install PyMuPDF Pillow

Example 1:

Now we will extract data from the pdf version of the same doc file.

Python3

# import module (PyMuPDF is imported as fitz)
import fitz

# Reading our pdf file
docu = fitz.open('file.pdf')

# Initializing an empty list where we will put all text
text_list = []

# Looping through all pages of the pdf file
# (older PyMuPDF releases used the names
# pageCount, loadPage and getText)
for i in range(docu.page_count):

  # Loading each page
  pg = docu.load_page(i)

  # Extracting text from each page
  pg_txt = pg.get_text('text')

  # Appending text to the empty list
  text_list.append(pg_txt)

# Cleaning the text of the first page by removing useless
# empty strings and unicode character '\u200b'
text_list = [i.replace('\u200b', '') for i in text_list[0].split('\n') if len(i.strip()) != 0]
print(text_list)

Output:

['My Name Aniket ', ' Hello I am Aniket ', 'I am giving tutorial on how to extract text from MS Doc. ', 'Please go through it carefully. ', 'A ', 'B ', 'C ', '12 ', 'aNIKET ', '@@@ ', '3 ', 'SOM ', '+12& ']
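The strings in the output above still carry leading and trailing spaces. A minimal sketch of a slightly stricter cleaning step is shown below. The helper name clean_lines and the sample string are made up for illustration, and the function works on any page text, not only the first page.

```python
def clean_lines(page_text):
    """Split page text into lines, dropping zero-width spaces and blanks."""
    cleaned = []
    for line in page_text.split("\n"):
        # Remove the zero-width space, then surrounding whitespace
        line = line.replace("\u200b", "").strip()
        if line:
            cleaned.append(line)
    return cleaned


# Made-up sample that imitates the text of a PDF page
sample = "My Name Aniket \n\u200b\n Hello I am Aniket \n\nA \nB \n"
print(clean_lines(sample))
# ['My Name Aniket', 'Hello I am Aniket', 'A', 'B']
```

To clean a whole document, the helper can be applied to the text of each page in turn and the resulting lists joined together.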

Example 2: Extract image from PDF.

Python3

# Iterating through the pages
for current_page in range(len(docu)):

  # Getting the images in that page
  # (older PyMuPDF releases used getPageImageList)
  for image in docu.get_page_images(current_page):

    # get the XREF of the image. The XREF is the
    # cross-reference number that identifies
    # the image object inside the PDF
    xref = image[0]

    # extract the object i.e,
    # the image in our pdf file at that XREF
    pix = fitz.Pixmap(docu, xref)

    # Storing the image as .png
    # (older PyMuPDF releases used writePNG)
    pix.save('page %s - %s.png' % (current_page, xref))

The image is stored in the current working directory with a name of the form page <page_no> - <xref>.png. In our case, its name is page 0 - 7.png.
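One pitfall when saving PDF images as PNG is that some of them are stored in CMYK, which PNG cannot hold. The sketch below wraps the extraction in a function and converts such images to RGB first, following the usual PyMuPDF recipe. The function names image_filename and save_pdf_images are chosen for illustration, and fitz is imported inside the function so that the naming helper can be tried without PyMuPDF installed.

```python
def image_filename(page_no, xref):
    """Build the file name 'page <page_no> - <xref>.png'."""
    return "page %s - %s.png" % (page_no, xref)


def save_pdf_images(pdf_path):
    """Save every image of the PDF as PNG and return the file names."""
    import fitz  # PyMuPDF, imported lazily (see the note above)

    saved = []
    docu = fitz.open(pdf_path)
    for page_no in range(len(docu)):
        for image in docu.get_page_images(page_no):
            xref = image[0]
            pix = fitz.Pixmap(docu, xref)
            # More than 3 colour components (e.g. CMYK) cannot be
            # written as PNG, so convert to RGB first
            if pix.n - pix.alpha > 3:
                pix = fitz.Pixmap(fitz.csRGB, pix)
            name = image_filename(page_no, xref)
            pix.save(name)
            saved.append(name)
    return saved


print(image_filename(0, 7))
# page 0 - 7.png
```

Calling save_pdf_images('file.pdf') would then write the images next to the script and return their names.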

Now let’s view the image.

Python3

# Import necessary library
import matplotlib.pyplot as plt

# Read and display the image
img = plt.imread('page 0 - 7.png')
plt.imshow(img)

# Needed when running as a script rather than in a notebook
plt.show()

Output:
