Determining file format using Python

The usual way to recognize a file's type is to look at its extension, but this is not always reliable. Associating an extension with a file type is a convention relied on mainly by some operating system families (predominantly Windows). Other operating systems, such as Linux and its variants, rely on magic numbers instead. A magic number is a constant value, usually stored in the first few bytes of a file, that identifies its format. This approach gives more flexibility in naming files and does not require an extension. Magic numbers are useful because a file may have an incorrect extension, or none at all.
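To make the idea of a magic number concrete, here is a minimal sketch that checks the leading bytes of some data against a handful of well-known signatures. The table and the function name sniff are illustrative assumptions and cover only four formats; real detectors use a far larger database.

```python
# A few well-known file signatures (magic numbers) and a label for each.
# This table is a small illustrative sample, not a complete database.
SIGNATURES = {
    b"\xff\xd8\xff": "JPEG image",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
}


def sniff(data: bytes) -> str:
    """Return the label of the first signature that prefixes data."""
    for signature, label in SIGNATURES.items():
        if data.startswith(signature):
            return label
    return "unknown"


print(sniff(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))  # PNG image
print(sniff(b"<html></html>"))                    # unknown
```

Formats such as HTML have no fixed byte signature, which is why the second call reports unknown; a detector needs content heuristics to handle those.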

In this article we will learn how to recognize files by their contents rather than their extension, using Python. We will use the python-magic library to add this capability to our program. To install the library, execute the following command in your operating system's command interpreter:

pip install python-magic

For demonstration purposes, we will use a file named apple.jpg whose contents are actually HTML markup. Because the file is saved with a .jpg extension, an operating system that relies on extensions won't recognize its actual type, which makes it a suitable test case for our program.
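The exact contents of apple.jpg are not shown, so the sketch below writes a hypothetical stand-in: a tiny HTML document with CRLF line endings saved under a .jpg name. It then asks the standard library's mimetypes module, which looks only at the extension, what it thinks the file is. SAMPLE_HTML and write_sample are names made up for this illustration.

```python
import mimetypes
import os
import tempfile

# Hypothetical stand-in for the article's sample file: HTML markup with
# CRLF line endings. The real contents of apple.jpg are an assumption here.
SAMPLE_HTML = (
    b"<!DOCTYPE html>\r\n"
    b"<html>\r\n"
    b"<body>\r\n"
    b"<p>Hello</p>\r\n"
    b"</body>\r\n"
    b"</html>\r\n"
)


def write_sample(path: str) -> None:
    """Write the HTML bytes to path, whatever extension it carries."""
    with open(path, "wb") as fh:
        fh.write(SAMPLE_HTML)


# Use a temporary directory so nothing in the working directory is overwritten.
demo_path = os.path.join(tempfile.mkdtemp(), "apple.jpg")
write_sample(demo_path)

# mimetypes inspects only the file name, so the .jpg extension misleads it.
guessed_type, _ = mimetypes.guess_type(demo_path)
print(guessed_type)  # image/jpeg
```

To follow along with the rest of the article, calling write_sample('apple.jpg') would place a comparable file in the current directory.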



Python3

import magic

# print the human-readable type of the file
print(magic.from_file('apple.jpg'))

# print the MIME type of the file
print(magic.from_file('apple.jpg', mime=True))


Output:

HTML document, ASCII text, with CRLF line terminators
text/html

Explanation:

First we import the magic library. Then we call the magic.from_file() function to obtain the human-readable file type. Finally, we call it again with the mime=True argument to obtain the MIME type of the file.

Things to consider while using the above code:

  • The library is a wrapper around libmagic, so the code works wherever libmagic is available, typically Linux and macOS. Those operating systems also ship a built-in terminal command named file, which provides the same functionality as this program without installing any additional library.
  • Newer versions of the library can also report the file extensions typically associated with the detected file type.
  • Since file type recognition generally works by matching a fingerprint in the header of the file, it is not necessary to load the whole file. The initial bytes of the file can be read with open('file.ext', 'rb').read(n) and passed to magic.from_buffer() instead. This is only recommended if you know the header format of the file type, because too few bytes can lead to an incorrect identification; the library's documentation suggests reading at least the first 2048 bytes.
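The last point can be sketched as follows. The helper names read_header and mime_from_header, and the HEADER_BYTES constant, are assumptions made for this example; only magic.from_buffer() comes from the library. The magic import sits inside the function so that the header-reading helper remains usable on systems where libmagic is not installed.

```python
HEADER_BYTES = 2048  # number of leading bytes handed to the detector


def read_header(path, n=HEADER_BYTES):
    """Return at most the first n bytes of the file at path."""
    with open(path, "rb") as fh:
        return fh.read(n)


def mime_from_header(path, n=HEADER_BYTES):
    """Detect the MIME type from the leading bytes only.

    Requires python-magic (and the libmagic library it wraps).
    """
    import magic  # deferred import: read_header() works without libmagic

    return magic.from_buffer(read_header(path, n), mime=True)
```

With the library installed, mime_from_header('apple.jpg') would be expected to report the same MIME type as the from_file() call shown earlier, while reading only a fixed-size prefix of the file.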
