Determining file format using Python
The usual way of recognizing a file’s type is by looking at its extension, but this is not always reliable. Associating an extension with a file type is a convention enforced by some operating system families (predominantly Windows). Other operating systems, such as Linux and its variants, rely on magic numbers to recognize file types. A magic number is a constant value, usually a short byte sequence at the start of a file, that identifies the file’s format. This method allows more flexibility in naming a file and does not require an extension. Magic numbers are useful because a file may have an incorrect extension, or none at all.
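To make the idea concrete, here is a minimal sketch of magic-number detection using only the Python standard library. The signature table and the helper names sniff() and sniff_file() are illustrative assumptions, not part of any library; real tools such as libmagic use a far larger database and more elaborate rules.

```python
# A few well-known file signatures (magic numbers) and a label for each.
# This short table is illustrative only; it is not a complete database.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
]


def sniff(header: bytes) -> str:
    """Return the label of the first signature that prefixes header."""
    for signature, label in SIGNATURES:
        if header.startswith(signature):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first few bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff(handle.read(16))


print(sniff(b"%PDF-1.7\n"))        # PDF document
print(sniff(b"<!DOCTYPE html>"))   # unknown (text formats have no fixed signature here)
```

Note that the file name never enters the decision: only the leading bytes are inspected, which is why a wrong or missing extension does not matter.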
In this article we will learn how to recognize files by their magic numbers rather than by their extensions, using Python. We will use the python-magic library, a wrapper around the libmagic C library, to add this capability to our program. To install the library, run the following command in your operating system’s command interpreter:
pip install python-magic
For demonstration purposes, we will use a file named apple.jpg whose contents are actually HTML markup.
Since the file is saved with a .jpg extension, an operating system that relies on extensions won’t recognize its actual file type. This makes it a suitable test file for our program.
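A minimal sketch of such a program follows. The HTML written to apple.jpg is a hypothetical stand-in for the demonstration file, and describe_file() is an illustrative helper name. The guarded import is an assumption added so the sketch fails gracefully when python-magic or the underlying libmagic library is not installed.

```python
import tempfile
from pathlib import Path

try:
    import magic  # provided by the python-magic package
except ImportError:  # python-magic or libmagic is not available
    magic = None


def describe_file(path: str):
    """Return (description, mime_type) for path, or None without python-magic."""
    if magic is None:
        return None
    description = magic.from_file(path)           # human-readable file type
    mime_type = magic.from_file(path, mime=True)  # MIME type, e.g. text/html
    return description, mime_type


# Hypothetical stand-in content: HTML markup with CRLF line endings.
SAMPLE_HTML = b"<!DOCTYPE html>\r\n<html>\r\n<body><p>Hello</p></body>\r\n</html>\r\n"

with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "apple.jpg"  # misleading extension on purpose
    sample.write_bytes(SAMPLE_HTML)
    result = describe_file(str(sample))

if result is None:
    print("python-magic is not installed")
else:
    print(result[0])
    print(result[1])
```

The exact wording of the description may vary with the installed libmagic version.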
Output:
HTML document, ASCII text, with CRLF line terminators
text/html
First, we import the magic library. We then call magic.from_file() to obtain a human-readable description of the file type. Finally, we call it again with the mime=True argument to obtain the MIME type of the file.
Things to consider while using the above code:
- The code works on Linux and macOS. However, those operating systems already include a terminal command named file, which provides the same functionality as this program without installing any additional library.
- File type recognition using extensions also exists in the newer versions of the library.
- Since file type recognition generally works by looking up a fingerprint in the header of the file, it is not necessary to load the whole file. The initial bytes of the file can be passed to magic.from_buffer() instead, for example the result of open('file.ext', 'rb').read(n) (only recommended if you are aware of the header format of the file type).
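The last point can be sketched as follows. The helper name mime_from_header() and the sample file are illustrative assumptions, and the 2048-byte default is a cautious choice rather than a requirement: reading too few bytes can lead to a wrong identification. As before, the import is guarded in case python-magic is not installed.

```python
import tempfile
from pathlib import Path

try:
    import magic  # provided by the python-magic package
except ImportError:  # python-magic or libmagic is not available
    magic = None

HEADER_BYTES = 2048  # assumed buffer size; adjust for the formats you expect


def mime_from_header(path: str, n: int = HEADER_BYTES):
    """Guess the MIME type from the first n bytes of a file only."""
    if magic is None:
        return None
    with open(path, "rb") as handle:
        header = handle.read(n)  # the rest of the file is never loaded
    return magic.from_buffer(header, mime=True)


with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "apple.jpg"
    # Hypothetical stand-in content: HTML saved under a .jpg name.
    sample.write_bytes(b"<!DOCTYPE html>\r\n<html><body>Hello</body></html>\r\n")
    header_mime = mime_from_header(str(sample))

print(header_mime)
```

This approach is mainly useful for large files or for data that arrives as a stream, where reading everything just to identify the type would be wasteful.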