
OpenAI Whisper – Converting Speech to Text

Last Updated : 07 Nov, 2023

In the digital era, the demand for precise and efficient transcription of audio content is everywhere, spanning professions and purposes. Whether you’re creating subtitles, conducting research, or pursuing various other tasks, converting audio and video to text is a common requirement. While numerous modules exist for this purpose, 100% accuracy remains a challenge for machine learning models.

Enter Whisper, an open-source speech recognition model from OpenAI, available as a Python library, that stands out for its accuracy in speech-to-text conversion. This article delves into the world of Whisper, offering a comprehensive guide on how to harness its capabilities for audio transcription in Python, all without the need for external APIs.

Note: In this article, we will not use any API service or send data to a remote server for processing. Everything runs locally on your computer for free, so your audio never leaves your machine and privacy is preserved; the quality and speed of the results depend only on the chosen model and your hardware.

About OpenAI Whisper

Whisper is a general-purpose speech recognition model made by OpenAI. It is a multitasking model that can perform multilingual speech recognition, speech translation, and language identification. It was trained on 680,000 hours of multilingual and multitask supervised audio data collected from the web.

Whisper is a transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model, an architecture that has proven powerful in machine learning, particularly in natural language processing.

Architecture and Working

[Figure: Architecture and working of the Whisper model]

The Whisper architecture is a traditional encoder-decoder transformer with matching stacks of transformer blocks in the encoder and the decoder; the number of blocks scales with model size, from 4 in the tiny model to 32 in the large one. Each transformer block includes a self-attention layer and a feed-forward layer.

To establish connectivity between the encoder and decoder, a cross-attention layer is utilized. This layer allows the decoder to focus on the encoder output, aiding in the generation of text tokens that align with the audio signal. To develop its transcription capabilities, Whisper is trained on a vast dataset containing multilingual audio and text data. This extensive training data equips Whisper with the ability to transcribe speech proficiently in different languages and accents, even in environments plagued by noise.

Whisper employs a two-step process when handling audio input. First, it divides the input into 30-second segments, padding shorter audio with silence. Next, each segment is converted into a log-Mel spectrogram, a time-frequency representation of the audio signal. These spectrograms are then fed to the encoder, which is responsible for learning to extract features from the audio signal.

The output of the encoder is subsequently forwarded to the decoder. The decoder’s role is to learn how to produce text captions using the audio features. Throughout this process, the decoder employs an attention mechanism to focus on various sections of the encoder output as it generates the text captions.
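The same pipeline is exposed through Whisper’s lower-level Python API. Below is a minimal sketch mirroring the steps described above (it assumes an audio file named “audio.mp3” and the “base” model, both introduced later in this article): pad or trim the audio to a 30-second window, compute its log-Mel spectrogram, and let the decoder generate text from the encoded features.

Python3

import whisper

# Load the model and the raw audio waveform
model = whisper.load_model("base")
audio = whisper.load_audio("audio.mp3")

# Step 1: pad or trim the waveform to the 30-second window Whisper expects
audio = whisper.pad_or_trim(audio)

# Step 2: compute the log-Mel spectrogram and move it to the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Step 3: the decoder generates text tokens from the encoded audio features
options = whisper.DecodingOptions(fp16=False)  # fp16=False for CPU-only machines
result = whisper.decode(model, mel, options)
print(result.text)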

Utilizing Whisper for Audio Transcription

Tools Required

  • A text editor (VS Code is recommended)
  • The latest version of Python
  • Alternatively, Google Colab (no local setup required)

Steps to follow

If you are working locally, create a new folder, open VS Code, and create the file ‘app.py’ in the folder you have just created.


Installing Package

To import Whisper and use it to transcribe, we first need to install it on our local machine (an internet connection is required during installation). Run the following command in the terminal; in Google Colab or a Jupyter notebook, prefix it with an exclamation mark (!):

pip install openai-whisper

Installing FFmpeg (For Windows)

Install FFmpeg and add it to your machine’s PATH environment variable. On Windows this manual step is needed to avoid a ‘File Not Found’ error, since Whisper relies on FFmpeg to decode audio. If you are working in Google Colab, you can skip it, as FFmpeg comes preinstalled there.

Refer to this post to install ffmpeg on Windows.
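As an optional sanity check, you can verify from Python that FFmpeg is discoverable on your PATH before running Whisper. This is just a convenience sketch using the standard library; the path printed will vary by machine.

Python3

import shutil

# Whisper shells out to ffmpeg to decode audio files,
# so it must be discoverable on the PATH
path = shutil.which("ffmpeg")
if path is None:
    print("ffmpeg not found - install it and add it to your PATH")
else:
    print("ffmpeg found at:", path)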

Importing Package

Import the package using the following code:

Python3

import whisper


Loading Model

You can load the model to process the audio using the load_model function, with the model size as the parameter. Below is the code for it:

Python3

model = whisper.load_model("base")


Note: Here we are using the “base” model, which is the second-smallest of the available sizes (tiny, base, small, medium, large). The bigger the model, the larger its download size and the better its accuracy. If you want the best accuracy and have good resources (RAM, CPU/GPU), go for the “large” model; otherwise “base” and “medium” provide decent results. The base model download is about 139 MB.
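If you want to be explicit about where the model runs, load_model also accepts an optional device argument (Whisper picks CUDA automatically when a GPU is available). A small sketch:

Python3

import whisper

# Force CPU execution; pass "cuda" instead to use an available GPU.
# Swap "base" for "tiny", "small", "medium", or "large" to trade
# speed and memory for accuracy.
model = whisper.load_model("base", device="cpu")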

Transcribing Audio

After loading the model, you can use the transcribe function to process the input audio and convert it into text. Its one required parameter is the path to the audio file to be transcribed.

Here is the code for it.

Note: The audio file should be present in the current working directory; otherwise, pass its full (absolute) path.

Python3

transcription = model.transcribe("file_name.mp3")


Optional Parameters

Some additional useful parameters we can pass while transcribing audio are:

  • word_timestamps: when set to True, the output includes each word along with the start and end times at which it was spoken.
  • initial_prompt: handy for steering the style of the transcription. For example, if the output lacks punctuation, passing a punctuated prompt such as “Hello, include punctuation!” nudges the model to punctuate its output.
  • temperature: controls the randomness of the decoding; 0 gives deterministic output, while higher values yield more varied transcriptions.
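The sketch below combines the three parameters above in a single call. It assumes the model loaded earlier and a file named “file_name.mp3”; the exact segments printed will depend on your audio.

Python3

transcription = model.transcribe(
    "file_name.mp3",
    word_timestamps=True,  # include per-word start/end times in each segment
    initial_prompt="Hello, include punctuation!",  # nudge punctuation style
    temperature=0.0,  # deterministic decoding
)

# Word-level timings live inside each segment's "words" list
for segment in transcription["segments"]:
    for word in segment["words"]:
        print(f'{word["start"]:.2f}s - {word["end"]:.2f}s:{word["word"]}')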

Getting Result

After processing is done, the result is stored in the transcription variable: a dictionary containing the transcribed text along with metadata about the transcription. We can extract the final text using the code below:

Python3

print(transcription["text"])
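Beyond the plain text, the result dictionary also exposes the detected language and a list of timestamped segments. A short sketch of reading that metadata:

Python3

# Detected language code, e.g. "en"
print(transcription["language"])

# Each segment carries its start/end times and text
for segment in transcription["segments"]:
    print(f'[{segment["start"]:.2f}s -> {segment["end"]:.2f}s]{segment["text"]}')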


Complete Code

Here is the complete code for transcribing audio into text. You can download the audio file from here.

Python3

import whisper

# Load the pretrained "base" checkpoint (downloaded on first run)
model = whisper.load_model("base")

# Transcribe the audio file (in the working directory, or an absolute path)
transcription = model.transcribe("audio.mp3")

# Print the transcribed text from the result dictionary
print(transcription["text"])


Output:

Transcription: All in all, everyone, this audio is for demo purposes to show how whisper transforms the audio data into text. Thank you.

Applications

Whisper has a range of applications, such as:

  • Speech Recognition: Whisper enables the conversion of audio recordings into written text. This functionality proves valuable in generating transcripts for various contexts like meetings, lectures, and other audio recordings.
  • Speech Translation: Whisper facilitates the translation of spoken language from one language to another. This capability is particularly helpful in communication with individuals who speak different languages.
  • Language Detection: Whisper can be utilized to identify the language present in an audio recording. This feature is beneficial in determining the language used in a video or audio clip.
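The translation and language-detection capabilities listed above are available through the same Python API. Here is a hedged sketch, reusing the “audio.mp3” file from the complete example: task="translate" produces English text from speech in a supported language, and detect_language works on a 30-second log-Mel spectrogram.

Python3

import whisper

model = whisper.load_model("base")

# Speech translation: transcribe non-English speech directly into English
result = model.transcribe("audio.mp3", task="translate")
print(result["text"])

# Language identification on the first 30 seconds of the clip
audio = whisper.pad_or_trim(whisper.load_audio("audio.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))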

Overall, Whisper packs speech recognition, translation, and language identification into one model that runs locally with just a few lines of Python.


