OpenAI Whisper

Last Updated : 11 Mar, 2024

Today, data is available in many forms: tables, images, text, audio, and video. We use this data to gain insights and make predictions using various machine learning and deep learning techniques. While mature techniques exist for working with tables, images, text, and video, far fewer exist for audio, and extracting information directly from audio data is still not easy. Fortunately, audio can be converted to textual data, which allows that information to be extracted. There are many tools available to convert audio to text; one such tool is Whisper.

What is Whisper?

Whisper is a general-purpose speech recognition model. It is a multi-task model capable of speech recognition in many languages, speech translation, and language identification. Because it was trained on vast amounts of multilingual, multi-task supervised data, Whisper can distinguish and understand a wide range of accents, dialects, and speech patterns. Thanks to this extensive training, Whisper delivers accurate and contextually relevant transcriptions even in challenging acoustic environments. Its versatility makes it suitable for a wide range of uses, such as converting audio recordings into text, enabling real-time transcription during live events, and fostering seamless communication between speakers of different languages.

Whisper not only has a lot of potential to increase efficiency and accessibility, but it also contributes to bridging the communication gap between various industries. Experts in fields like journalism, customer service, research, and education can benefit from its versatility and accuracy as a tool since it helps them streamline their procedures, gather important data, and promote effective communication.

Whisper Model Details

Whisper is an encoder-decoder (sequence-to-sequence) Transformer trained on a large amount of labelled speech data for tasks such as speech recognition and speech translation. Pre-trained checkpoints for Whisper are available on the Hugging Face Hub, which is certainly beneficial for researchers and developers looking to leverage these models in their own applications.
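
These checkpoints can be loaded through the Hugging Face transformers library. Below is a minimal sketch (assuming transformers and its audio dependencies such as ffmpeg are installed, and that a local file named sample.wav exists; openai/whisper-small is one of the published checkpoints):

Python3

from transformers import pipeline

# load a pre-trained Whisper checkpoint from the Hugging Face Hub
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# transcribe a local audio file; the pipeline handles decoding and resampling
result = asr("sample.wav")
print(result["text"])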

How Does OpenAI Whisper Work?

Whisper is a single sequence-to-sequence model trained on a massive dataset of paired audio and text. Here’s a simplified explanation of how it works:

  1. Audio Preprocessing: The audio input is resampled to 16 kHz, divided into 30-second segments, and converted into log-Mel spectrograms (visual representations of audio frequencies over time).
  2. Feature Extraction: The Transformer encoder extracts relevant features from the spectrograms, capturing linguistic and acoustic information.
  3. Language Identification: If the input language is unknown, the model predicts it from the supported languages.
  4. Speech Recognition: The decoder predicts the most likely sequence of words that corresponds to the extracted features.
  5. Translation (Optional): If translation is requested, a special task token instructs the decoder to emit the text in English rather than in the source language.
  6. Post-processing: The output is refined using language rules and heuristics to improve accuracy and readability.
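
These stages can be seen directly in the open-source whisper package (installed with pip install openai-whisper). The sketch below is adapted from the project’s README and assumes a local file named sample.mp3 and enough memory to run the base checkpoint:

Python3

import whisper

model = whisper.load_model("base")

# load the audio and pad/trim it to fit a 30-second window
audio = whisper.load_audio("sample.mp3")
audio = whisper.pad_or_trim(audio)

# make a log-Mel spectrogram and move it to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio into text
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)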

Benefits of Using OpenAI Whisper

  • High Accuracy: Whisper achieves state-of-the-art results in speech-to-text and translation tasks, particularly in domains like podcasts, lectures, and interviews.
  • Multilingual Support: It handles over 57 languages for transcription and can translate from 99 languages to English.
  • Robustness to Noise and Accents: Whisper is relatively good at handling background noise, different accents, and technical jargon.
  • Open-Source Availability: The model and inference code are open-source, allowing for customization and research contributions.
  • API and Cloud Options: It has both a free command-line tool and a paid API for cloud-based processing, offering flexibility for different use cases.
  • Cost-Effectiveness: The API pricing is competitive compared to other speech-to-text solutions.

How to use OpenAI API for Whisper in Python?

Step 1: Install the OpenAI library in your Python environment

!pip install -q openai

Step 2: Import the openai library and set your API key

Import the openai library and assign your generated API key by replacing “YOUR_API_KEY” with your own key in the code below.

Python3

import openai

# add your API key here
openai.api_key = "YOUR_API_KEY"
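
As a common alternative (not required, just a safer habit), you can keep the key out of your source code by reading it from an environment variable:

Python3

import os
import openai

# read the key from the OPENAI_API_KEY environment variable
openai.api_key = os.getenv("OPENAI_API_KEY")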


Step 3: Open your audio file and pass it to the desired module

There are two modules available for Whisper:

1. Transcribe: This module transcribes your audio file in its original language. Model parameters for this module are:

  • file [required]: The audio file to transcribe, in one of these formats: mp3, mp4, mpeg, mpga, m4a, wav, or webm.
  • model [required]: ID of the model to use. Only whisper-1 is currently available.
  • prompt [optional]: An optional text to guide the model’s style or continue a previous audio segment. The prompt should match the audio language.
  • response_format [optional]: The format of the transcript output, in one of these options: json, text, srt, verbose_json, or vtt.
  • temperature [optional]: The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
  • language [optional]: The language of the input audio. Supplying the input language in ISO-639-1 format will improve accuracy and latency.

# opening the audio file in binary read mode
audio_file = open("FILE LOCATION", "rb")

# calling the transcribe module with the model name and the file object;
# whisper-1 is currently the only model available for speech-to-text conversion
transcript = openai.Audio.transcribe(file=audio_file, model="whisper-1")
transcript
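
The optional parameters listed above are passed as additional keyword arguments. A sketch, keeping the pre-1.0 openai SDK style used throughout this article (the parameter values here are only examples):

Python3

# request SRT subtitles instead of JSON, with more deterministic sampling
transcript = openai.Audio.transcribe(
    file=audio_file,
    model="whisper-1",
    response_format="srt",
    language="en",    # ISO-639-1 code of the input audio
    temperature=0.2,
)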

2. Translate: This module translates your audio file into English. Model parameters for this module are:

  • file [required]: The audio file to translate, in one of these formats: mp3, mp4, mpeg, mpga, m4a, wav, or webm.
  • model [required]: Model name which you wish to use. Only whisper-1 is currently available.
  • prompt [optional]: An optional text to guide the model’s style or continue a previous audio segment. The prompt should be in English.
  • response_format [optional]: The format of the transcript output, in one of these options: json, text, srt, verbose_json, or vtt.
  • temperature [optional]: The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

# opening the audio file in binary read mode
audio_file = open("FILE LOCATION", "rb")

# calling the translate module with the model name and the file object;
# whisper-1 is currently the only model available for speech-to-text conversion
transcript = openai.Audio.translate(file=audio_file, model="whisper-1")
transcript

Note: The audio file size should not be larger than 25 MB. If the file is larger than 25 MB, you should break it into smaller chunks.
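
One way to split a long recording is shown below (a sketch assuming the third-party pydub library and ffmpeg are installed; neither is part of the OpenAI SDK, and long_recording.mp3 is a placeholder file name):

Python3

from pydub import AudioSegment

audio = AudioSegment.from_mp3("long_recording.mp3")

# pydub indexes audio in milliseconds; split into 10-minute chunks
chunk_length = 10 * 60 * 1000
for i, start in enumerate(range(0, len(audio), chunk_length)):
    chunk = audio[start:start + chunk_length]
    chunk.export(f"chunk_{i}.mp3", format="mp3")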

Example Implementation of Whisper using OpenAI in Python

1. Implementing Transcribe module

Audio we will be using for trying out the Transcribe module:

We will execute the following code to see the results:

Python3

# transcribe using the openai module
path = "path of the audio file"
audio_file = open(path, "rb")
transcript = openai.Audio.transcribe("whisper-1", audio_file)
transcript['text']


Output:

Do you miss the interactive environment of a classroom and face-to-face interaction with an expert or a mentor? If you do, then I have great news for you. GeeksforGeeks is starting a classroom program in Noida and I am here to invite you for the same. We are going to begin our classroom program on full stack development, where we are going to focus on skills that are required to make you employable and personalized learning to help you achieve your goals. We encourage you to sign up and be a part of this new exciting journey. So see you at the classes.

2. Implementing Translate module

Audio we will be using for trying out the Translate module:

We will execute the following code to see the results:

Python3

# translate using the openai module
audio_file = open("/content/q-qkQfAMHGw_128.mp3", "rb")
transcript = openai.Audio.translate("whisper-1", audio_file)
transcript['text']


Output:

Prompt engineering is a word that you must have heard somewhere. But do you know what is its exact use? And where is it used exactly in the software industry? If not, then let's know. Number 1, Rapid innovation. So any company wants to develop and deploy its new product as soon as possible. And give new services to its customers as soon as possible. So that it remains competitive in its entire tech market. So here prompt engineering comes in a lot of use. Number 2 is cost saving. So prompt engineering allows any company to save its total time and cost. Apart from this, the entire development process streamlines it. Due to which the time to develop the product is reduced and its cost is reduced. Number 3 is demand for automation. So whatever you see in your environment today, everyone wants their entire process to be automated. And prompt engineering allows this. It allows to make such systems that totally automate the process that is going on in your company. So now you know the importance of prompt engineering. If you know more important things than this, then quickly comment below.

Frequently Asked Questions (FAQs)

Q. What is Whisper AI used for?

Whisper AI is a multi-task model that is capable of speech recognition in many languages, voice translation, and language detection.

Q. Is Whisper AI free to use?

Unlike GPT and DALL-E, Whisper is an open-source and free model.

Q. What is the Whisper model?

Whisper is an automatic speech recognition model trained on 680,000 hours of multilingual data collected from the web.

Q. Does Whisper accept .mp4 files?

Yes, you can use Whisper on audio files with the extensions mp3, mp4, mpeg, mpga, m4a, wav, or webm.

Q. Where can I find the documentation for Whisper model?

You can find the Readme file in their GitHub repository [https://github.com/openai/whisper].

Q. Is Whisper model different from OpenAI Whisper?

No. The OpenAI Whisper API serves the same underlying Whisper model, so both offer the same functionality.

Conclusion

In this article, we discussed Whisper and how it can be used to transform audio data into textual data. This textual data can then be used to gain insights and to apply machine learning or deep learning algorithms. Whisper promises to open up new opportunities for voice technology as its capabilities develop, making voice-driven applications more effective, inclusive, and user-friendly. By utilising AI, Whisper raises the bar for speech recognition and transcription, enabling people and organisations to communicate more effectively in a quickly changing digital environment.


