
PyTorch for Speech Recognition

Speech recognition is a transformative technology that enables computers to understand and interpret spoken language, fostering seamless interaction between humans and machines. Using machine learning techniques, speech recognition systems transcribe spoken words into text, enabling a diverse array of applications. In this article, we will see how to use PyTorch for speech recognition.

Using PyTorch For Speech Recognition

In this section, we will delve into the process of using PyTorch for speech recognition, covering essential steps from loading and preprocessing audio data to leveraging state-of-the-art models like Wav2Vec2 for transcription. Whether you're a beginner exploring the field of speech recognition or an experienced developer looking to implement advanced models, this guide will provide you with practical insights and code examples to get started with PyTorch for speech recognition tasks.

We will use PyTorch for audio processing by following these steps:

Installing Required Packages

First, install the required packages:

pip install torch torchaudio
pip install matplotlib
pip install librosa
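Once the packages are installed, a quick sanity check confirms that everything imports correctly. This is a minimal sketch; the versions printed will depend on your environment.

import torch
import torchaudio
import librosa
import matplotlib

# Print installed versions to verify the setup
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("librosa:", librosa.__version__)
print("matplotlib:", matplotlib.__version__)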

Loading and Preprocessing Audio Data

import torchaudio

# Target sample rate expected by most speech models (e.g., Wav2Vec2)
target_sample_rate = 16000

# Load audio file
waveform, sample_rate = torchaudio.load('your_audio_file.wav')

# Resample to the target sample rate
resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sample_rate)
waveform = resampler(waveform)

# Feature extraction (e.g., MFCCs) at the resampled rate
mfcc_transform = torchaudio.transforms.MFCC(sample_rate=target_sample_rate)
mfcc = mfcc_transform(waveform)

Alternatively, the same loading, resampling, and feature extraction can be done with librosa:

import librosa

target_sample_rate = 16000

# Load audio file (sr=None preserves the original sample rate)
waveform, sample_rate = librosa.load('audio_file.wav', sr=None)

# Resampling (if needed)
waveform = librosa.resample(waveform, orig_sr=sample_rate, target_sr=target_sample_rate)

# Feature extraction (e.g., MFCCs)
mfcc = librosa.feature.mfcc(y=waveform, sr=target_sample_rate)
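To sanity-check the extracted features, the MFCC matrix can be plotted as a heatmap. This is a minimal sketch assuming the librosa mfcc array from the snippet above (shape: n_mfcc x frames); for the torchaudio tensor, pass mfcc[0].numpy() instead.

import matplotlib.pyplot as plt

# Each row is one MFCC coefficient, each column one time frame
plt.figure(figsize=(10, 4))
plt.imshow(mfcc, aspect='auto', origin='lower', cmap='viridis')
plt.title('MFCC features')
plt.xlabel('Frame')
plt.ylabel('MFCC coefficient')
plt.colorbar()
plt.tight_layout()
plt.show()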

Using Wav2Vec2 Model for Speech Recognition

We will use a sample speech recording saved as your_audio.wav; any WAV file of spoken English will work.

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import matplotlib.pyplot as plt

# Load the pre-trained model and its processor
# (the processor bundles the feature extractor and the CTC tokenizer)
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

# Load audio data and resample to 16 kHz, the rate Wav2Vec2 was trained on
waveform, sample_rate = torchaudio.load("your_audio.wav")
waveform_resampled = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)

# Plot waveform and spectrogram
plt.figure(figsize=(12, 6))
plt.subplot(2, 1, 1)
plt.plot(waveform.t().numpy())
plt.title('Waveform')
plt.xlabel('Sample')
plt.ylabel('Amplitude')

plt.subplot(2, 1, 2)
spectrogram = torchaudio.transforms.Spectrogram()(waveform_resampled)
# Extract the first channel for visualization; the small epsilon keeps
# log2 from producing -inf on silent (zero-power) bins
spectrogram_channel1 = spectrogram[0, :, :]
plt.imshow((spectrogram_channel1 + 1e-10).log2().numpy(), aspect='auto', cmap='inferno')
plt.title('Spectrogram')
plt.xlabel('Time')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

# Perform inference (no gradients needed)
# Note: a stereo file has shape [channels, time], so each channel is treated
# as a separate batch item and produces its own transcription
with torch.no_grad():
    logits = model(waveform_resampled).logits

# Greedy CTC decoding: take the most likely token at each time step
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

# Print transcription
print("Transcription:", transcription)

Output:

Transcription: ['THE STALE SMELL OF OLD BEER LINGERS IT TAKES HEAT TO BRING OUT THE ODOUR A COLD DIP RESTORES HEALTH AND ZEST A SALT PICKLE TASTES FINE WITH HAM TAKOZAL PASTOR ARE MY FAVORITE A ZESTFUL FOOD IS THE HOT CROSS BUN', 'THE STALE SMELL OF OLD BEER LINGERS IT TAKES HEAT TO BRING OUT THE ODOR A COLD DIP RESTORES HEALTH AND ZEST A SALT PICKLE TASTES FINE WITH HAM TAKOZAL PASTOR ARE MY FAVORITE A ZESTFUL FOOD IS THE HOT CROSS BUN']

Figure 1: Waveform and spectrogram of the input audio.

The transcribed text represents the spoken words in the audio file. It is the result of passing the audio waveform through the Wav2Vec2 model and decoding the output logits into text, giving a textual representation of the audio content that can be used for further analysis or processing. Note that the sample file is stereo, so the model returns one transcription per channel. The transcription is printed to the console with print("Transcription:", transcription), making it easy to view and reuse for purposes such as text analysis, captioning, or indexing audio content.
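For reference, the Hugging Face documentation usually routes raw audio through the processor, which pads and normalizes the input before it reaches the model. Below is a minimal sketch of that pattern, reusing the model, processor, and waveform_resampled defined above and mixing the channels down to mono; it is an alternative to passing the waveform directly, not the only correct approach.

# Mix down to mono and let the processor normalize the input
audio = waveform_resampled.mean(dim=0).numpy()
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Decode as before: greedy argmax over the CTC logits
print(processor.batch_decode(torch.argmax(logits, dim=-1)))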

Overall, the output of the code provides the transcribed text of the input audio file, demonstrating the ability of the Wav2Vec2 model to perform automatic speech recognition tasks.
