Preprocessing the Audio Dataset

Audio preprocessing is a critical step in the pipeline of audio data analysis and machine learning applications. It involves a series of techniques applied to raw audio data to enhance its quality, extract meaningful features, and prepare it for further analysis or input into machine learning models. Effective preprocessing can significantly impact the performance and accuracy of models trained on audio data, making it an essential aspect of audio signal processing. In this article, we will discuss and implement various audio preprocessing techniques.

Why perform preprocessing on audio datasets?

There are various reasons to perform audio data preprocessing, which are listed below:

  1. Noise Reduction: Audio data collected from real-world environments often contains background noise, interference, or artefacts. Preprocessing methods like filtering and denoising can help to remove unwanted noise, ensuring that the model focuses on the relevant signal.
  2. Standardization of Formats: Audio datasets may come in various formats, sample rates, or resolutions. Preprocessing ensures standardization, making it easier to work with diverse datasets and preventing inconsistencies that can affect model performance.
  3. Feature Extraction: Audio signals are complex and high-dimensional. Preprocessing extracts relevant features from the raw data, like spectral characteristics, Mel-frequency cepstral coefficients (MFCCs), or chroma features. These features provide a more compact representation of the audio, preserving essential information for analysis.
  4. Resampling: Standardizing the sample rate of audio signals through resampling is common during preprocessing. This step can make the data more computationally efficient and compatible with models that require a specific sample rate.
  5. Normalization: Scaling the amplitude of audio signals ensures that the model is not biased toward signals with higher or lower energy levels. Normalization helps in maintaining consistent signal magnitudes across the dataset (a minimal sketch follows this list).
  6. Handling Variable Lengths: Audio clips in a dataset may have varying lengths. Preprocessing often involves segmenting or padding the signals to achieve a uniform length, ensuring compatibility with models that require fixed-length inputs.
  7. Model Efficiency: Well-preprocessed data can lead to more efficient training and inference processes. It reduces computational load, accelerates convergence during training, and enhances the model’s ability to generalize to new, unseen data.
  8. Improved Model Performance: Preprocessing enhances the signal-to-noise ratio, which emphasizes relevant features and ensures that the model is provided with high-quality inputs. This leads to improved model performance, accuracy, and robustness.
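
To make the normalization step concrete, here is a minimal, hedged sketch of peak normalization with NumPy (the helper name normalize_audio is illustrative and not part of the original pipeline; it assumes NumPy is imported as np, as done later in this article):

# Illustrative peak normalization: scale the waveform so its maximum absolute amplitude is 1.0
def normalize_audio(y, eps=1e-9):
    return y / (np.max(np.abs(y)) + eps)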

Step-by-step implementation

Installing required module

First, we need to install all the required Python modules in our runtime.

!pip install gdown
!pip install librosa

Importing required libraries

Now we will import all the required Python libraries: NumPy, SciPy, Librosa, Matplotlib, and OS.

import librosa
import librosa.display  # explicit import so librosa.display.specshow is available later
from scipy.signal import butter, filtfilt
import numpy as np
import os
import matplotlib.pyplot as plt

Dataset loading

Now we will load a small audio dataset for the implementations that follow. It can be loaded directly into the runtime with the following code.

file_id = '1lNUGw8VMXvY2Yu6aITYlOCNaj8y-KbNB'
 
# Download the dataset
!gdown --id $file_id -O dataset.zip
 
# Unzip the dataset
!unzip -q dataset.zip -d /content/

Resampling

Resampling is the process of changing the sample rate of an audio signal: adjusting the number of samples per second while preserving the original content's perceptual characteristics. It is commonly done to standardize audio data to a specific sample rate, which makes the data compatible with models or systems that require a uniform sampling frequency. Resampling helps mitigate issues caused by mismatched sample rates and improves the computational efficiency of subsequent processing steps. Here we will define a small function (resample_audio) to resample the audio files to a specified, constant sampling rate.

# Load a sample audio file
sample_audio_path = '/content/barbie_vs_puppy/barbie/barbie_4.wav'
 
# Resample the audio (librosa.load resamples to target_sr while loading)
def resample_audio(audio_path, target_sr=16000):
    y, sr = librosa.load(audio_path, sr=target_sr)
    return y, sr

resampled_audio, sr = resample_audio(sample_audio_path)
print(f"Sample rate after Resampling: {sr}")

Output:

Sample rate after Resampling: 16000

So, we have resampled one audio file of the dataset to our desired sampling rate. We can now simply iterate over the dataset to resample all the files, which we will do later in this article.
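
If a signal is already loaded at its native rate, it can also be resampled in memory instead of being reloaded from disk. A minimal sketch using librosa.resample (keyword arguments are required in librosa 0.10+):

# Load at the native sample rate, then resample the array in memory
y_native, native_sr = librosa.load(sample_audio_path, sr=None)
y_16k = librosa.resample(y_native, orig_sr=native_sr, target_sr=16000)
print(f"Native rate: {native_sr}, resampled length: {len(y_16k)}")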

Filtering

One of the most important steps in audio data preprocessing is filtering, which typically involves applying signal-processing filters such as low-pass, high-pass, or band-pass filters. Filtering modifies the frequency content of an audio signal by attenuating or emphasizing certain frequency components, for example high-frequency background noise that can cause problems when the data is fed to a model. In most cases, low-pass filtering is applied to audio data to remove high-frequency noise, ensuring that the model focuses on the relevant signal information. Here we will define a small function (butter_lowpass_filter) to remove background noise.

# Apply filtering
def butter_lowpass_filter(data, cutoff_freq, sample_rate, order=4):
    nyquist = 0.5 * sample_rate
    normal_cutoff = cutoff_freq / nyquist
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    filtered_data = filtfilt(b, a, data)
    print(f"Filtered audio shape: {filtered_data.shape}")
    return filtered_data
 
filtered_audio = butter_lowpass_filter(resampled_audio, cutoff_freq=4000, sample_rate=sr)

Output:

Filtered audio shape: (31403,)
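
The same SciPy helpers support other filter types. As a hedged illustration (this helper is not part of the pipeline below), a band-pass variant passes a pair of normalized cutoffs with btype='band':

# Illustrative band-pass variant: keep only frequencies between the two cutoffs
def butter_bandpass_filter(data, low_cutoff, high_cutoff, sample_rate, order=4):
    nyquist = 0.5 * sample_rate
    b, a = butter(order, [low_cutoff / nyquist, high_cutoff / nyquist], btype='band', analog=False)
    return filtfilt(b, a, data)

# Example: the classic 300-3400 Hz telephone speech band
bandpassed_audio = butter_bandpass_filter(resampled_audio, low_cutoff=300, high_cutoff=3400, sample_rate=sr)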

As we can see, the filtering succeeded, but the resulting shape varies from file to file and is not yet suitable for model input. Next, we will see how to convert it to the model's expected input shape.

Converting audio data to model’s expected input

Converting audio data to the model's expected input involves shaping the raw audio signal into a format suitable for feeding into a machine learning model. This often includes operations such as trimming, padding, or extracting specific features from the audio data. The goal is to create a standardized representation that aligns with the input requirements of the model architecture, ensuring that the model can effectively process and learn from the audio data during training or make accurate predictions during inference. Here we will define a small function (convert_to_model_input) that trims or pads the audio files to a fixed target length, here 16000 samples, a common input size for model feeding. You can change this value in the code as per your requirements.

# Convert audio data to the model’s expected input
def convert_to_model_input(y, target_length):
    if len(y) < target_length:
        # Pad short signals with trailing zeros
        y = np.pad(y, (0, target_length - len(y)))
    else:
        # Trim long signals to the target length
        y = y[:target_length]
    return y
 
model_input = convert_to_model_input(filtered_audio, target_length=16000)
print(f"Model input shape: {model_input.shape}")

Output:

Model input shape: (16000,)
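
librosa also ships an equivalent pad-or-trim utility, librosa.util.fix_length, which pads or truncates along the last axis. A minimal sketch of the same operation:

# Equivalent pad-or-trim using librosa's built-in utility
model_input_alt = librosa.util.fix_length(filtered_audio, size=16000)
print(model_input_alt.shape)  # (16000,)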

Audio Data Streaming

So far we have performed resampling, filtering, and reshaping of the audio data to the model's expected input shape. However, applying these steps to audio files one by one is impractical when the dataset is large, and we need all the audio files in the same form for further tasks. This is where audio data streaming comes in: processing audio data in a sequential, batched, or real-time manner. Instead of loading the entire dataset into memory at once, streaming lets the model process data in chunks or batches, which keeps memory usage efficient and enables handling of datasets that do not fit entirely into memory. This approach is crucial for training models on extensive audio datasets or when dealing with continuous audio streams in real-time applications. Here we will define a function (stream_audio_dataset) that calls the three previous functions for resampling, filtering, and conversion, processing all the audio files in the dataset batch-wise. The batch size should be chosen based on the dataset size and available hardware resources.

The generator function stream_audio_dataset loads and processes the audio files from a specified dataset path in batches.

def stream_audio_dataset(dataset_path, batch_size=32, target_length=16000, target_sr=None):
    # Get all .wav file paths under the dataset path
    audio_files = [os.path.join(root, file)
                   for root, dirs, files in os.walk(dataset_path)
                   for file in files if file.endswith('.wav')]
 
    # Shuffle the audio files for randomness
    np.random.shuffle(audio_files)
 
    # Yield batches of audio data
    for i in range(0, len(audio_files), batch_size):
        batch_paths = audio_files[i:i + batch_size]
        batch_data = []
 
        for file_path in batch_paths:
            # Load each audio file (librosa.load already resamples when target_sr is given)
            y, sr = librosa.load(file_path, sr=target_sr)

            # Resample explicitly if the rate still differs (keyword arguments required in librosa 0.10+)
            if target_sr is not None and sr != target_sr:
                y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)
                sr = target_sr
 
            filtered_audio = butter_lowpass_filter(y, cutoff_freq=4000, sample_rate=sr)
            model_input = convert_to_model_input(filtered_audio, target_length=target_length)
            batch_data.append(model_input)
 
        yield np.array(batch_data)
 
# Load the dataset folder
dataset_path = '/content/barbie_vs_puppy/barbie'
 
for batch_data in stream_audio_dataset(dataset_path, batch_size=2, target_sr=16000):
    # Process each batch of audio data
    print(f"Processing batch with {len(batch_data)} files")
 
    # Print the shape of the first file in the batch
    print(f"Shape of the first file: {batch_data[0].shape}")

Output:

Filtered audio shape: (31403,)
Filtered audio shape: (40278,)
Processing batch with 2 files
Shape of the first file: (16000,)
Filtered audio shape: (31137,)
Filtered audio shape: (31360,)
Processing batch with 2 files
Shape of the first file: (16000,)
Filtered audio shape: (36088,)
Filtered audio shape: (27596,)
Processing batch with 2 files
Shape of the first file: (16000,)
Filtered audio shape: (25600,)
Filtered audio shape: (24333,)
Processing batch with 2 files
Shape of the first file: (16000,)

So, from this output we can see that the inconsistent shapes produced by filtering and resampling are processed batch-wise, and each audio file in a batch ends up with a consistent shape of (16000,). This indicates that the dataset is now well prepared for more complex downstream tasks.
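
If the whole preprocessed dataset is needed as one array, for example to feed a model that trains on the full set at once, the generator's batches can be concatenated. A minimal sketch, assuming the preprocessed data fits in memory:

# Collect all preprocessed batches into a single array of shape (num_files, 16000)
X = np.concatenate(list(stream_audio_dataset(dataset_path, batch_size=2, target_sr=16000)), axis=0)
print(X.shape)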

Log-Mel Spectrogram

A log-mel spectrogram is a representation of the frequency content of an audio signal over time, widely used in audio signal processing and machine learning applications. It is derived from the Mel spectrogram, which divides the audio spectrum into perceptually relevant frequency bands called Mel bins. The log-mel spectrogram takes the logarithm of the power values, which better matches human loudness perception and aids the extraction of meaningful features. This representation is particularly valuable for tasks like speech recognition, sound classification, and music analysis, where it captures essential acoustic characteristics and patterns in the audio signal. The resulting log-mel spectrogram provides a compact and informative representation that is commonly used as input to deep learning models for audio-based tasks. Here we will define a small function (compute_logmel_spectrogram) to generate log-mel spectrograms for this dataset's audio signals.

# Compute the log-mel spectrogram
def compute_logmel_spectrogram(y, sr, n_mels=128, hop_length=512):
    mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    logmel_spectrogram = librosa.power_to_db(mel_spectrogram, ref=np.max)
    return logmel_spectrogram
 
# Path to the sample audio file and the target sample rate
audio_file_path = '/content/barbie_vs_puppy/barbie/barbie_4.wav'
target_sr = 16000
 
# Load the audio file
y, sr = librosa.load(audio_file_path, sr=target_sr)
 
# Compute log-mel spectrogram
logmel_spectrogram = compute_logmel_spectrogram(y, sr=sr)
 
# Display the log-mel spectrogram
plt.figure(figsize=(8, 4))
librosa.display.specshow(logmel_spectrogram, sr=sr, hop_length=512, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Log-Mel Spectrogram')
plt.show()

Output:

[Figure: log-mel spectrogram of the sample audio clip]
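
The MFCC features mentioned at the start of this article follow the same pattern. As a brief, hedged addition, a minimal MFCC extraction for the same clip:

# Extract 13 MFCCs from the loaded clip; the result has shape (n_mfcc, num_frames)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(f"MFCC shape: {mfccs.shape}")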

Conclusion

We can conclude that audio data preprocessing is a very important task that may involve numerous steps, but it is essential for preparing a dataset for downstream and real-time tasks.

