Speech-to-speech translation

Last Updated : 17 Nov, 2023

Speech-to-speech translation converts spoken language from one language to another, ideally on the fly. In contrast to workflows that stop at transcription and text translation, it delivers spoken output in the target language, allowing for seamless communication across different languages. This article walks through a step-by-step implementation of a speech-to-speech translation mechanism from English to Hindi.

Speech-to-Speech Translation

Speech-to-speech translation, also called simultaneous or real-time translation, is the process of converting spoken language from one language to another as it is spoken. Unlike workflows where speech is transcribed and the transcript is translated separately afterwards, it produces speech in the target language, which maintains the flow of conversation across different languages. In this article, we will implement a speech-to-speech translation mechanism (English to Hindi) in simple steps.

How speech-to-speech translation works

Speech-to-speech translation works in four simple steps, which are discussed below:

Speech Recognition: The first step converts the spoken words in the source language into text using a speech recognition system. This involves identifying and transcribing the spoken words accurately.
Natural Language Processing (NLP): In this step, the transcribed text is processed using natural language processing techniques, which involve language analysis, understanding context, and identifying the relevant meaning of words and phrases.
Machine Translation: The processed text is translated into the target language using machine translation models. These models use techniques such as statistical methods, neural networks, or transformer-based architectures. Running large models locally can be costly in time and memory, which makes them harder to use in real-time applications. For that reason, this article relies on hosted services instead: Google's speech recognition API for transcription and the translate package for translation, both of which are free to try and easy to set up.
Speech Synthesis: Finally, the translated text is converted back into spoken words in the target language through speech synthesis techniques, which allow the listener to hear the translated speech.

Model architecture

The model architecture of speech-to-speech translation is discussed below:

Speech Recognition: Utilizes acoustic and language models to transcribe spoken words of the input audio into text.
Language Understanding: Then it extracts the context and meaning from the transcribed text.
Translation: It employs machine translation techniques to convert the text from the source to the target language.
Speech Synthesis: Finally it utilizes text-to-speech methods to generate spoken words in the target language.
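
The components above can be sketched as a chain of plain functions. This is only an illustration of the data flow, not the implementation used later in the article: make_pipeline and the three fake_* stages are hypothetical names, and language understanding is assumed to be folded into the translation stage.

```python
from typing import Callable

def make_pipeline(recognize: Callable[[bytes], str],
                  translate: Callable[[str], str],
                  synthesize: Callable[[str], bytes]) -> Callable[[bytes], bytes]:
    """Chain the stages: source audio -> text -> translated text -> audio."""
    def run(audio: bytes) -> bytes:
        source_text = recognize(audio)
        target_text = translate(source_text)
        return synthesize(target_text)
    return run


# Hypothetical stub stages so the sketch runs without any model or network.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")

def fake_translate(text: str) -> str:
    # Tiny lookup table standing in for a real translation model
    return {"hello": "नमस्ते"}.get(text, text)

def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = make_pipeline(fake_recognize, fake_translate, fake_synthesize)
print(pipeline(b"hello").decode("utf-8"))
```

Keeping each stage behind a simple function signature makes it possible to swap one service for another without touching the rest of the chain.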

Step-by-step implementation

Installing required modules

First, we need to install all the required modules in our runtime. The leading ! runs each command from a notebook cell; drop it when installing from a terminal.

!pip install gTTS
!pip install SpeechRecognition
!pip install pydub
!pip install translate
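
Before moving on, it can help to confirm that the installation worked. The following is a minimal sketch using only the standard library; missing_modules and has_ffmpeg are hypothetical helper names. It also checks for an ffmpeg (or avconv) binary, since pydub depends on one to decode MP3 files.

```python
import importlib.util
import shutil

# Import names differ from some pip package names
# (SpeechRecognition installs speech_recognition, gTTS installs gtts).
REQUIRED_MODULES = ["gtts", "speech_recognition", "pydub", "translate"]

def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

def has_ffmpeg():
    """Report whether an ffmpeg or avconv binary is on the PATH."""
    return shutil.which("ffmpeg") is not None or shutil.which("avconv") is not None

print("Missing modules:", missing_modules(REQUIRED_MODULES))
print("ffmpeg available:", has_ffmpeg())
```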

Importing required libraries

Now we will import the required Python libraries: AudioSegment from pydub, speech_recognition, Translator, and gTTS.

Python3

from pydub import AudioSegment
import speech_recognition as sr
from translate import Translator
from gtts import gTTS

Driver functions

Since speech-to-speech translation involves multiple steps, we need to define a driver function for each step.

Format interchange: Input audio for speech translation is commonly an MP3 file, but the speech_recognition module's AudioFile reader does not accept MP3. To change it to a supported format (.wav), we will define a small function (convert_mp3_to_wav) for format interchange.

Python3

def convert_mp3_to_wav(input_mp3, output_wav):
    audio = AudioSegment.from_mp3(input_mp3)
    audio.export(output_wav, format="wav")
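
If some inputs are already WAV files, the conversion can be skipped. The helpers below are a small sketch of that decision based on the file extension alone; wav_path_for and needs_conversion are hypothetical names, not part of pydub.

```python
from pathlib import Path

def wav_path_for(input_path):
    """Return the input path with its extension replaced by .wav."""
    return str(Path(input_path).with_suffix(".wav"))

def needs_conversion(input_path):
    """True when the file extension is anything other than .wav (case-insensitive)."""
    return Path(input_path).suffix.lower() != ".wav"

print(wav_path_for("sample.mp3"))
print(needs_conversion("sample.WAV"))
```

A caller could then run convert_mp3_to_wav(path, wav_path_for(path)) only when needs_conversion(path) is true.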

Speech recognition: Next, our task is to extract the text spoken in the input file. We will define a function (recognize_speech) that uses Google's speech recognition API. It returns the speech recognized from the input audio and handles the most common exceptions that can occur.

Python3

def recognize_speech(audio_path):
    recognizer = sr.Recognizer()
    # Read the whole WAV file into memory
    with sr.AudioFile(audio_path) as source:
        audio_data = recognizer.record(source)
    try:
        # Send the audio to Google's speech recognition API (default language: en-US)
        return recognizer.recognize_google(audio_data)
    except sr.UnknownValueError:
        print("Speech recognition could not understand the audio")
    except sr.RequestError as e:
        print(f"Could not request results from Google's Speech Recognition API; {e}")
    # Return None on failure so that an error message is never translated as speech
    return None

Translation: Now we will define a function (translate_text) which will just translate the recognized text to the desired language.

Python3

def translate_text(text, target_language):
    if text is not None:
        translator = Translator(to_lang=target_language)
        translation = translator.translate(text)
        return translation
    else:
        return "No text to translate"
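
Free translation backends often cap the length of a single request, so long transcripts may need to be translated in pieces. The sketch below splits text on word boundaries and translates each piece; the 500-character default is an assumed limit, and split_into_chunks and translate_long_text are hypothetical helpers rather than part of the translate package.

```python
def split_into_chunks(text, max_chars=500):
    """Split text on whitespace into pieces no longer than max_chars.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

def translate_long_text(text, translate_fn, max_chars=500):
    """Translate each chunk with translate_fn and join the results with spaces."""
    return " ".join(translate_fn(chunk) for chunk in split_into_chunks(text, max_chars))

print(split_into_chunks("open the windows and the fresh air will come in", max_chars=20))
```

With the function defined above, translate_fn could be something like lambda chunk: translate_text(chunk, 'hi'). Splitting on word boundaries can still cut a sentence in half, which may reduce translation quality at the seams.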

Speech generation: Next, we generate an audio file from the translated text. We will define a small function (convert_text_to_speech) that generates the output audio file and saves it to the runtime.

Python3

def convert_text_to_speech(text, lang_code, output_path):
    if text != "No text to translate":
        tts = gTTS(text=text, lang=lang_code)
        tts.save(output_path)

Pipeline function: Now we will create a pipeline (speech_to_speech_pipeline) that controls the whole process by calling the driver functions one after another. Here we set the target language to Hindi (language code 'hi') so that we can perform English-to-Hindi speech translation.

Python3

def speech_to_speech_pipeline(input_mp3, output_mp3, target_language='hi'):
    # Step 1: Convert MP3 to WAV
    wav_file = "temp_speech.wav"
    convert_mp3_to_wav(input_mp3, wav_file)

    # Step 2: Recognize Speech
    recognized_text = recognize_speech(wav_file)
    if recognized_text is None:
        # Nothing was recognized, so there is nothing to translate or synthesize
        return None
    print("Recognized Speech:")
    print(recognized_text)

    # Step 3: Translate Recognized Text
    translated_text = translate_text(recognized_text, target_language)
    print("Translated Text:")
    print(translated_text)

    # Step 4: Convert Translated Text to Speech
    convert_text_to_speech(translated_text, target_language, output_mp3)
    audio = AudioSegment.from_mp3(output_mp3)
    return audio
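
The pipeline writes its intermediate WAV file to the working directory and leaves it there. One possible alternative, sketched here with the standard library only, is to place the file in a temporary directory that is deleted automatically; temporary_wav_path is a hypothetical helper, and the write below is a stand-in for the real conversion step.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def temporary_wav_path(name="speech.wav"):
    """Yield a path inside a fresh temporary directory, removed on exit."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        yield os.path.join(tmp_dir, name)

# Usage: write the intermediate WAV here instead of the working directory.
with temporary_wav_path() as wav_file:
    with open(wav_file, "wb") as handle:  # stand-in for convert_mp3_to_wav
        handle.write(b"RIFF")
    print(os.path.exists(wav_file))  # file exists inside the block

print(os.path.exists(wav_file))  # directory and file are gone afterwards
```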

Pipeline for translation

Now we will call the pipeline function (speech_to_speech_pipeline). It takes the input audio file, recognizes the text in it, translates that text to the desired language, and finally saves the translated audio file to the runtime so that we can hear it. Here we have used a sample audio file, but you can replace it with your own.

Python3

# Pipeline for speech translation
input_audio_file = "707907__jenajiejing__value1.mp3"  # replace it with your input file
output_audio_file = "translated_speech.mp3"
speech_to_speech_pipeline(input_audio_file, output_audio_file, target_language='hi')

Output:

Recognized Speech:
open the windows the song nights and the fresh air will come in take a Broadview you will see the foreign mountains fresh and green stay and start in life changing a new life will begin your spirituality will grow and you're thinking will transcend that arise from Human Nature and ethics
Translated Text:
खिड़कियां खोलें गीत की रातें और ताजी हवा आएगी एक ब्रॉडव्यू लें आप विदेशी पहाड़ों को ताज़ा और हरा - भरा प्रवास देखेंगे और जीवन में शुरुआत करेंगे एक नया जीवन बदलना शुरू होगा आपकी आध्यात्मिकता बढ़ेगी और आप सोच रहे होंगे कि मानव प्रकृति और नैतिकता से उत्पन्न होने वाले पार हो जाएंगे

Generated Audio

The generated audio is saved as translated_speech.mp3 in the runtime, where you can listen to it.

Conclusion

Speech-to-speech translation involves several steps, but performing them sequentially keeps the process manageable. This article introduced a pipeline function, speech_to_speech_pipeline, that orchestrates recognition, translation, and speech synthesis for English-to-Hindi speech translation. Speech-to-speech translation is important for many real-time applications.
