Python – Downloading captions from YouTube

Python provides a large set of APIs for the developer to choose from, and many web services, including those from Google, can be reached through them. One such package, the third-party youtube-transcript-api, is simple to use and provides various features for working with YouTube captions.

In this article, we will learn how to download captions or subtitles from a YouTube video. The subtitles can be auto-generated by YouTube or added manually by the uploader. When a video has both types, we will see how to get the manual or the automatic captions specifically. We will also explore how to get captions in specific languages, how to translate captions from one language to another, and how to write the transcript to a text file.

youtube_transcript_api: This module is used for getting the captions/subtitles from a YouTube Video. It can be installed using:

pip install youtube-transcript-api # for Windows
or
pip3 install youtube-transcript-api # for Linux and macOS

Before starting, let us see how to get the video id of a YouTube video. For example, suppose a YouTube video has the following URL:

https://youtu.be/SW14tOda_kI

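Then the video id for this video would be “SW14tOda_kI”, i.e. everything after youtu.be/ in a short link. In a full link such as https://www.youtube.com/watch?v=SW14tOda_kI, the video id is the value that follows ?v=. This id is unique for each video on YouTube.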
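
The rule above can also be applied in code. The following is a minimal sketch using only the standard library; extract_video_id is a hypothetical helper written for this article, not part of youtube-transcript-api, and it only covers the two URL shapes discussed here.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the video id from a YouTube URL, or None if none is found."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/").split("/")[0] or None

    # Watch links: https://www.youtube.com/watch?v=<id>
    if host in ("youtube.com", "www.youtube.com", "m.youtube.com"):
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None

    # Any other host is not handled by this sketch.
    return None


print(extract_video_id("https://youtu.be/SW14tOda_kI"))
print(extract_video_id("https://www.youtube.com/watch?v=SW14tOda_kI"))
```

Both calls print the same id, which can then be passed to the functions used in the rest of the article.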

Getting Started

Let us start with the basics. In the first code snippet, we get the transcript for a video id using the .get_transcript() function.
It returns a list of dictionaries, each containing three key-value pairs: text holds the caption content, start holds the time in seconds at which the caption sentence/phrase begins to be spoken, and duration holds the time in seconds taken to speak it completely.
The first line imports the required package, the next statement stores the list of dictionaries in a variable, and the last statement prints that variable.
The examples in this article use the library's static functions such as .get_transcript() and .list_transcripts(). As far as we are aware, recent releases of the library replace these with instance methods, so check the documentation of the installed version if these calls are not available.

Python3

from youtube_transcript_api import YouTubeTranscriptApi

# assigning srt variable with the list
# of dictionaries obtained by the get_transcript() function
srt = YouTubeTranscriptApi.get_transcript("SW14tOda_kI")

# prints the result
print(srt)


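To get transcripts of more than one video, we can pass a list of video ids to the separate .get_transcripts() function, as in YouTubeTranscriptApi.get_transcripts([videoid1, videoid2, ….]). It returns a dictionary that maps each video id to its transcript, which is again a list of dictionaries, together with a list of the video ids whose transcripts could not be retrieved.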
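
Since each transcript is a plain list of dictionaries, it can be processed with ordinary Python. The sketch below assumes the keys are named text, start and duration; the sample list is made-up data in that shape (not real captions from the video), and format_timestamp is a hypothetical helper.

```python
def format_timestamp(seconds):
    """Convert a number of seconds to an HH:MM:SS string."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, secs)


# Hypothetical sample data in the assumed shape; not real captions.
sample = [
    {"text": "Hello and welcome", "start": 0.0, "duration": 2.5},
    {"text": "to this tutorial", "start": 2.5, "duration": 3.1},
    {"text": "Let us begin", "start": 65.2, "duration": 1.8},
]

# Print each caption next to the time at which it starts.
for entry in sample:
    print(format_timestamp(entry["start"]), entry["text"])

# Joining only the text fields gives a plain-text transcript.
full_text = " ".join(entry["text"] for entry in sample)
print(full_text)
```

In practice, the sample list would be replaced by the list returned for a real video.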

Getting Transcript of a particular Language

Now, if we want the transcript in a specific language, we can pass language codes through the languages parameter. The list is read in order of priority, so the first language that has an available transcript is used. The code works the same way as the previous example, except that this time it gets only the English transcript and ignores subtitles in other languages, if any exist.

Python3

from youtube_transcript_api import YouTubeTranscriptApi

# assigning srt variable with the list of dictionaries
# obtained by the .get_transcript() function
# and this time it gets only the subtitles that
# are in the English language.
srt = YouTubeTranscriptApi.get_transcript("SW14tOda_kI",
                                          languages=['en'])

# prints the result
print(srt)


Since the video used in this example has only English subtitles, both examples give the same result.
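
Because languages is a list, more than one code can be given, in descending order of preference. The sketch below only illustrates that priority idea on plain lists; pick_language is a hypothetical helper and not part of the library, which reports a missing transcript with an exception rather than returning None.

```python
def pick_language(available, preferred):
    """Return the first code in `preferred` that is also in `available`."""
    for code in preferred:
        if code in available:
            return code
    # No preferred language is available.
    return None


# Assumed situation from this article: only English captions exist.
available_codes = ["en"]

# German is preferred but unavailable, so English is chosen.
print(pick_language(available_codes, ["de", "en"]))

# Neither German nor French is available.
print(pick_language(available_codes, ["de", "fr"]))
```

With the library, the equivalent request would be written as languages=['de', 'en'].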

Getting List of all transcripts

Now, to get the list of all transcripts of a video, we can use the .list_transcripts() function. It returns a TranscriptList object covering every language available for the video; the object is iterable and provides methods to filter the transcripts by language and type.
Next, we read some metadata from each transcript through its properties:
- transcript.video_id returns the video ID of the video
- transcript.language returns the language of the transcript
- transcript.language_code returns the language code of the transcript, for example, “en” for English
- transcript.is_generated tells us whether the transcript was generated by YouTube (True) or manually created (False)
- transcript.is_translatable tells us whether this transcript can be translated
- transcript.translation_languages gives a list of the languages the transcript can be translated to
Then we use the .fetch() function to fetch the actual transcript.
We also show how to use the .translate() function to translate the captions from one language to another, provided the transcript is translatable (since this video has only English subtitles, the effect is not evident here, but translation is very useful when a video has transcripts in more than one language).
After the loop, the .find_transcript() function returns the transcript for the requested languages together with its metadata.
Finally, we use the .find_manually_created_transcript() function to find manually created transcripts specifically. Its counterpart, .find_generated_transcript(), is not used in this example because the video has only manual captions and no generated ones.

Python3

# importing the module
from youtube_transcript_api import YouTubeTranscriptApi

# retrieve the available transcripts
transcript_list = YouTubeTranscriptApi.list_transcripts('SW14tOda_kI')

# iterate over all available transcripts
for transcript in transcript_list:

    # the Transcript object provides metadata properties
    print(
        transcript.video_id,
        transcript.language,
        transcript.language_code,
        # whether it has been manually created or
        # generated by YouTube
        transcript.is_generated,
        # whether this transcript can be translated or not
        transcript.is_translatable,
        # a list of languages the transcript can be
        # translated to
        transcript.translation_languages,
    )

    # fetch the actual transcript data
    print(transcript.fetch())

    # translating the transcript will return another
    # transcript object; attempt it only when the
    # transcript is translatable
    if transcript.is_translatable:
        print(transcript.translate('en').fetch())

# you can also directly filter for the language you are
# looking for, using the transcript list
transcript = transcript_list.find_transcript(['en'])

# or just filter for manually created transcripts
transcript = transcript_list.find_manually_created_transcript(['en'])
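
When a language has both a manual and a generated transcript, a common choice is to prefer the manual one. The sketch below illustrates that selection rule on stand-in objects: TranscriptInfo and choose_transcript are hypothetical names written for this article, and only the language_code and is_generated fields mirror the properties listed above. It is not the library's own implementation.

```python
from dataclasses import dataclass


@dataclass
class TranscriptInfo:
    """Hypothetical stand-in for the metadata of one transcript."""
    language_code: str
    is_generated: bool


def choose_transcript(transcripts, language_codes):
    """Pick a transcript by language priority, preferring manual ones."""
    for code in language_codes:
        matches = [t for t in transcripts if t.language_code == code]
        # list.sort() is stable and False sorts before True, so manually
        # created transcripts (is_generated=False) come first.
        matches.sort(key=lambda t: t.is_generated)
        if matches:
            return matches[0]
    # None of the requested languages is available.
    return None


# Made-up metadata for illustration only.
available = [
    TranscriptInfo("en", is_generated=True),
    TranscriptInfo("en", is_generated=False),
    TranscriptInfo("de", is_generated=True),
]

print(choose_transcript(available, ["en"]))
print(choose_transcript(available, ["fr", "de"]))
```

The first call selects the manual English entry; the second falls back to the generated German entry because no French transcript exists in the sample data.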


Writing Subtitles to Text File

Now we will see how to write the subtitles of a YouTube video to a text file. First, we import the module, get the transcript using the .get_transcript() function, and store it in a variable. Then we use Python's built-in open() function as a context manager, so that we need not worry about closing the file after our work is done. We open a file named subtitles.txt in write mode, iterate through each element of the list, and write it to the file. The code is as follows:

Python3

# importing modules
from youtube_transcript_api import YouTubeTranscriptApi

# using the srt variable with the list of dictionaries
# obtained by the .get_transcript() function
srt = YouTubeTranscriptApi.get_transcript("SW14tOda_kI")

# creating or overwriting a file "subtitles.txt" with
# the info inside the context manager; the explicit
# UTF-8 encoding lets non-ASCII caption characters be written
with open("subtitles.txt", "w", encoding="utf-8") as f:

    # iterating through each element of list srt
    for i in srt:

        # writing each element of srt on a new line
        f.write("{}\n".format(i))


If you give only a file name, the file is created in the current working directory, which is usually the directory of the .py file; to save it at a different location, give its absolute or relative path. Also, if the file is opened without an explicit encoding, the platform's default encoding is used, and captions containing characters it cannot represent raise a UnicodeEncodeError; passing encoding="utf-8" to open() avoids this. When such an error occurs, the file is still created but holds only part of the captions.
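
As one possible variation, the same list of dictionaries can be written as a numbered, timestamped subtitle file in the SubRip (.srt) layout. This is a sketch under assumptions: the keys are taken to be text, start and duration, the sample list is made-up data, srt_timestamp and to_srt are hypothetical helpers, and the file is written to the system's temporary directory so that the example does not overwrite anything.

```python
import os
import tempfile


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return "{:02d}:{:02d}:{:02d},{:03d}".format(hours, minutes, secs, millis)


def to_srt(entries):
    """Build SRT-formatted text from transcript entries."""
    blocks = []
    for index, entry in enumerate(entries, start=1):
        start = entry["start"]
        end = start + entry["duration"]
        blocks.append("{}\n{} --> {}\n{}\n".format(
            index, srt_timestamp(start), srt_timestamp(end), entry["text"]))
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


# Hypothetical sample data in the assumed shape; not real captions.
sample = [
    {"text": "Hello and welcome", "start": 0.0, "duration": 2.5},
    {"text": "Café tour ♪", "start": 2.5, "duration": 3.0},
]

# Writing with an explicit UTF-8 encoding keeps non-ASCII characters intact.
path = os.path.join(tempfile.gettempdir(), "subtitles_example.srt")
with open(path, "w", encoding="utf-8") as f:
    f.write(to_srt(sample))
```

To use it on a real transcript, pass the list obtained earlier to to_srt() and choose the output path you need.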
