
Python – Downloading captions from YouTube

Python gives developers a large set of APIs and libraries to choose from, and most Google services have an associated API. For YouTube captions, the third-party YouTube Transcript API (the youtube-transcript-api package) is very simple to use and provides various features.

In this article, we will learn how to download captions or subtitles from a YouTube video. The subtitles can be auto-generated by YouTube or added manually by the uploader. If a video has both types, we will look at how to get specifically the manual or the automatic captions. We will also explore how to get captions in specific languages and translate captions from one language to another. Finally, we will see how to write the transcript to a text file.



youtube_transcript_api: This module is used for getting the captions/subtitles from a YouTube Video. It can be installed using:

pip install youtube-transcript-api # for Windows
or
pip3 install youtube-transcript-api # for Linux and macOS

Before starting, let us see how to get the video id of a YouTube video. For example, suppose a YouTube video has the following URL:



https://youtu.be/SW14tOda_kI

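Then the video id for this video is “SW14tOda_kI”, i.e. everything after youtu.be/ in a short link. In a full link such as https://www.youtube.com/watch?v=SW14tOda_kI, the video id is the value of the v parameter (the part after ?v=). It is unique for each video on YouTube.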
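The video id can also be pulled out of a URL in code. The following is a minimal sketch using only Python's standard library; the function name extract_video_id is our own (hypothetical) and it only covers the youtu.be short link and the watch?v= form, not every URL style YouTube uses.

```python
from urllib.parse import urlparse, parse_qs

# Hosts handled by this sketch (an assumption, not an exhaustive list).
WATCH_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url):
    """Return the video id from a YouTube URL, or None if none is found."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/").split("/")[0] or None

    # Watch links: https://www.youtube.com/watch?v=<id>
    if host in WATCH_HOSTS:
        values = parse_qs(parsed.query).get("v")
        if values:
            return values[0]

    return None


print(extract_video_id("https://youtu.be/SW14tOda_kI"))
print(extract_video_id("https://www.youtube.com/watch?v=SW14tOda_kI"))
```

Both calls print the same id, which can then be passed to the transcript functions used in the rest of the article.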

Getting Started




from youtube_transcript_api import YouTubeTranscriptApi
  
# assigning srt variable with the list
# of dictionaries obtained by the get_transcript() function
srt = YouTubeTranscriptApi.get_transcript("SW14tOda_kI")
  
# prints the result
print(srt)

 
 

Output: a list of dictionaries, one per caption line, each with the keys text, start and duration.

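To get the transcripts of more than one video, we do not pass several ids to get_transcript(); its second positional argument is the list of languages. Instead, the same static API provides YouTubeTranscriptApi.get_transcripts([videoid1, videoid2, ….]), which takes a list of video ids. It returns a dictionary that maps each video id to its list of caption dictionaries, together with a list of the ids that could not be retrieved.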
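Once transcripts for several videos are available, they are usually easier to handle as plain text. The sketch below assumes the multi-video result is a mapping from video id to that video's list of caption dictionaries, each with a text key; the function name transcripts_to_text and the sample data are made up for illustration and no request is sent to YouTube.

```python
def transcripts_to_text(transcripts_by_video):
    """Join the 'text' field of every caption into one string per video."""
    return {
        video_id: " ".join(entry["text"] for entry in entries)
        for video_id, entries in transcripts_by_video.items()
    }


# Made-up sample data in the assumed shape (not real captions).
sample = {
    "video_one": [
        {"text": "hello", "start": 0.0, "duration": 1.5},
        {"text": "world", "start": 1.5, "duration": 2.0},
    ],
    "video_two": [
        {"text": "second video", "start": 0.0, "duration": 3.0},
    ],
}

for video_id, text in transcripts_to_text(sample).items():
    print(video_id, "->", text)
```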

 

Getting Transcript of a particular Language

 

To get the transcript in a specific language, we can pass the language as a parameter. The code works the same way as the previous example, except that this time it fetches only the English transcript and ignores subtitles in any other language, if they exist.

 




from youtube_transcript_api import YouTubeTranscriptApi
 
# assigning srt variable with the list of dictionaries
# obtained by the .get_transcript() function
# and this time it gets only the subtitles that
# are of english language.
srt = YouTubeTranscriptApi.get_transcript("SW14tOda_kI",
                                          languages=['en'])
 
# prints the result
print(srt)

Output: because the video used in this example has only English subtitles, both examples return the same result.
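According to the library's README, the languages argument is a list of language codes in descending priority: the first code that has a transcript wins. The helper below only illustrates that idea in plain Python; pick_language is a hypothetical name and this is not the library's own code.

```python
def pick_language(available_codes, preferred_codes):
    """Return the first preferred code that is available, else None."""
    available = set(available_codes)
    for code in preferred_codes:
        if code in available:
            return code
    return None


# German is preferred but missing, so English is chosen.
print(pick_language(["en", "fr"], ["de", "en"]))  # en
```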

Getting List of all transcripts




# importing the module
from youtube_transcript_api import YouTubeTranscriptApi
 
# retrieve the available transcripts
transcript_list = YouTubeTranscriptApi.list_transcripts('SW14tOda_kI')
 
# iterate over all available transcripts
for transcript in transcript_list:
 
    # the Transcript object provides metadata
    # properties
    print(
        transcript.video_id,
        transcript.language,
        transcript.language_code,
       
        # whether it has been manually created or
        # generated by YouTube
        transcript.is_generated,
         
        # whether this transcript can be translated
        # or not
        transcript.is_translatable,
         
        # a list of languages the transcript can be
        # translated to
        transcript.translation_languages,
    )
 
    # fetch the actual transcript data
    print(transcript.fetch())
 
    # translating the transcript will return another
    # transcript object
    print(transcript.translate('en').fetch())
 
# you can also directly filter for the language you are
# looking for, using the transcript list
transcript = transcript_list.find_transcript(['en'])
 
# or just filter for manually created transcripts
transcript = transcript_list.find_manually_created_transcript(['en'])

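Output: for each available transcript, the metadata values passed to print(), followed by the fetched transcript and its English translation.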
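When a video has both manual and automatic captions, a common choice is to prefer the manual ones and fall back to the auto-generated ones. The library offers find_manually_created_transcript() and find_generated_transcript() for picking one kind explicitly; the sketch below shows the fallback logic itself, using only the language_code and is_generated attributes seen above. The name prefer_manual and the stand-in objects are assumptions made for illustration, so the code runs without contacting YouTube.

```python
from types import SimpleNamespace


def prefer_manual(transcripts, language_code):
    """Pick a manually created transcript for the language if one exists,
    otherwise an auto-generated one, otherwise None."""
    matches = [t for t in transcripts if t.language_code == language_code]
    manual = [t for t in matches if not t.is_generated]
    if manual:
        return manual[0]
    return matches[0] if matches else None


# Stand-in objects with the two attributes used above (not real transcripts).
fake_list = [
    SimpleNamespace(language_code="en", is_generated=True),
    SimpleNamespace(language_code="en", is_generated=False),
    SimpleNamespace(language_code="hi", is_generated=True),
]

chosen = prefer_manual(fake_list, "en")
print(chosen.is_generated)
```

With a real transcript list, the same function could be called as prefer_manual(transcript_list, 'en'), since iterating over the list yields objects with these attributes.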

Writing Subtitles to Text File

Now we will see how to write the subtitles of a YouTube video to a text file. First, we import the module, get the transcript using the .get_transcript() function and store it in a variable. Then we use Python's built-in open() function. The with statement acts as a context manager, so we need not worry about closing the file after our work is done. We open a file named subtitles.txt in write mode, iterate through each element of the list and write it to the file. The code is as follows:




# importing modules
from youtube_transcript_api import YouTubeTranscriptApi
 
# using the srt variable with the list of dictionaries
# obtained by the .get_transcript() function
srt = YouTubeTranscriptApi.get_transcript("SW14tOda_kI")
 
# creating or overwriting a file "subtitles.txt" with
# the info inside the context manager
with open("subtitles.txt", "w") as f:

    # iterating through each element of list srt
    for i in srt:
        # writing each element of srt on a new line
        f.write("{}\n".format(i))

Output: a file named subtitles.txt containing one caption dictionary per line.

If you give only a file name to open(), the file is created in the current working directory (usually the directory the script is run from); to save it at a different location, give its absolute or relative path. The program can also raise an encoding error for caption characters that the platform's default encoding cannot represent. In that case the file is still created, but only with the captions that could be written; passing encoding="utf-8" to open() avoids the problem.
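If only the caption text is needed, rather than the whole dictionary, each entry's text value can be written on its own line. The sketch below assumes entries shaped like those returned by get_transcript() (a text key per dictionary) and opens the file with an explicit UTF-8 encoding so non-ASCII captions are written as-is; write_caption_text, the file name and the sample entries are made up for illustration.

```python
def write_caption_text(entries, path):
    """Write one caption per line, using UTF-8 so non-ASCII text is kept."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(entry["text"] + "\n")


# Made-up entries in the assumed shape (not real captions).
sample_entries = [
    {"text": "namaste दुनिया", "start": 0.0, "duration": 1.2},
    {"text": "café", "start": 1.2, "duration": 2.0},
]

write_caption_text(sample_entries, "subtitles_text.txt")
```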

