Python – Downloading captions from YouTube
Python developers have a large set of APIs to choose from, and most Google services, YouTube included, can be reached through one. youtube-transcript-api is a third-party Python package for YouTube captions that is simple to use and provides several useful features.
In this article, we will learn how to download captions or subtitles from a YouTube video. The subtitles can be auto-generated by YouTube or added manually by the video's creator. If both types are available, we will see how to get specifically the manual or the automatic captions. We will also explore how to get captions in specific languages and how to translate captions from one language to another. Finally, we will see how to write the transcript to a text file.
youtube_transcript_api: This module is used for getting the captions/subtitles from a YouTube Video. It can be installed using:
pip install youtube-transcript-api     # for Windows
pip3 install youtube-transcript-api    # for Linux and macOS
Before starting, let us see how to get the video ID of a YouTube video. For example, if a video's URL ends in ?v=SW14tOda_kI, then its video ID is “SW14tOda_kI”, i.e. the value that follows ?v= (up to the next &, if the URL carries further parameters). This ID is unique for each video on YouTube.
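The rule above can be sketched with the standard library alone. The helper name `extract_video_id` and the sample URL are illustrative assumptions; the URL is simply a standard watch URL built around the ID quoted above, with an extra parameter added to show that only the `v` value is returned.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the value of the `v` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None


# Assumed URL shape: a standard watch URL built around the ID quoted above.
url = "https://www.youtube.com/watch?v=SW14tOda_kI&t=30s"
print(extract_video_id(url))  # SW14tOda_kI
```

Short links and embed URLs carry the ID in the path rather than in `v`, so this sketch would return None for them.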
- We start with the basics. In the first code snippet, we get the transcript for a video ID using the .get_transcript() function.
- It returns a list of dictionaries, each containing three key-value pairs: text (the caption content), start (the time, in seconds, at which the caption phrase starts to be spoken) and duration (the time, in seconds, taken to speak the phrase completely).
- The snippet first imports the required package, then stores the list of dictionaries in a variable, and finally prints that variable.
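A minimal sketch of these steps follows. It assumes the installed release of the package still offers the static `get_transcript()` method used in this article (newer releases may expose the same functionality differently). The helper names `fetch_captions` and `format_entry` and the sample entries are made up for illustration, so the formatting part runs without network access.

```python
def fetch_captions(video_id):
    """Fetch the caption entries for one video (needs network access)."""
    # Imported here so the formatting helper below works without the package.
    from youtube_transcript_api import YouTubeTranscriptApi
    # Assumption: the installed release still offers the static get_transcript().
    return YouTubeTranscriptApi.get_transcript(video_id)


def format_entry(entry):
    """Render one caption dictionary as '[start - end] text'."""
    end = entry["start"] + entry["duration"]
    return f"[{entry['start']:.2f}s - {end:.2f}s] {entry['text']}"


# Made-up entries that mimic the three keys described above.
sample = [
    {"text": "Hello and welcome", "start": 0.0, "duration": 2.5},
    {"text": "to this tutorial", "start": 2.5, "duration": 1.75},
]
for entry in sample:
    print(format_entry(entry))
```

With network access, `print(fetch_captions("SW14tOda_kI"))` would print the real list of dictionaries for the video used in this article.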
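To get transcripts of more than one video, the IDs are passed together as a list rather than as separate arguments; older releases of the library provide YouTubeTranscriptApi.get_transcripts([videoid1, videoid2, …]) for this. It returns the transcripts keyed by video ID, together with a list of the IDs that could not be retrieved, and each transcript is again a list of dictionaries.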
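Fetching several videos can also be sketched as a plain loop over a single-video fetch function, which does not depend on any particular release of the library. Here `fetch_many` and `fake_fetch` are hypothetical names; `fake_fetch` is a stand-in so the sketch runs offline.

```python
def fetch_many(video_ids, fetch_one):
    """Fetch transcripts for several videos.

    Returns (transcripts_by_id, failed_ids).
    """
    transcripts, failed = {}, []
    for video_id in video_ids:
        try:
            transcripts[video_id] = fetch_one(video_id)
        except Exception:  # the library raises its own exception types
            failed.append(video_id)
    return transcripts, failed


# Stand-in fetcher so the sketch runs offline; in practice you would pass
# YouTubeTranscriptApi.get_transcript (or its equivalent in your release).
def fake_fetch(video_id):
    if video_id == "missing":
        raise LookupError("no captions")
    return [{"text": f"caption for {video_id}", "start": 0.0, "duration": 1.0}]


data, failed = fetch_many(["videoid1", "missing"], fake_fetch)
print(sorted(data), failed)
```

Collecting the failed IDs separately keeps one unavailable video from aborting the whole batch.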
Getting Transcript of a particular Language
Now if we want the transcript in a specific language, we can pass the language as a parameter, which is what the next code snippet does. The code works the same way as the previous example, except that this time it gets only the English transcript and ignores subtitles in other languages, if any exist.
Since the video used in this example has only English subtitles, both examples give the same result.
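As a sketch, the language is passed through the `languages` parameter, which takes a list of language codes in order of preference. The wrapper `fetch_in_language` again assumes a release that still offers the static `get_transcript()`; `pick_language` is not part of the library and only illustrates the "first preferred code that is available" idea offline.

```python
def fetch_in_language(video_id, languages=("en",)):
    """Fetch a transcript in the first available preferred language (needs network)."""
    from youtube_transcript_api import YouTubeTranscriptApi
    # Assumption: static get_transcript() with a `languages` list is available.
    return YouTubeTranscriptApi.get_transcript(video_id, languages=list(languages))


def pick_language(preferred, available):
    """Illustrative helper: return the first preferred code that is available."""
    for code in preferred:
        if code in available:
            return code
    return None


# German is preferred but not offered, so English is chosen.
print(pick_language(["de", "en"], {"en", "fr"}))  # en
```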
Getting List of all transcripts
- To get the list of all transcripts of a video, we can use the .list_transcripts() function. It returns a TranscriptList object covering every language available for the video; the object is iterable and provides methods to filter the transcripts by language and type.
- Next, we read some metadata from each transcript:
  - transcript.video_id returns the video ID of the video.
  - transcript.language returns the language of the transcript.
  - transcript.language_code returns the language code of the transcript, for example, “en” for English.
  - transcript.is_generated tells us whether the transcript was generated by YouTube (True) or created manually (False).
  - transcript.is_translatable tells us whether this transcript can be translated.
  - transcript.translation_languages gives us a list of the languages the transcript can be translated to.
- Then we use the .fetch() function to fetch the actual transcript.
- We also show how to use the .translate() function to translate the captions into another language, provided the transcript is translatable. (Since this video has only English subtitles, the benefit may not be evident here, but translation is very useful when viewers need a language the video does not offer.)
- Next, the .find_transcript() function returns the transcript, along with its metadata, for the language codes we ask for.
- Finally, we use the .find_manually_created_transcript() function to find manually created transcripts specifically. Its counterpart, .find_generated_transcript(), is not used in this example because this video has only manual captions and no generated ones.
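The walkthrough above can be sketched as follows, assuming a release that still offers the static `list_transcripts()` method. `describe` and `walk_transcripts` are hypothetical helper names, "de" is only an example target language, and the `SimpleNamespace` object is a stand-in so the metadata part runs without network access.

```python
from types import SimpleNamespace


def describe(transcript):
    """Summarise the metadata attributes of a transcript object."""
    kind = "auto-generated" if transcript.is_generated else "manually created"
    return (f"{transcript.video_id}: {transcript.language} "
            f"({transcript.language_code}), {kind}, "
            f"translatable={transcript.is_translatable}")


def walk_transcripts(video_id):
    """List, fetch, translate and filter transcripts (needs network access)."""
    from youtube_transcript_api import YouTubeTranscriptApi
    # Assumption: the installed release still offers the static list_transcripts().
    transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
    for transcript in transcript_list:
        print(describe(transcript))
        print(transcript.fetch())  # the actual captions
        if transcript.is_translatable:
            # "de" is only an example; it must be among translation_languages,
            # otherwise the library raises an error.
            print(transcript.translate("de").fetch())
    # Filter by language codes, then by how the transcript was created.
    print(transcript_list.find_transcript(["en"]))
    print(transcript_list.find_manually_created_transcript(["en"]))


# Stand-in object with the same attribute names, so describe() runs offline.
stand_in = SimpleNamespace(video_id="SW14tOda_kI", language="English",
                           language_code="en", is_generated=False,
                           is_translatable=True)
print(describe(stand_in))
```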
Writing Subtitles to Text File
Now we will see how to write the subtitles of a YouTube video to a text file. First, we import the module, get the transcript using the .get_transcript() function and store it in a variable. Then we use Python's built-in open() function inside a with statement; this context manager closes the file for us once our work is done. We open a file named subtitles.txt in write mode, iterate through each element of the list and write it to the file.
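A minimal sketch of the writing step, using made-up caption entries in place of a real transcript so that it runs offline. The helper name `write_subtitles`, the choice to write only the `text` value of each entry, the explicit UTF-8 encoding and the temporary output directory are all assumptions of this sketch rather than part of the original snippet.

```python
import os
import tempfile


def write_subtitles(entries, path="subtitles.txt"):
    """Write the text of each caption entry to `path`, one caption per line."""
    # An explicit encoding avoids errors on characters that the platform's
    # default encoding cannot represent.
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(entry["text"] + "\n")


# Made-up entries; in practice this list would come from .get_transcript().
sample = [
    {"text": "Hello and welcome", "start": 0.0, "duration": 2.5},
    {"text": "café ☕ time", "start": 2.5, "duration": 1.75},
]

# Written to the temp directory here only to keep the demo tidy.
out_path = os.path.join(tempfile.gettempdir(), "subtitles.txt")
write_subtitles(sample, out_path)
with open(out_path, encoding="utf-8") as f:
    print(f.read())
```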
If only a file name is given to open(), the file is created in the current working directory, which is the directory of the .py file when the script is run from there; to save it at a different location, we need to give its absolute or relative path. Also, the program can raise an encoding error if a caption contains characters that the file's default encoding cannot represent. Even then, the subtitle file is still created and contains whatever was written before the error occurred; passing encoding="utf-8" to open() avoids the problem.