How to Download YouTube Transcripts and Subtitles (SRT, VTT, TXT) in 2026

Q: What is the difference between SRT and VTT subtitle formats?

SRT (SubRip) is the most universally compatible format: a plain text file with numbered cue blocks, timestamps in HH:MM:SS,MS format, and subtitle text. VTT (Web Video Text Tracks) is the HTML5 web standard: similar structure but uses dot instead of comma in timestamps and supports CSS styling. SRT works in virtually all desktop video editors and players. VTT is preferred for web players and HTML5 video elements.

Q: What is the yt-dlp command to download subtitles?

To download auto-generated English subtitles without downloading the video: yt-dlp --write-auto-sub --sub-langs en --skip-download URL. To download manually created subtitles: yt-dlp --write-sub --sub-langs en --skip-download URL. To download all available subtitle languages: yt-dlp --write-sub --all-subs --skip-download URL. Add --convert-subs srt to force SRT output format regardless of what YouTube provides.

Q: How can I tell if a YouTube video has auto-generated or manual captions?

In the video's subtitle/CC menu (click CC on the player), manually created captions show as the plain language name (e.g. 'English'). Auto-generated captions show as 'English (auto-generated)' or similar. In the transcript panel, auto-captions tend to have no punctuation at sentence boundaries and may run words together without proper spacing. Manual captions are typically more accurate and properly formatted.

Q: Which tools can download transcripts from YouTube videos with no captions?

Videos with no captions require AI transcription. Services like HappyScribe and Descript can transcribe audio from a video URL or an uploaded file; Whisper (open-source) works on a local audio or video file. The process: download the video or audio file first, then upload it to the AI transcription service or run Whisper on it locally. Whisper running locally is free and handles 99 languages with reasonable accuracy for clear speech.

Q: Can I legally republish a downloaded YouTube transcript?

Transcripts are derivative works of the original video content. The copyright to the spoken content belongs to the creator. Downloading for personal use, research, or study is generally fair use. Republishing verbatim transcripts without permission is likely infringement, even if the transcript itself was auto-generated by YouTube. Summarizing or quoting short passages with attribution is a different matter and generally permitted.

Q: What format should I use for captions in video editing software?

SRT is the safest choice for compatibility with video editing software. Premiere Pro, DaVinci Resolve, Final Cut Pro, and CapCut all import SRT natively. VTT works in web contexts and some editors. TXT (plain text with no timestamps) is useful for scripts and SEO but not for burning captions into video. When in doubt, SRT first.

YouTube's Built-in Transcript Panel

Before you install anything or paste a URL into a third-party tool, check whether YouTube already solves your problem. For most videos with any captions, it does.

How to open the transcript panel

There are two paths, depending on the YouTube interface version you are using:

Below the video: Click the three-dot menu (the one next to "Share" and "Download") directly below the video title. Look for "Show transcript" in the dropdown menu.
In the description: Some YouTube interface versions show a "Transcript" link in the expanded video description. Click it and the panel opens on the right side of the video.

The transcript panel shows the full text of the video, broken into time-stamped chunks that highlight as the video plays. Clicking any text chunk jumps the video to that moment. Useful. But the default panel includes the timestamps inline.

Removing timestamps from the transcript

Inside the transcript panel, look for the three-dot menu in the panel's top-right corner. Select "Toggle timestamps" to hide the time codes. Now the text is clean enough to copy into a document, notes app, or AI tool without stripping out timestamp text manually.

Copying the transcript text

With timestamps hidden, click inside the transcript panel and select all (Ctrl+A on Windows, Cmd+A on Mac). Copy. Paste wherever you need it. This is plain text with no file format, no structure, and no timing data. For reading, searching, summarizing, or feeding into an LLM, it is completely adequate.
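
If you copied the panel text before hiding the timestamps, a few lines of Python can clean it up afterwards. This is a minimal sketch, not an official YouTube export format: it assumes each timestamp looks like 0:04 or 1:02:15 and sits either on its own line or at the start of a line. The function name strip_panel_timestamps is made up for this example.

```python
import re

# Assumption: pasted timestamps look like "0:04" or "1:02:15" and appear
# at the start of a line (or alone on a line) before the caption text.
TIMESTAMP = re.compile(r"^\s*(?:\d{1,2}:)?\d{1,2}:\d{2}\s*")

def strip_panel_timestamps(pasted: str) -> str:
    """Remove leading timestamps and join the remaining text into one paragraph."""
    chunks = []
    for line in pasted.splitlines():
        text = TIMESTAMP.sub("", line).strip()
        if text:  # skip lines that held only a timestamp
            chunks.append(text)
    return " ".join(chunks)

sample = (
    "0:04\nWelcome to the channel.\n"
    "0:07\nToday we are talking about subtitle formats."
)
print(strip_panel_timestamps(sample))
```

One caveat: a caption that genuinely starts with a clock time (for example "10:30 is when we start") would lose that prefix, so skim the result before relying on it.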

Limitations of the built-in method

The built-in transcript panel does not download a file. You get text only. There is no SRT, no VTT, and no timestamps preserved in the copied text. If you need a timestamped subtitle file for use in a video editor, accessibility tool, translation software, or any application that expects a structured caption format, you need one of the methods in the next sections.

Additionally, the transcript panel only appears if the video has captions (manual or auto-generated). Videos with captions disabled by the creator, very new uploads where auto-captioning has not processed yet, and certain music videos have no transcript panel at all. For those, jump to the AI transcription section.

Auto-Captions vs Manual Subtitles

Not all YouTube captions are equal. The difference between auto-generated and manually created captions matters for every downstream use case.

How to tell which type a video has

Click the CC button on the YouTube player to open subtitle options. If you see the language listed as "English (auto-generated)" or with the phrase "auto-generated" in any language, YouTube's speech recognition created those captions. If you see just "English" (or another language name without the auto-generated note), a human created them, either the creator or a third party.

In the YouTube Studio dashboard, creators see this distinction clearly. For external viewers, the CC menu is the reliable indicator.

Accuracy differences

Auto-generated captions on YouTube are powered by Google's ASR (automatic speech recognition) engine. Accuracy is good for clear speech, standard accents, and common vocabulary in English. It degrades meaningfully for:

Heavy accents or regional dialects
Technical jargon, proper nouns, and brand names
Fast speech or overlapping voices
Music or background noise
Non-English languages (accuracy varies widely by language)

Auto-captions also have consistent formatting problems: missing punctuation (no sentence-ending periods or commas), occasional run-on captions where multiple sentences merge, and proper nouns consistently lowercased. For casual use, this is a minor annoyance. For accessibility compliance, legal transcription, or content that requires accuracy, manual or AI-corrected transcripts are necessary.

When no captions exist

Many YouTube videos have no captions at all. This is common for older videos uploaded before YouTube introduced auto-captioning, for certain music videos where lyrics rights complicate automatic captioning, for videos in languages YouTube's ASR does not cover well, and for videos where creators explicitly disabled captions in Studio settings.

For these, you need external AI transcription, covered in detail in its own section below.

Format Guide: SRT vs VTT vs TXT

The format you download determines what you can do with the transcript. Choosing the wrong format for your use case is a common mistake that costs time.

Format	Timestamps?	Styling?	Software compatibility	Best for
SRT	Yes (HH:MM:SS,MS)	No	Near-universal (editors, players, upload)	Video editing, burning captions, YouTube upload
VTT	Yes (HH:MM:SS.MS)	Yes (CSS)	Web players, HTML5 video, some editors	Web embedding, HTML5 video, Vimeo upload
TXT	No	No	Any text editor, AI tools, SEO tools	Summarizing, SEO, newsletter content, quotes
SBV	Yes	No	YouTube-native, limited elsewhere	YouTube re-upload only
ASS/SSA	Yes	Yes (rich)	MPC, VLC, MKV containers	Anime subtitles, styled karaoke

SRT: the safe universal choice

SRT (SubRip Text) is the most compatible subtitle format in existence. It is a plain text file with a simple repeating structure: a cue number, a timestamp range, and one or more lines of subtitle text. Every major video editing application reads it. Premiere Pro, DaVinci Resolve, Final Cut Pro, Vegas Pro, and CapCut (desktop and mobile) all import SRT natively. YouTube accepts SRT for manual caption upload. When you are not sure which format to pick, pick SRT.

A typical SRT cue looks like this:

1
00:00:04,200 --> 00:00:06,800
Welcome to the channel.

2
00:00:07,100 --> 00:00:10,400
Today we are talking about subtitle formats.
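
Because the structure is this regular, an SRT file is easy to read programmatically. The sketch below parses the simple layout shown above (cue number, timing line, text) into records. The Cue class and parse_srt function are names invented for this example, and real-world files with styling tags or malformed blocks may need more care.

```python
import re
from dataclasses import dataclass

@dataclass
class Cue:
    index: int
    start: str
    end: str
    text: str

TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")

def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into a list of Cue records."""
    cues = []
    # Cue blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 2:
            continue
        match = TIMING.match(lines[1].strip())
        if not lines[0].strip().isdigit() or match is None:
            continue  # skip blocks that do not follow the simple layout
        text = "\n".join(line.strip() for line in lines[2:])
        cues.append(Cue(int(lines[0]), match.group(1), match.group(2), text))
    return cues

SAMPLE_SRT = """1
00:00:04,200 --> 00:00:06,800
Welcome to the channel.

2
00:00:07,100 --> 00:00:10,400
Today we are talking about subtitle formats.
"""

for cue in parse_srt(SAMPLE_SRT):
    print(cue.index, cue.start, cue.end, cue.text)
```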

VTT: the web standard

VTT (Web Video Text Tracks) is the HTML5 subtitle standard. The structure is nearly identical to SRT, with two visible differences: the file begins with a WEBVTT header line, and the millisecond separator in timestamps is a period instead of a comma (00:00:04.200 vs 00:00:04,200). VTT also supports CSS styling classes, speaker identification, and metadata cues, which SRT does not.

If you are embedding video on a website and using the HTML5 <track> element, VTT is what you want. For everything else, SRT is simpler and better supported.
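
Since the two formats are so close, a basic SRT file can be turned into VTT with very little code. The following is a rough sketch for plain, unstyled captions: it adds the WEBVTT header and swaps the comma for a period on timing lines only. The function name srt_to_vtt is chosen for this example; for anything beyond simple files, a dedicated converter is the safer route.

```python
import re

# Matches only timing lines, so commas inside caption text are left alone.
SRT_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})"
)

def srt_to_vtt(srt_text: str) -> str:
    """Convert simple SRT text to WebVTT text."""
    body = SRT_TIMING.sub(r"\1.\2 --> \3.\4", srt_text.strip())
    # The numeric cue numbers can stay: WebVTT allows them as cue identifiers.
    return "WEBVTT\n\n" + body + "\n"

srt_sample = """1
00:00:04,200 --> 00:00:06,800
Welcome, everyone, to the channel.
"""

print(srt_to_vtt(srt_sample))
```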

TXT: plain text, no structure

Plain text transcripts have no timestamps, no cue numbers, and no formatting. They read like a document. They are the right format for feeding into an AI summarizer, writing a blog post based on the video content, finding specific quotes, creating newsletter content, or building a knowledge base from video lectures. They are the wrong format for anything that needs timing information.

For repurposing YouTube content into written formats, TXT is the starting point. For language learning workflows where you need to synchronize text with audio, SRT is essential.
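
If you already have an SRT file and only need the TXT version, you can derive it by dropping the cue numbers and timing lines. Here is a small sketch under the assumption that the file follows the standard SRT layout; srt_to_text is a name invented for this example.

```python
import re

TIMING_LINE = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3} --> \d{2}:\d{2}:\d{2}[,.]\d{3}"
)

def srt_to_text(srt_text: str) -> str:
    """Return the caption text only, joined into a single paragraph."""
    kept = []
    lines = srt_text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped or TIMING_LINE.match(stripped):
            continue
        # A bare number directly above a timing line is a cue index, not dialogue.
        next_line = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if stripped.isdigit() and TIMING_LINE.match(next_line):
            continue
        kept.append(stripped)
    return " ".join(kept)

srt_example = """1
00:00:04,200 --> 00:00:06,800
Welcome to the channel.

2
00:00:07,100 --> 00:00:10,400
Today we are talking about subtitle formats.
"""

print(srt_to_text(srt_example))
```

Checking the line after a bare number means a caption that happens to be just a number (a year, a score) is kept rather than mistaken for a cue index.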

Tool Comparison Table

Here is the full comparison of available methods, including every meaningful criterion for choosing between them:

Tool	No install?	Formats	Batch?	Languages	AI transcription?	Cost
YouTube UI (transcript panel)	Yes	TXT only (copy-paste)	No	All available	No	Free
DownSub	Yes	SRT, VTT, TXT	Limited	All available	No	Free
NoteGPT / BibiGPT	Yes	TXT, summary	No	50+	Yes	Free tier + paid
Chrome extensions (VoiceInsight, etc.)	No (install required)	SRT, TXT	No	All available	No	Free/freemium
yt-dlp	No (CLI)	SRT, VTT, SBV, all	Yes (playlists)	All available	No	Free
HappyScribe	Yes	SRT, VTT, TXT, DOCX	Yes	140+	Yes	Paid (free trial)
Whisper (OpenAI, local)	No (Python install)	SRT, VTT, TXT, JSON	Yes (scripted)	99	Yes	Free (local compute)

DownSub: best no-install web tool for structured formats

DownSub accepts a YouTube URL and returns download links for SRT, VTT, and TXT files for all available subtitle tracks. It supports more than 50 video hosting sites beyond YouTube. No account required. The interface is straightforward: paste URL, click download, select language and format. For a one-off SRT download from a captioned video, this is the fastest no-install option.

DownSub does not perform AI transcription. If the video has no captions, DownSub has nothing to download. It only retrieves what YouTube already made available.

Chrome extensions: best for frequent watch-page use

Extensions like YouTube Subtitle Downloader and VoiceInsight add a download button directly on the YouTube watch page. Click the button and select your format. This is faster than copy-pasting or going to a web tool if you are regularly downloading transcripts as part of a research or content creation workflow. The tradeoff is maintaining the extension and dealing with occasional breakage after YouTube UI updates.

yt-dlp Subtitle Commands

yt-dlp is the most capable tool for programmatic or bulk subtitle downloads. The subtitle commands are well-documented but have enough flags to confuse people. Here are the complete commands for every common scenario.

Download auto-generated subtitles only (no video)

yt-dlp --write-auto-sub --sub-langs en --skip-download "YOUTUBE_URL"

--write-auto-sub: request the auto-generated caption track.
--sub-langs en: specify language code (en for English, ja for Japanese, es for Spanish, etc.).
--skip-download: skip the video file, download only the subtitle.

Download manually created subtitles only (no video)

yt-dlp --write-sub --sub-langs en --skip-download "YOUTUBE_URL"

--write-sub fetches manually created captions. If no manual captions exist, yt-dlp will not fall back to auto-generated unless you also add --write-auto-sub.

Download both auto-generated and manual captions

yt-dlp --write-sub --write-auto-sub --sub-langs en --skip-download "YOUTUBE_URL"

Download all available subtitle languages

yt-dlp --write-sub --all-subs --skip-download "YOUTUBE_URL"

This downloads every available language as a separate file. Useful for channels with multiple localized caption tracks.

Force SRT output format

yt-dlp --write-sub --sub-langs en --convert-subs srt --skip-download "YOUTUBE_URL"

--convert-subs srt converts whatever format yt-dlp downloads into SRT. The conversion step relies on ffmpeg, so make sure it is installed. YouTube typically serves captions as VTT, so converting to SRT ensures compatibility with most video editors.

Batch download subtitles for a playlist

yt-dlp --write-auto-sub --sub-langs en --convert-subs srt --skip-download "https://youtube.com/playlist?list=PLAYLIST_ID"

yt-dlp handles playlists natively. Each video in the playlist gets its own subtitle file, named with the video title and ID. For a 50-video course or documentary series, this pulls all transcripts in a single command.

List available subtitle languages before downloading

yt-dlp --list-subs "YOUTUBE_URL"

This shows every caption track available for the video without downloading anything. Use it to check which language codes are available before specifying --sub-langs.
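
If you run these commands from scripts, it can help to assemble the argument list in one place instead of retyping flags. The sketch below builds the same commands shown above using only the flags from this section. The function build_subtitle_command and its parameters are invented for this example and are not part of yt-dlp itself.

```python
import shlex

def build_subtitle_command(url, langs=("en",), auto=True, manual=False, convert_to=None):
    """Return a yt-dlp argument list that downloads subtitles but not the video."""
    cmd = ["yt-dlp"]
    if manual:
        cmd.append("--write-sub")
    if auto:
        cmd.append("--write-auto-sub")
    cmd += ["--sub-langs", ",".join(langs)]  # multiple codes are comma-separated
    if convert_to:
        cmd += ["--convert-subs", convert_to]
    cmd += ["--skip-download", url]
    return cmd

command = build_subtitle_command("YOUTUBE_URL", convert_to="srt")
print(shlex.join(command))

# To actually run it (requires yt-dlp on your PATH):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems with URLs that contain characters like & or ?.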

For users who also need the video file and subtitles together, the ffmpeg guide explains how to work with subtitle tracks embedded in video containers.

AI Transcription for Captionless Videos

A significant percentage of YouTube videos have no captions. Older content, smaller creators who have not set up Studio captioning, music videos, and content in lower-supported languages often lack any caption track. For these, you need AI speech recognition to generate a transcript from the audio.

How it works

AI transcription tools take audio or video as input and return text, usually with timestamps. The quality of the transcription depends on the audio quality, speech clarity, speaker accent, and the model used. Modern AI transcription in 2026 routinely achieves better than 90% word accuracy (a word error rate below 10%) for clean English speech. For heavy accents or technical vocabulary, expect 80-85% accuracy or lower.
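
Accuracy figures like these are usually derived from word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the machine transcript into a trusted reference, divided by the number of words in the reference. The following is a minimal sketch for spot-checking a tool against a passage you have corrected by hand. It lowercases and splits on whitespace only, which is a simplification; published benchmarks normalize text more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Dynamic-programming edit distance, computed over words instead of characters.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # word missing from the hypothesis
                current[j - 1] + 1,      # extra word in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)

reference = "today we are talking about subtitle formats on the web"
hypothesis = "today we are talking about subtitle format on web"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.0%}, word accuracy: {1 - wer:.0%}")  # WER: 20%, word accuracy: 80%
```

In this example the reference has 10 words and the hypothesis contains one substitution and one dropped word, so the WER is 2/10.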

The process for a YouTube video with no captions:

Download the video or audio. (yt-dlp handles this: yt-dlp -x --audio-format mp3 "YOUTUBE_URL")
Upload the audio file to your transcription tool, or point the tool at the URL directly.
Wait for transcription to complete. Speed varies: a 1-hour video typically processes in 3-8 minutes depending on the tool and your compute.
Download the output in your preferred format (SRT, VTT, TXT).

OpenAI Whisper (free, local)

Whisper is an open-source speech recognition model from OpenAI. It runs locally, which means your audio never leaves your machine. It supports 99 languages. It produces SRT, VTT, JSON, and TXT output.

# Install Whisper
pip install openai-whisper

# Transcribe a local audio file to SRT
whisper audio.mp3 --model medium --output_format srt

# Transcribe with language hint for non-English
whisper audio.mp3 --model medium --language ja --output_format srt

The medium model balances accuracy and speed for most use cases. The large-v3 model is more accurate but requires significant GPU memory and takes longer. On a modern GPU, a 1-hour video transcribes in approximately 5-10 minutes with the medium model. On CPU only, plan for 30-60 minutes per hour of audio.
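
Whisper can also be called from Python, which is handy when you want to post-process the result. The sketch below assumes the openai-whisper package's load_model and transcribe functions and the "segments" list in the returned dictionary (each segment carrying "start", "end", and "text"). The helper names format_srt_timestamp, segments_to_srt, and transcribe_to_srt are invented for this example, and the CLI's built-in SRT writer remains the simpler choice for most cases.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Convert seconds (e.g. 4.2) to the SRT form 00:00:04,200."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def segments_to_srt(segments) -> str:
    """Build SRT text from dicts with 'start', 'end' (seconds) and 'text' keys."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        start = format_srt_timestamp(seg["start"])
        end = format_srt_timestamp(seg["end"])
        blocks.append(f"{number}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

def transcribe_to_srt(audio_path: str, model_name: str = "medium") -> str:
    """Run Whisper locally and return the transcript as SRT text."""
    import whisper  # imported lazily so the helpers above work without it installed
    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return segments_to_srt(result["segments"])

# Demonstration with hand-written segments (no model download needed):
demo_segments = [
    {"start": 4.2, "end": 6.8, "text": " Welcome to the channel."},
    {"start": 7.1, "end": 10.4, "text": " Today we are talking about subtitle formats."},
]
print(segments_to_srt(demo_segments))
```

To transcribe a real file, call transcribe_to_srt("audio.mp3") and write the returned string to a .srt file.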

HappyScribe: best web-based AI transcription

HappyScribe is a paid service that supports 140+ languages and accents, offers an interactive editor for correcting transcription errors, and exports to SRT, VTT, TXT, DOCX, and PDF. It can accept a YouTube URL directly, which skips the manual audio download step. Accuracy for English is competitive with Whisper large; for lower-resource languages it may outperform local Whisper models due to proprietary training data.

For occasional transcription needs, HappyScribe's per-minute pricing is reasonable. For high volume, Whisper running locally is more economical.

Other notable AI transcription tools

Descript combines transcription with a video editor where you edit video by editing the transcript text. It is a different workflow paradigm and works well for podcast and interview content. Otter.ai focuses on meeting transcription but handles YouTube audio. AssemblyAI provides a developer API for building transcription into applications.

Use Cases

Transcripts are more useful than most creators realize. Here is where they actually get used.

SEO and content repurposing

A YouTube transcript becomes a blog post scaffold. The structure is already there: the speaker moved from topic to topic in a logical order over the course of the video. Extracting the transcript, cleaning it up, adding subheadings, and rewriting for a reading audience produces a long-form article faster than writing from scratch. This is the core workflow behind the YouTube content repurposing strategies that content teams use consistently.

Google's search index also crawls YouTube caption text to understand what a video is about. Uploading a clean, accurate manual SRT to your YouTube video improves the likelihood that your content shows up for relevant search queries. Auto-captions are indexed but manual captions, being more accurate, tend to produce better keyword matches.

Translation and localization

An SRT file is the starting point for localizing a video into another language. Translators work from the SRT directly. The timing is already set; they translate the text within each cue block. The resulting translated SRT gets uploaded to YouTube as an additional caption track or burned into a dubbed video.
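
The same cue-by-cue idea can be sketched in code: keep each cue number and timing line untouched and replace only the text. In the example below, translate_srt is an invented name, and the translate argument is a placeholder for whatever a human translator or translation service supplies. The demo uses str.upper purely as a stand-in so the example runs without any external service.

```python
import re
from typing import Callable

TIMING_LINE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def translate_srt(srt_text: str, translate: Callable[[str], str]) -> str:
    """Apply translate() to each cue's text while preserving numbering and timing."""
    out = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) >= 3 and TIMING_LINE.match(lines[1].strip()):
            # Keep the cue number and timing; translate only the caption text.
            text = translate(" ".join(line.strip() for line in lines[2:]))
            out.append("\n".join([lines[0].strip(), lines[1].strip(), text]))
        else:
            out.append(block)  # leave anything unexpected untouched
    return "\n\n".join(out) + "\n"

source_srt = """1
00:00:04,200 --> 00:00:06,800
Welcome to the channel.

2
00:00:07,100 --> 00:00:10,400
Today we are talking about subtitle formats.
"""

# str.upper stands in for a real translation step.
print(translate_srt(source_srt, str.upper))
```

Note that this sketch joins multi-line cues into a single line before translating. Translated text often runs longer than the original, so line breaks and cue durations usually need a manual pass afterwards.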

For language learning, having bilingual SRT files, one in your target language and one in your native language, creates a powerful study resource. Reading synchronized text in two languages while listening to native audio is a widely used method for building comprehension.

Accessibility compliance

Businesses, educational institutions, and governments face legal requirements to provide accessible video content. In the United States, the ADA and Section 508 require accurate captions for video content used in certain contexts. Auto-generated captions alone do not reliably meet accuracy standards. Having a corrected SRT file, whether manually created or AI-generated and reviewed, is the documented standard for compliance.

Research, journalism, and quoting

Researchers and journalists working with video content need to find specific quotes, verify what was said, and cite accurately. A text transcript makes search trivial. Ctrl+F for a keyword in a 3-hour transcript takes a second. Scrubbing through audio to find a moment takes far longer. Legal contexts (litigation, regulatory proceedings, arbitration) sometimes require verbatim transcripts of video content. AI-generated transcripts with human review have been used in these contexts when official court reporters are not involved.
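
When you keep the SRT rather than the plain text, the same keyword search can also tell you where in the video the quote occurs. The following is a small sketch, assuming a standard SRT layout; find_in_srt is a name invented for this example.

```python
import re

# Captures the start time (to the second) from an SRT timing line.
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}),\d{3} --> ")

def find_in_srt(srt_text: str, keyword: str) -> list[tuple[str, str]]:
    """Return (start_time, caption_text) for every cue containing the keyword."""
    hits = []
    needle = keyword.lower()
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.match(line.strip())
            if match:
                text = " ".join(part.strip() for part in lines[i + 1:])
                if needle in text.lower():
                    hits.append((match.group(1), text))
                break  # only one timing line per cue block
    return hits

transcript = """1
00:00:04,200 --> 00:00:06,800
Welcome to the channel.

2
00:00:07,100 --> 00:00:10,400
Today we are talking about subtitle formats.
"""

for start, text in find_in_srt(transcript, "subtitle"):
    print(start, text)
```

A phrase that spans two cues will not match with this approach, so for exact quotes it is worth searching for a shorter distinctive fragment.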

Newsletter content and podcast show notes

Podcast show notes are often written by feeding the transcript to an AI tool and asking for a structured summary with timestamps for key moments. The transcript provides the raw material; the AI organizes it. The result is a show notes page that search engines can index and listeners can skim to decide whether an episode is relevant to them.

Newsletter writers who cover YouTube channels or podcast-style content regularly pull transcripts as source material. Summarizing what a creator said, with accurate attribution, is much easier when you have the text.

Legal Note

Transcripts are derivative works. The spoken content in a YouTube video belongs to the creator or whoever holds the copyright to that content. Downloading a transcript for personal use (reading, research, study, note-taking) is generally considered fair use under US law, though the rules vary by jurisdiction. Enforcement actions against individuals for downloading transcripts for personal use appear to be rare.

Republishing verbatim transcripts creates a different legal situation. If you take a creator's full transcript and publish it on your own website or in a book, you are reproducing their copyrighted content in full. The auto-generated nature of the caption does not affect this. YouTube did the transcription work, but the words belong to the speaker.

The legal questions become more nuanced with fair use analysis: how much of the work did you reproduce, for what purpose, and does it affect the market for the original? Short quotes with attribution for commentary and criticism are standard fair use. Verbatim reproduction of an entire video's transcript is not.

For commercial uses (building a product that processes YouTube transcripts at scale, publishing translated versions, training AI on creator content), get proper licensing or legal advice specific to your situation.

Frequently Asked Questions

How do I download a YouTube transcript without any tools?

Open the video on YouTube, click the three-dot menu below the video title, and select "Show transcript." Inside the transcript panel, use the three-dot menu to toggle timestamps off if you want clean text. Then select all and copy. You get plain text, no file, but it works for most reading and summarizing purposes without installing anything.

What is the difference between SRT and VTT subtitle formats?

SRT is the universal standard for desktop software, video editors, and YouTube uploads. VTT is the HTML5 web standard and adds CSS styling support. Both contain timestamped subtitle text. The main structural difference is the timestamp format: SRT uses a comma as the millisecond separator, VTT uses a period. When in doubt, use SRT. The main exception is the HTML5 <track> element on a web page, where VTT is the right choice.

What is the yt-dlp command to download subtitles?

For auto-generated English subtitles without downloading the video: yt-dlp --write-auto-sub --sub-langs en --skip-download URL. For manual captions: replace --write-auto-sub with --write-sub. Add --convert-subs srt to force SRT output. For a full playlist, pass the playlist URL instead of a single video URL.

How can I tell if a YouTube video has auto-generated or manual captions?

Click the CC (closed captions) button on the YouTube player. If the language shows as "English (auto-generated)" or any variant with "auto-generated," YouTube's speech recognition created those captions. If it just shows the language name without the qualifier, a human created them. In the transcript text itself, auto-captions typically lack punctuation and have inconsistent capitalization.

Which tools can download transcripts from YouTube videos with no captions?

Videos with no captions require AI speech recognition. OpenAI's Whisper (free, runs locally, 99 languages) is the most accessible free option. HappyScribe and Descript are web-based paid services that accept YouTube URLs directly. The workflow: download the video's audio with yt-dlp, then feed the audio file to Whisper or upload it to a web service.

Can I legally republish a downloaded YouTube transcript?

The spoken content belongs to the creator regardless of who transcribed it. Downloading for personal use is generally fine. Republishing verbatim transcripts without permission is likely copyright infringement, even if YouTube generated the captions automatically. Quoting short passages with attribution for commentary or analysis is a different matter and typically falls under fair use.

What format should I use for captions in video editing software?

SRT. Premiere Pro, DaVinci Resolve, Final Cut Pro, Vegas Pro, and CapCut all import SRT natively. VTT works in some editors but not all. TXT has no timing information so it cannot be used as a subtitle track. If you downloaded a VTT or another format and your editor does not recognize it, convert it to SRT using a free converter like SubtitleConverter.net or the ffmpeg command ffmpeg -i input.vtt output.srt.

The transcript is the text layer beneath every YouTube video. Getting it out is rarely complicated once you know which tool to use for which situation. For captioned videos, DownSub and yt-dlp handle everything. For captionless videos, Whisper handles the heavy lifting locally and free. The built-in YouTube transcript panel handles the quick copy-paste case without any installation.

Once you have the transcript, the video cutting workflow becomes faster. You can search the transcript for a specific line, find the timestamp, and cut to that exact moment without scrubbing blindly through the timeline.