Baoyu Youtube Transcript

by Jim Liu 宝玉 · GitHub ↗ · v1.103.1 · MIT-0
cross-platform ⚠ suspicious
708 downloads · 0 stars · 11 active installs · 5 versions
Install in OpenClaw
/install baoyu-youtube-transcript
Description
Downloads YouTube video transcripts/subtitles and cover images by URL or video ID. Supports multiple languages, translation, chapters, and speaker identifica...
README (SKILL.md)

YouTube Transcript

Downloads transcripts (subtitles/captions) from YouTube videos. Works with both manually created and auto-generated transcripts. No API key or browser required — uses YouTube's InnerTube API directly and automatically falls back to yt-dlp when YouTube blocks the direct API path.

Fetches video metadata and the cover image on first run, and caches raw data for fast re-formatting.

Script Directory

Scripts live in the scripts/ subdirectory. {baseDir} is the directory path of this SKILL.md. Resolve the ${BUN_X} runtime as follows: if bun is installed, use bun; if npx is available, use npx -y bun; otherwise suggest installing bun. Replace {baseDir} and ${BUN_X} with the actual values before running any command.
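The runtime resolution rule above can be sketched as a small pure function. This is only an illustration: `resolveBunX` and `buildCommand` are hypothetical names, and the two booleans stand in for whatever check the agent uses to detect bun and npx.

```typescript
// Hypothetical helpers, not part of the skill's scripts.
type Runtime = { command: string } | { command: null; advice: string };

// Mirrors the documented order: bun first, then npx -y bun, else advise installing bun.
function resolveBunX(hasBun: boolean, hasNpx: boolean): Runtime {
  if (hasBun) return { command: "bun" };
  if (hasNpx) return { command: "npx -y bun" };
  return { command: null, advice: "Install bun, then re-run the script." };
}

// Substitutes ${BUN_X} and {baseDir} into the documented command shape.
function buildCommand(bunX: string, baseDir: string, args: string[]): string {
  return [bunX, baseDir + "/scripts/main.ts"].concat(args).join(" ");
}
```

For example, `buildCommand("bun", "/skills/yt", ["--list"])` yields `bun /skills/yt/scripts/main.ts --list`.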

| Script | Purpose |
|---|---|
| scripts/main.ts | Transcript download CLI |

Usage

# Default: markdown with timestamps (English)
${BUN_X} {baseDir}/scripts/main.ts <youtube-url-or-id>

# Specify languages (priority order)
${BUN_X} {baseDir}/scripts/main.ts <url> --languages zh,en,ja

# Without timestamps
${BUN_X} {baseDir}/scripts/main.ts <url> --no-timestamps

# With chapter segmentation
${BUN_X} {baseDir}/scripts/main.ts <url> --chapters

# With speaker identification (requires AI post-processing)
${BUN_X} {baseDir}/scripts/main.ts <url> --speakers

# SRT subtitle file
${BUN_X} {baseDir}/scripts/main.ts <url> --format srt

# Translate transcript
${BUN_X} {baseDir}/scripts/main.ts <url> --translate zh-Hans

# List available transcripts
${BUN_X} {baseDir}/scripts/main.ts <url> --list

# Force re-fetch (ignore cache)
${BUN_X} {baseDir}/scripts/main.ts <url> --refresh

Options

| Option | Description | Default |
|---|---|---|
| <url-or-id> | YouTube URL or video ID (multiple allowed) | Required |
| --languages <codes> | Language codes, comma-separated, in priority order | en |
| --format <fmt> | Output format: text, srt | text |
| --translate <code> | Translate to specified language code | |
| --list | List available transcripts instead of fetching | |
| --timestamps | Include [HH:MM:SS → HH:MM:SS] timestamps per paragraph | on |
| --no-timestamps | Disable timestamps | |
| --chapters | Chapter segmentation from video description | |
| --speakers | Raw transcript with metadata for speaker identification | |
| --exclude-generated | Skip auto-generated transcripts | |
| --exclude-manually-created | Skip manually created transcripts | |
| --refresh | Force re-fetch, ignore cached data | |
| -o, --output <path> | Save to specific file path | auto-generated |
| --output-dir <dir> | Base output directory | youtube-transcript |
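To make the interaction between --languages, --exclude-generated, and --exclude-manually-created concrete, here is a minimal sketch of priority-ordered selection. The `TranscriptInfo` shape and `pickTranscript` are hypothetical. Preferring a manually created transcript over an auto-generated one in the same language is an assumption, not something the README states.

```typescript
// Hypothetical shapes for illustration only.
interface TranscriptInfo { languageCode: string; isGenerated: boolean; }
interface PickOptions { excludeGenerated?: boolean; excludeManuallyCreated?: boolean; }

function pickTranscript(
  available: TranscriptInfo[],
  languages: string[],
  opts: PickOptions = {}
): TranscriptInfo | null {
  // Apply the two exclusion flags first.
  const candidates = available.filter((t) =>
    t.isGenerated ? !opts.excludeGenerated : !opts.excludeManuallyCreated
  );
  // Walk the language list in priority order and stop at the first hit.
  for (let i = 0; i < languages.length; i++) {
    const matches = candidates.filter((t) => t.languageCode === languages[i]);
    if (matches.length === 0) continue;
    // Assumption: manual captions beat auto-generated ones for the same language.
    const manual = matches.filter((t) => !t.isGenerated);
    return manual.length > 0 ? manual[0] : matches[0];
  }
  return null; // Corresponds to the "No transcript found" error case.
}
```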

Optional Environment Variables

| Variable | Description |
|---|---|
| YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER | Passed to yt-dlp --cookies-from-browser during fallback, e.g. chrome, safari, firefox, or chrome:Profile 1 |
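One way the variable could be turned into yt-dlp arguments is sketched below. `cookieArgs` is a hypothetical helper; only the `--cookies-from-browser` flag and the variable name come from the text above.

```typescript
// Hypothetical helper: maps the optional env var to yt-dlp arguments.
function cookieArgs(env: { [key: string]: string | undefined }): string[] {
  const value = (env["YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER"] || "").trim();
  // Unset or blank means no cookie arguments are added.
  return value.length > 0 ? ["--cookies-from-browser", value] : [];
}
```

Returning the value as its own argv entry, rather than splicing it into a shell string, means a value with a space such as `chrome:Profile 1` needs no extra quoting when the arguments are passed to a process-spawning call as an array.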

Input Formats

Accepts any of these as video input:

  • Full URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ
  • Short URL: https://youtu.be/dQw4w9WgXcQ
  • Embed URL: https://www.youtube.com/embed/dQw4w9WgXcQ
  • Shorts URL: https://www.youtube.com/shorts/dQw4w9WgXcQ
  • Video ID: dQw4w9WgXcQ
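A regex-based extractor covering the five input shapes listed above might look like the following. It is a sketch under the assumption that video IDs are 11 characters drawn from letters, digits, `_`, and `-`; `extractVideoId` is a hypothetical name and the real script's patterns may differ.

```typescript
// Assumption: a video ID is exactly 11 characters from [A-Za-z0-9_-].
const ID_CHARS = "[A-Za-z0-9_-]{11}";
const ID_END = "(?![A-Za-z0-9_-])"; // The ID must not continue past 11 characters.
const BARE_ID = new RegExp("^" + ID_CHARS + "$");
const URL_PATTERNS = [
  new RegExp("[?&]v=(" + ID_CHARS + ")" + ID_END),                             // watch?v=ID
  new RegExp("youtu\\.be/(" + ID_CHARS + ")" + ID_END),                        // youtu.be/ID
  new RegExp("youtube\\.com/(?:embed|shorts)/(" + ID_CHARS + ")" + ID_END),    // embed/ID, shorts/ID
];

function extractVideoId(input: string): string | null {
  const trimmed = input.trim();
  if (BARE_ID.test(trimmed)) return trimmed;
  for (let i = 0; i < URL_PATTERNS.length; i++) {
    const match = URL_PATTERNS[i].exec(trimmed);
    if (match) return match[1];
  }
  return null; // Not a recognized YouTube URL or ID.
}
```

Accepting only a strict character set has a side benefit: the extracted ID is safe to embed in file paths or process arguments.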

Output Formats

| Format | Extension | Description |
|---|---|---|
| text | .md | Markdown with frontmatter (incl. description), title heading, summary, optional TOC/cover/timestamps/chapters/speakers |
| srt | .srt | SubRip subtitle format for video players |

Output Directory

youtube-transcript/
├── .index.json                          # Video ID → directory path mapping (for cache lookup)
└── {channel-slug}/{title-full-slug}/
    ├── meta.json                        # Video metadata (title, channel, description, duration, chapters, etc.)
    ├── transcript-raw.json              # Raw transcript snippets from YouTube API (cached)
    ├── transcript-sentences.json        # Sentence-segmented transcript (split by punctuation, merged across snippets)
    ├── imgs/
    │   └── cover.jpg                    # Video thumbnail
    ├── transcript.md                    # Markdown transcript (generated from sentences)
    └── transcript.srt                   # SRT subtitle (generated from raw snippets, if --format srt)
  • {channel-slug}: Channel name in kebab-case
  • {title-full-slug}: Full video title in kebab-case
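The directory layout above can be illustrated with a small path builder. This is a rough sketch: `toKebabSlug`, `transcriptDir`, and `registerInIndex` are hypothetical, and the slug rule shown is ASCII-only, which is an assumption. The real script presumably handles non-Latin channel names and titles differently.

```typescript
// Assumption: ASCII-only kebab-case; the actual slug rules are not documented here.
function toKebabSlug(name: string): string {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // Collapse every non-alphanumeric run into one hyphen.
    .replace(/^-+|-+$/g, "");    // Trim leading and trailing hyphens.
}

// Builds {output-dir}/{channel-slug}/{title-full-slug}.
function transcriptDir(outputDir: string, channel: string, title: string): string {
  return [outputDir, toKebabSlug(channel), toKebabSlug(title)].join("/");
}

// Mirrors the role of .index.json: video ID mapped to its directory for cache lookup.
function registerInIndex(
  index: { [videoId: string]: string },
  videoId: string,
  dir: string
): void {
  index[videoId] = dir;
}
```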

The --list mode outputs to stdout only (no file saved).

Caching

On first fetch, the script saves:

  • meta.json: video metadata, chapters, cover image path, language info
  • transcript-raw.json: raw transcript snippets from the YouTube API ({ text, start, duration }[])
  • transcript-sentences.json: sentence-segmented transcript ({ text, start: "HH:mm:ss", end: "HH:mm:ss" }[]), split at sentence-ending punctuation (.?!…。？！ etc.), with timestamps allocated in proportion to character length and CJK-aware text merging
  • imgs/cover.jpg: video thumbnail
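The sentence segmentation described above (merge snippets, split at sentence-ending punctuation, allocate timestamps in proportion to character length) can be sketched as follows. This is an illustrative reimplementation, not the script's code: the punctuation set, the CJK range check, and all function names are assumptions, and abbreviations or decimals such as "3.14" would be split incorrectly.

```typescript
interface Snippet { text: string; start: number; duration: number; }
interface Sentence { text: string; start: string; end: string; }

const SENTENCE_END = ".?!…。？！"; // Assumed punctuation set.
const isEnd = (ch: string) => ch !== "" && SENTENCE_END.indexOf(ch) !== -1;

// Rough CJK check (assumption): CJK punctuation, kana, ideographs, fullwidth forms.
function isCjk(ch: string): boolean {
  const code = ch.charCodeAt(0);
  return (code >= 0x3000 && code <= 0x9fff) || (code >= 0xff00 && code <= 0xffef);
}

function formatTime(seconds: number): string {
  const total = Math.max(0, Math.floor(seconds));
  const pad = (n: number) => (n < 10 ? "0" + n : String(n));
  return pad(Math.floor(total / 3600)) + ":" + pad(Math.floor((total % 3600) / 60)) + ":" + pad(total % 60);
}

function segmentSentences(snippets: Snippet[]): Sentence[] {
  const out: Sentence[] = [];
  let buf = "";
  let bufStart = 0;
  let lastTime = 0;
  const flush = (end: number) => {
    const text = buf.trim();
    if (text.length > 0) out.push({ text, start: formatTime(bufStart), end: formatTime(end) });
    buf = "";
  };
  for (let i = 0; i < snippets.length; i++) {
    const { text, start, duration } = snippets[i];
    const clean = text.replace(/\s+/g, " ").trim();
    if (clean.length === 0) continue;
    // CJK-aware merge: join snippets with a space unless both sides are CJK.
    if (buf.length > 0 && !(isCjk(buf.charAt(buf.length - 1)) && isCjk(clean.charAt(0)))) buf += " ";
    for (let j = 0; j < clean.length; j++) {
      const ch = clean.charAt(j);
      // A new sentence starts at the time interpolated for its first visible character.
      if (buf.trim().length === 0) { bufStart = start + duration * (j / clean.length); buf = ""; }
      buf += ch;
      // Keep runs like "..." or "?!" together; end the sentence after the last mark.
      if (isEnd(ch) && !isEnd(clean.charAt(j + 1))) flush(start + duration * ((j + 1) / clean.length));
    }
    lastTime = start + duration;
  }
  flush(lastTime); // Trailing text without closing punctuation.
  return out;
}
```

Each character of a snippet is treated as occupying an equal share of that snippet's duration, so a sentence boundary 60% of the way through a 4-second snippet lands 2.4 seconds after the snippet's start.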

Subsequent runs for the same video use cached data (no network calls). Use --refresh to force re-fetch. If a different language is requested, the cache is automatically refreshed.
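The cache rule in the paragraph above reduces to a small decision function. The sketch below is hypothetical: `CachedMeta.language` is an assumed field name for the "language info" stored in meta.json, and treating a cached language that appears anywhere in the requested list as a hit is an assumption about how "a different language" is judged.

```typescript
// Hypothetical shape: only the field needed for the decision.
interface CachedMeta { language: string; }

function shouldRefetch(
  cached: CachedMeta | null,
  requestedLanguages: string[],
  refresh: boolean
): boolean {
  if (refresh) return true;          // --refresh always forces a re-fetch.
  if (cached === null) return true;  // Nothing cached yet for this video.
  // Assumption: the cache is reused if its language is one of the requested ones.
  return requestedLanguages.indexOf(cached.language) === -1;
}
```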

When YouTube returns anti-bot / blocked responses on the direct InnerTube path, the script retries with alternate client identities and then falls back to yt-dlp if available. If fallback is needed but yt-dlp is unavailable, the agent should decide how to make yt-dlp available and continue rather than pushing the installation decision to the user.
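The retry order described above (direct path, alternate client identities, then yt-dlp) is an ordered fallback chain. A generic sketch follows. `firstSuccessful` and the attempt names used in the usage note are illustrative, and the function is written synchronously to stay short, whereas the real network path is presumably asynchronous.

```typescript
interface Attempt<T> { name: string; run: () => T; }
interface Outcome<T> { value: T; via: string; failures: string[]; }

// Runs attempts in order and returns the first result that does not throw.
function firstSuccessful<T>(attempts: Attempt<T>[]): Outcome<T> {
  const failures: string[] = [];
  for (let i = 0; i < attempts.length; i++) {
    try {
      return { value: attempts[i].run(), via: attempts[i].name, failures };
    } catch (err) {
      // Record why this attempt failed, then move on to the next one.
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(attempts[i].name + ": " + reason);
    }
  }
  throw new Error("All attempts failed: " + failures.join("; "));
}
```

A caller would list something like a web-client attempt, an alternate-client attempt, and a yt-dlp attempt in that order. The `failures` array keeps the reasons for earlier failures so the final error message can explain what was tried.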

SRT output (--format srt) is generated from transcript-raw.json. Text/markdown output uses transcript-sentences.json for natural sentence boundaries.
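As an illustration of how raw snippets map onto SRT cues, here is a minimal converter. The `{ text, start, duration }` shape comes from the caching section; `toSrt` and `srtTime` are hypothetical names, and details such as overlap handling in the real script are not covered.

```typescript
interface Snippet { text: string; start: number; duration: number; }

function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// SRT timestamps use the form HH:MM:SS,mmm (comma before milliseconds).
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + "," + pad(ms % 1000, 3);
}

// Each cue: 1-based index, time range, text, then a blank line between cues.
function toSrt(snippets: Snippet[]): string {
  return snippets
    .map((sn, i) =>
      (i + 1) + "\n" + srtTime(sn.start) + " --> " + srtTime(sn.start + sn.duration) + "\n" + sn.text + "\n"
    )
    .join("\n");
}
```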

Workflow

When user provides a YouTube URL and wants the transcript:

  1. Run with --list first if the user hasn't specified a language, to show available options
  2. Always single-quote the URL when running the script — zsh treats ? as a glob wildcard, so an unquoted YouTube URL causes "no matches found": use 'https://www.youtube.com/watch?v=ID'
  3. Default: run with --chapters --speakers for the richest output (chapters + speaker identification)
  4. The script auto-saves cached data + output file and prints the file path
  5. For --speakers mode: after the script saves the raw file, follow the speaker identification workflow below to post-process with speaker labels
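When an agent assembles the command line as a string, the single-quoting advice in step 2 can be applied programmatically. The helper below is a hypothetical sketch of the standard POSIX-shell technique: wrap the argument in single quotes and rewrite any embedded single quote as `'\''`.

```typescript
// Hypothetical helper: makes one argument safe for sh, bash, or zsh command strings.
function shellQuote(arg: string): string {
  // Inside single quotes nothing is special, so ? and & in URLs are left alone.
  // An embedded ' is closed, escaped, and reopened: ' becomes '\''
  return "'" + arg.replace(/'/g, "'\\''") + "'";
}
```

For example, `shellQuote("https://www.youtube.com/watch?v=ID")` returns the URL wrapped in single quotes, which keeps zsh from treating `?` as a glob character.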

When user only wants a cover image or metadata, running the script with any option will also cache meta.json and imgs/cover.jpg.

When re-formatting the same video (e.g., first text then SRT), the cached data is reused — no re-fetch needed.

Chapter & Speaker Workflow

Chapters (--chapters)

The script parses chapter timestamps from the video description (e.g., 0:00 Introduction), segments the transcript by chapter boundaries, groups snippets into readable paragraphs, and saves as .md with a Table of Contents. No further processing needed.

If no chapter timestamps exist in the description, the transcript is output as grouped paragraphs without chapter headings.
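Chapter parsing of description lines such as `0:00 Introduction` can be sketched as below. This is an assumption-laden illustration: the accepted line shape (timestamp at the start of the line, then whitespace, then a title) and the names `parseChapters` and `chapterIndexAt` are hypothetical, and the real parser may accept more variants.

```typescript
interface Chapter { start: number; title: string; }

// Matches "M:SS Title", "MM:SS Title", or "H:MM:SS Title" at the start of a line.
const CHAPTER_LINE = /^\s*(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\s+(.+?)\s*$/;

function parseChapters(description: string): Chapter[] {
  const chapters: Chapter[] = [];
  const lines = description.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const m = CHAPTER_LINE.exec(lines[i]);
    if (!m) continue; // Ordinary description text is skipped.
    const hours = m[1] ? parseInt(m[1], 10) : 0;
    const start = hours * 3600 + parseInt(m[2], 10) * 60 + parseInt(m[3], 10);
    chapters.push({ start, title: m[4] });
  }
  return chapters;
}

// Returns the index of the chapter containing `time`, or -1 if before the first one.
// Assumes chapters are sorted by ascending start time.
function chapterIndexAt(chapters: Chapter[], time: number): number {
  let index = -1;
  for (let i = 0; i < chapters.length; i++) {
    if (chapters[i].start <= time) index = i;
  }
  return index;
}
```

An empty result from `parseChapters` corresponds to the no-chapters case above, where the transcript is emitted as grouped paragraphs without headings.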

Speaker Identification (--speakers)

Speaker identification requires AI processing. The script outputs a raw .md file containing:

  • YAML frontmatter with video metadata (title, channel, date, cover, description, language)
  • Video description (for speaker name extraction)
  • Chapter list from description (if available)
  • Raw transcript in SRT format (pre-computed start/end timestamps, token-efficient)

After the script saves the raw file, spawn a sub-agent (use a cheaper model like Sonnet for cost efficiency) to process speaker identification:

  1. Read the saved .md file
  2. Read the prompt template at {baseDir}/prompts/speaker-transcript.md
  3. Process the raw transcript following the prompt:
    • Identify speakers using video metadata (title → guest, channel → host, description → names)
    • Detect speaker turns from conversation flow, question-answer patterns, and contextual cues
    • Segment into chapters (use description chapters if available, else create from topic shifts)
    • Format with **Speaker Name:** labels, paragraph grouping (2-4 sentences), and [HH:MM:SS → HH:MM:SS] timestamps
  4. Overwrite the .md file with the processed transcript (keep the YAML frontmatter)

When --speakers is used, --chapters is implied — the processed output always includes chapter segmentation.
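Step 4 of the sub-agent procedure, overwriting the file while keeping the YAML frontmatter, can be sketched as a pure string operation. `keepFrontmatter` is a hypothetical helper, and it assumes the frontmatter is the first block in the file, delimited by `---` lines.

```typescript
// Hypothetical helper: keeps the leading YAML frontmatter and swaps in a new body.
function keepFrontmatter(original: string, newBody: string): string {
  // Frontmatter: an opening "---" line, any content, then a closing "---" line.
  const match = /^---\r?\n[\s\S]*?\r?\n---[ \t]*(?:\r?\n|$)/.exec(original);
  if (!match) return newBody; // No frontmatter found: the body replaces everything.
  // Normalize to exactly one blank line between frontmatter and body.
  return match[0].replace(/\s*$/, "\n\n") + newBody.replace(/^\s+/, "");
}
```

The result would then be written back to the same .md path, so the metadata block produced by the script survives the speaker-labeling pass.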

Error Cases

| Error | Meaning |
|---|---|
| Transcripts disabled | Video has no captions at all |
| No transcript found | Requested language not available |
| Video unavailable | Video deleted, private, or region-locked |
| IP blocked | Too many requests, try again later |
| Age restricted | Video requires login for age verification |
| bot detected | The script retries with alternate clients and then with yt-dlp. If the fallback tooling is missing, the agent should resolve that itself. If the request still fails, try YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER=safari (or your browser) |
Usage Guidance
This skill appears to do exactly what it says: fetch YouTube transcripts and thumbnails, cache them locally, and optionally fall back to yt-dlp. Things to consider before installing/running:

  • It will perform network requests to YouTube and write files under the output directory (default ./youtube-transcript). If you care about disk location or multi-user privacy, set --output-dir to a suitable path.
  • The code may spawn yt-dlp as a fallback (child_process.spawnSync is present). If yt-dlp is installed on your system, the skill may execute it; if you prefer not to allow that, remove or restrict yt-dlp.
  • The optional env var YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER allows passing browser cookie sources to yt-dlp; provide it only if you understand the implications.
  • Because the script extracts data from YouTube pages (including pulling an InnerTube API key from HTML), review the code yourself if you require higher assurance. Running in a sandbox or isolated environment is prudent for new skills.

Overall: coherent and proportional to its purpose, with standard filesystem and network behavior for a downloader tool.
Capability Analysis
Type: OpenClaw Skill
Name: baoyu-youtube-transcript
Version: 1.103.1

The skill is a comprehensive YouTube transcript downloader that uses the InnerTube API with fallbacks to yt-dlp. It includes robust logic for parsing various subtitle formats (JSON3, WebVTT, SRT, XML), handling bot detection via multiple client identities (Android, iOS, Web), and caching metadata and images locally. While it utilizes potentially risky capabilities like shell execution (spawnSync for yt-dlp) and browser cookie access, these are strictly aligned with its stated purpose and include proper input sanitization (e.g., regex-based video ID extraction in shared.ts) to prevent exploitation. The prompt instructions in SKILL.md and speaker-transcript.md are task-oriented and do not exhibit malicious intent.
Capability Assessment
Purpose & Capability
The name/description (download YouTube transcripts and cover images) matches the included scripts and runtime instructions. Required binaries (bun or npx) are only for running the provided TypeScript scripts; no unrelated credentials or config paths are requested.
Instruction Scope
Instructions stay within the stated purpose but explicitly perform network requests to YouTube (InnerTube) and write output to a local cache/output directory (default: ./youtube-transcript). They also describe a fallback to yt-dlp and the ability to pass browser cookies to yt-dlp. These behaviors are expected for a transcript downloader but you should be aware the skill will: fetch HTML, extract an InnerTube API key from the page, call YouTube endpoints, download thumbnails, and create files under the chosen output directory.
Install Mechanism
There is no install spec (instruction-only in the registry), and the included source is executed via bun or npx. No remote archives or arbitrary downloads are performed by the installer. This is a low-risk install model; runtime network activity occurs when you run the script.
Credentials
The skill declares no required environment variables. It documents a single optional env var (YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER) used for yt-dlp cookies-from-browser fallback; this is reasonable and proportional to its stated fallback behavior. No unrelated secrets or cloud credentials are requested.
Persistence & Privilege
always: false and normal autonomous invocation are used. The skill writes cached files and thumbnails into a local directory it controls (youtube-transcript by default) and updates a local index (.index.json). It does not request system-wide privileges or modify other skills.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install baoyu-youtube-transcript
  3. After installation, invoke the skill by name or use /baoyu-youtube-transcript
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.103.1 (2026-04-13)
Fixes
  • `baoyu-markdown-to-html`: decode HTML entities and strip tags from article summary
  • `baoyu-post-to-weibo`: decode HTML entities and strip tags from article summary

v1.103.0 (2026-04-12)
Features
  • baoyu-diagram: add multi-diagram mode for article-wide diagram generation
Fixes
  • baoyu-article-illustrator: prevent color names and hex codes from appearing as visible text in generated images
  • baoyu-cover-image: prevent color names and hex codes from appearing as visible text in generated images
  • baoyu-image-cards: prevent color names from appearing as visible text in generated images
  • baoyu-post-to-wechat: decode HTML entities and strip tags from article summary

v1.89.2
  • Auto-retry with yt-dlp fallback when InnerTube returns empty transcript snippets

v1.81.0
Features
  • Add yt-dlp fallback when YouTube blocks direct InnerTube API, with alternate client identity retry and cookie support via `YOUTUBE_TRANSCRIPT_COOKIES_FROM_BROWSER` env var
Refactor
  • Split monolithic script into typed modules (youtube, transcript, storage, shared, types) and add unit tests

v1.76.1 (2026-03-21)
Documentation
  • Fix zsh glob issue: always single-quote YouTube URLs when running the script
Metadata
Slug baoyu-youtube-transcript
Version 1.103.1
License MIT-0
All-time Installs 11
Active Installs 11
Total Versions 5
Frequently Asked Questions

What is Baoyu Youtube Transcript?

Downloads YouTube video transcripts/subtitles and cover images by URL or video ID. Supports multiple languages, translation, chapters, and speaker identifica... It is an AI Agent Skill for Claude Code / OpenClaw, with 708 downloads so far.

How do I install Baoyu Youtube Transcript?

Run "/install baoyu-youtube-transcript" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Baoyu Youtube Transcript free?

Yes, Baoyu Youtube Transcript is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Baoyu Youtube Transcript support?

Baoyu Youtube Transcript is cross-platform and runs anywhere OpenClaw or Claude Code is available.

Who created Baoyu Youtube Transcript?

It is built and maintained by Jim Liu 宝玉 (@jimliu); the current version is v1.103.1.
