Description

Add word-level highlighted subtitles to local short videos using Whisper word timestamps and Remotion rendering.

README (SKILL.md)

Remotion Word Highlight Subtitles

Name: Remotion Word Highlight Subtitles
Author: 0x00000003

Overview

This skill turns a local video into a subtitled video using the reusable "fine version": Whisper word timestamps plus Remotion-rendered current-word highlighting. Use this instead of plain SRT burn-in unless the user explicitly asks for simple static subtitles.

Detailed usage guide, README, and effect preview: https://github.com/0x00000003/remotion-word-highlight-subtitles

Trigger Phrases

Use this skill for requests like:

"给这个视频添加字幕：/path/to/video.mp4"
"给这个视频加逐词高亮字幕"
"按之前 Remotion 那个字幕方案处理这个视频"
"重新转写 word timestamps，然后加当前词高亮字幕"

If the user only gives one video path, write the output next to the source video.

Defaults

Source: local .mp4, .mov, .m4v, or audio/video file with usable audio.
Output: \x3Csource-stem>_remotion逐词高亮字幕.mp4 in the same directory unless the user names another target.
Transcription: Whisper with word timestamps, Chinese by default: --language zh --word_timestamps True --output_format json.
Transcript QA: mandatory before rendering. Build a tiny glossary from the user prompt, folder/file names, visible context, and topic; correct obvious ASR mistakes in product/model/person/place names, English terms, numbers, and homophones.
Caption data: Remotion caption chunks built from Whisper words, with token-level startMs and endMs. Long Whisper segments must be split by punctuation, timing gaps, duration, or length before rendering.
Caption position: keep all captions in a screenshot-friendly lower band, slightly above the bottom UI area. Start around height * 0.28 bottom padding, then adjust only if the source framing clearly needs it.
Visual style: bold Chinese UI font, white base text, current token yellow, optional sparse keyword accent, and black shadow/outline for readability.
Verification: check the rendered file exists, keeps audio, matches the source duration closely, and visually inspect stills before accepting the style.

Workflow

Inspect the source with ffprobe for width, height, fps, duration, and audio presence.
Build a small domain glossary before transcription. Include likely product names, model names, people, places, mixed Chinese/English terms, numbers, and terms hinted by the folder or filename.
Run Whisper word timestamp transcription. Prefer a cached/local model that works on the machine; turbo is a good default when available. Use --initial_prompt when the glossary contains high-risk names or English terms.
Do a mandatory transcript QA pass before rendering. Read every segment, compare it with the video's topic/context, and create a correction map for obvious ASR mistakes. Do not render while known mistakes remain.
Convert the Whisper JSON to public/captions.json using scripts/whisper_json_to_captions.py, passing corrections with --replace or --replace-phrase. If Whisper returns long paragraphs, keep the helper's caption splitting enabled.
Build or reuse a small Remotion project in the video's folder. Copy the video to public/input.mp4 or encode an H.264 compatibility copy as public/input-h264.mp4.
Set the Remotion composition to the exact source width, height, fps, and duration in frames.
Before the full render, render and inspect at least two stills from the Remotion composition: one long/two-line caption and one dense keyword/current-token caption. Fix muddy text, excessive outline, large dark backing, bad line breaks, poor placement, or remaining typo/wrong-character subtitles before rendering the final video.
Render with Remotion to the output path next to the source.
Verify the final output with ffprobe; extract and inspect a still from the rendered video to confirm the final encoded file kept the approved subtitle look.

Whisper Command

Use this shape, adjusting model and paths as needed:

whisper "/absolute/path/video.mp4" --model turbo --language zh --word_timestamps True --output_format json --output_dir "/absolute/path"

For names and mixed Chinese/English topics, add a short prompt rather than relying on Whisper to infer them:

whisper "/absolute/path/video.mp4" \
  --model turbo \
  --language zh \
  --word_timestamps True \
  --output_format json \
  --initial_prompt "本视频可能出现这些词：Cursor、Kimi 2.5、马斯克、AI大模型、转推、套壳。" \
  --output_dir "/absolute/path"

If Whisper fails on the video container, extract audio first:

ffmpeg -y -i "/absolute/path/video.mp4" -vn -ac 1 -ar 16000 "/absolute/path/video_audio.wav"

Then run Whisper on the WAV and keep the output naming clear.

Mandatory Transcript QA

After Whisper finishes, print or read the segment transcript before building captions. Treat this as a required gate, not a nice-to-have.

Check especially:

product, model, company, person, and place names
mixed Chinese/English terms such as Cursor, Kimi 2.5, ChatGPT, Claude, OpenAI
numbers, dates, model versions, and acronyms
Chinese homophones that are plausible but wrong in context, such as "却" versus "圈" or "死腿" versus "转推"
low-confidence words if the Whisper JSON includes probabilities

Create a short correction map and apply it before rendering. Use --replace for single-token fixes and --replace-phrase for words split across adjacent Whisper tokens.

Example:

python3 scripts/whisper_json_to_captions.py \
  "/absolute/path/transcript.json" \
  "/absolute/path/remotion-project/public/captions.json" \
  --replace-phrase "科舍=Cursor" \
  --replace-phrase "KMI 2.5=Kimi 2.5" \
  --replace-phrase "Kimi 2.5=Kimi 2.5" \
  --replace-phrase "AI却=AI圈" \
  --replace-phrase "死腿=转推" \
  --merge-term "Cursor" \
  --merge-term "Kimi 2.5" \
  --keyword "AI" \
  --keyword "Kimi 2.5"

If a correction is uncertain, prefer rerunning Whisper with a better --initial_prompt or inspect the relevant audio/video moment before deciding. Report the correction map in the final response.

Caption JSON

Remotion should load public/captions.json with this shape:

[
  {
    "text": "我们用手机随便拍张照片",
    "startMs": 0,
    "endMs": 1600,
    "tokens": [
      { "text": "我们", "startMs": 0, "endMs": 300, "keyword": false },
      { "text": "手机", "startMs": 440, "endMs": 700, "keyword": true }
    ]
  }
]

Use the helper script:

python3 scripts/whisper_json_to_captions.py \
  "/absolute/path/transcript.json" \
  "/absolute/path/remotion-project/public/captions.json" \
  --keyword "提示词" \
  --keyword "Codex" \
  --replace-phrase "错识别词=正确词" \
  --max-caption-chars 28 \
  --max-caption-duration-ms 4200 \
  --split-gap-ms 260 \
  --min-punctuation-caption-ms 900

Run the transcript QA pass before this conversion command. If any correction changes adjacent tokens into one display term, also pass that final term with --merge-term or a same-text --replace-phrase so the highlight appears as a clean word instead of broken characters.

The helper splits long Whisper segments by visible length, duration, punctuation, and word timing gaps. Keep this behavior on for short-video subtitles; otherwise a single Whisper paragraph can become an unreadable multi-line caption.

Remotion Caption Layer Requirements

The caption layer should:

Use OffthreadVideo for the source video.
Load captions.json with delayRender, continueRender, and staticFile.
Find the active caption by currentMs >= startMs && currentMs \x3C endMs.
Highlight the active token when currentMs is within the token's timing.
Use keyword coloring only as a secondary accent; the current spoken token is the main effect.
Keep letterSpacing: 0.
Keep all captions at the normal position; do not include special screenshot-sentence placement unless the user explicitly asks.
Do not use WebkitTextStroke as a thick Chinese subtitle outline. It easily eats the white fill at small resolutions. Prefer multi-direction textShadow; if WebkitTextStroke is used at all, keep it at or below 1.5px and verify a still.
Do not use a large semi-transparent rounded black rectangle behind the whole caption by default. If the footage truly needs backing, use a very subtle per-line backing, opacity \x3C= 0.16, tight padding, and verify it does not look like a dark banner.

Reject and revise any still where the caption has muddy/gray text, a thick black halo, a large black box, clipped words, awkward wrapping, or placement over the mouth/chin in a talking-head video.

Use these style constants as the baseline:

const captionBottom = Math.round(height * 0.28);
const captionFontSize = Math.round(height * 0.032);
const captionMaxWidth = Math.round(width * 0.88);
const activeColor = "#FFE45C";
const keywordColor = "#D6FFF8";
const outlinePx = Math.min(3, Math.max(1.5, captionFontSize * 0.055));

Use a clean shadow outline rather than a heavy stroke:

const textShadow = [
  `${outlinePx}px 0 0 rgba(0, 0, 0, 0.96)`,
  `-${outlinePx}px 0 0 rgba(0, 0, 0, 0.96)`,
  `0 ${outlinePx}px 0 rgba(0, 0, 0, 0.96)`,
  `0 -${outlinePx}px 0 rgba(0, 0, 0, 0.96)`,
  `0 ${outlinePx * 1.4}px ${outlinePx * 1.4}px rgba(0, 0, 0, 0.72)`,
].join(", ");

For a 720x1280 vertical video, this is roughly paddingBottom: 358, fontSize: 41, maxWidth: 634, and a 2px outline. For the original 2160x2974 source that inspired this skill, the equivalent bottom padding was about 830.

Keyword Handling

If the user does not specify keywords, infer a small set from the transcript:

product/tool names: Codex, Remotion, Whisper
content objects: 提示词, 照片, 手机, 封面
place names or nouns that carry the video's hook
numbers or outcomes that are important to the promise

Keep keyword highlighting sparse. Too many colored words makes the active yellow word less clear.

Output Naming

Prefer:

\x3Csource-stem>_remotion逐词高亮字幕.mp4

If iterating for the same source, add _v2, _v3, etc. Do not overwrite the user's source file.

Final Response

Tell the user the output path, mention that it used the reusable Remotion + Whisper word timestamp flow, and include the transcript correction map or say no corrections were needed. If any verification step could not be run, say that plainly.

Usage Guidance

This skill looks reasonable for local subtitle generation. Before installing or using it, make sure the required local tools are from trusted sources, provide only the video path you want processed, confirm where output files will be written, and delete intermediate transcript/caption files if the video contains sensitive speech.

Capability Analysis

Type: OpenClaw Skill Name: remotion-word-highlight-subtitles Version: 0.1.3 The skill bundle provides a legitimate workflow for generating word-level highlighted subtitles using Whisper and Remotion. The included Python script (scripts/whisper_json_to_captions.py) is a utility for transforming transcription data, and the SKILL.md instructions guide the agent through standard video processing steps (ffprobe, whisper, ffmpeg). No evidence of data exfiltration, malicious persistence, or harmful prompt injection was found.

Capability Assessment

✓ Purpose & Capability

The declared purpose—adding word-level highlighted subtitles to local videos—matches the required tools, workflow, README, and bundled caption-conversion script.

ℹ Instruction Scope

The skill instructs the agent to run ffprobe, Whisper, ffmpeg, Python, and Remotion-related commands and to write outputs next to the source video. This is purpose-aligned but should remain tied to a user-provided video path.

ℹ Install Mechanism

There is no install spec, and the registry source/homepage are not provided. The workflow depends on locally installed binaries and Remotion/npm project dependencies, so users should ensure those tools are trusted.

ℹ Credentials

The skill reads local video/audio content and may use folder/file names or visible context to improve transcription. This is proportionate for subtitle generation but may involve private media content.

ℹ Persistence & Privilege

The skill creates persistent local artifacts such as Whisper JSON, captions.json, copied/encoded input video files, and rendered output video. No credentials, background service, or privilege escalation are shown.

Version History

v0.1.3

Make transcript QA mandatory before rendering; add cross-token phrase corrections and readable splitting for long Whisper segments. Details: https://github.com/0x00000003/remotion-word-highlight-subtitles

v0.1.2

Use the GitHub-hosted effect preview image in README while keeping the GitHub guide link in the skill instructions.

v0.1.1

Add GitHub README and effect preview link to the skill instructions.

v0.1.0

Initial public release: Whisper word timestamps to Remotion current-word highlighted subtitles.

Metadata

Slug remotion-word-highlight-subtitles

Version 0.1.3

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 4

Frequently Asked Questions

What is Remotion Word Highlight Subtitles?

Add word-level highlighted subtitles to local short videos using Whisper word timestamps and Remotion rendering. It is an AI Agent Skill for Claude Code / OpenClaw, with 91 downloads so far.

How do I install Remotion Word Highlight Subtitles?

Run "/install remotion-word-highlight-subtitles" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Remotion Word Highlight Subtitles free?

Yes, Remotion Word Highlight Subtitles is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Remotion Word Highlight Subtitles support?

Remotion Word Highlight Subtitles is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created Remotion Word Highlight Subtitles?

It is built and maintained by 0x00000003 (@0x00000003); the current version is v0.1.3.

More Skills

Remotion Word Highlight Subtitles

Remotion Word Highlight Subtitles

Overview

Trigger Phrases

Defaults

Workflow

Whisper Command

Mandatory Transcript QA

Caption JSON

Remotion Caption Layer Requirements

Keyword Handling

Output Naming

Final Response

What is Remotion Word Highlight Subtitles?

How do I install Remotion Word Highlight Subtitles?

Is Remotion Word Highlight Subtitles free?

Which platforms does Remotion Word Highlight Subtitles support?

Who created Remotion Word Highlight Subtitles?

💬 Comments