Wjs Overlaying Video
/install wjs-overlaying-video
wjs-overlaying-video
Post-production for a video clip: cover, captions, illustrations, CTA, custom motion graphics — all composed in ONE HyperFrames project and rendered in a SINGLE final encode. No cascade of decodes/re-encodes (each cascade pass degrades quality and burns time).
When to use
- Downstream of
/wjs-segmenting-video— the segmentation skill hands you cropped clips + per-clip SRTs; this skill turns them into upload-ready MP4s with cover/captions/illustrations/CTA. - User has a finished video and wants to dress it up with motion graphics: opening hook, key-quote callout, closing slogan, chapter cards, AI-generated cover as first frame.
- User wants HTML/CSS-quality captions on a video (kinetic word-by-word highlighting, custom fonts, large outlined text, seekable per cue).
- User wants illustration overlays at specific hook moments — diagrams, big text emphasis, flow charts.
Don't use for:
- Splitting one long video into clips → use
/wjs-segmenting-video. - Creating the source SRT → use
/wjs-transcribing-audio(then/wjs-translating-subtitlesif you need a different language). - Full HyperFrames productions where the source isn't a fixed video →
use
hyperframesdirectly. - 微信视频号 / 抖音 upload (no public API for those) → this skill produces the MP4; upload is manual.
What this skill IS — and IS NOT
| Is | Is not |
|---|---|
| Everything that goes ON TOP of a video clip: cover, caption, chapter, illustration, CTA | Cutting / cropping a video (that's /wjs-segmenting-video + /wjs-reframing-video) |
| One HyperFrames composition per clip = ONE final encode | A multi-step decode/encode cascade |
cover is the literal first frame of the output (platforms auto-pick it as thumbnail) |
A separate thumbnail file the user uploads alongside |
Captions are HTML/CSS — -webkit-text-stroke for white-on-anything readability |
libass burn-in (deprecated) |
Illustrations: re-usable stack / hammer patterns + custom escape hatch |
One bespoke HTML/CSS per illustration without re-use |
| AI covers regenerated at native target aspect (1024×1792 for vertical, 1536×1024 for horizontal) | Single 1024×1536 default that letterboxes or crops on the platform |
The pipeline
clip.mp4 + clip.zh-CN.burn.srt (from /wjs-segmenting-video hand-off)
↓
1. (Optional) Generate AI cover via gpt-image-2
make_cover.py --segments S.json --out output/ --size 1024x1792
cover_NN_slug.png
2. Scaffold a HyperFrames project per clip
hf_clip_NN/1080/{index.html, clip.mp4, cover.png, captions.json}
3. Compose: cover scene + body video + caption track + chapter chip
+ 1-2 illustrations at hook moments + CTA scene
4. npm run check (lint + validate + visual inspect)
npm run render → upload-ready MP4
A 2-minute vertical 1080×1920 composition renders in ~2-3 min on M-series Mac.
Standard overlay types (the 6 building blocks)
Every clip's final composition is built from some combination of these. The agent picks the right ones per clip — typically all 6 for a podcast highlight, or just 1-2 for a single annotation overlay.
1. cover — full-frame AI image as first frame
The cover IS the first frame (no animation, no zoom) so platforms that
auto-pick the first frame as the thumbnail get your designed cover by
default. Always verify with ffmpeg -ss 0 -vframes 1 — frame 0
must NOT be black or platform thumbnails will be black.
HTML:
\x3Cdiv id="cover" class="clip" data-start="0" data-duration="1.6"
data-track-index="1" data-layout-allow-overflow>
\x3Cimg src="cover.png" alt="" data-layout-allow-overflow />
\x3C/div>
CSS:
#cover { position: absolute; inset: 0; background: #0c0d10; overflow: hidden; }
#cover img { position: absolute; inset: 0; width: 100%; height: 100%; object-fit: cover; }
Generation: use /wjs-segmenting-video/scripts/make_cover.py
(wraps gpt-image-2 images edit with the midpoint frame as ref):
# For 1080×1920 vertical output (视频号 / 抖音):
make_cover.py --segments S.json --out output/ --size 1024x1792 [--single N]
# For 1920×1080 horizontal output (YouTube / B站):
make_cover.py --segments S.json --out output/ --size 1536x1024
Aspect must match output frame. --size 1024x1536 (2:3, the
script default) gets letterboxed or cropped on 9:16 output — always
pass 1024x1792 for vertical. The cover image's aspect is what the
viewer sees full-frame, so mismatch is visible. Re-roll one with
--single N; codex provider can transient-fail mid-batch.
Codex auth required: the script calls codex CLI via
gpt-image-2-skill. If ~/.codex/auth.json is missing, the script
errors. See gpt-image-2-skill for setup.
2. caption — outlined HTML/CSS captions synced to SRT
White text with thick black stroke, no bubble background, vertically centered in a fixed zone (so 1-line vs 2-line captions don't make the visual center jump up and down).
HTML:
\x3Cdiv id="caption" class="clip" data-start="{body_start}"
data-duration="{body_dur}" data-track-index="4">\x3C/div>
CSS (vertical 1080×1920):
#caption {
position: absolute; left: 0; right: 0; bottom: 240px;
height: 240px; z-index: 10; overflow: visible;
}
#caption .bubble {
position: absolute; top: 50%; left: 50%;
display: inline-block;
padding: 0 24px;
font-size: 56px; line-height: 1.18; font-weight: 900;
color: #ffffff; max-width: 1020px; text-align: center;
-webkit-text-stroke: 5px #000;
paint-order: stroke fill;
text-shadow: 0 6px 12px rgba(0,0,0,0.55), 0 0 4px rgba(0,0,0,0.6);
letter-spacing: 0.01em;
}
JS (one bubble per cue + GSAP fade in/out, all centered at container midpoint):
// SRT cues are loaded as inline JSON. Each cue's start/end is offset
// by the cover-scene duration (e.g., 1.5s) so the timing aligns with
// the composition timeline (not the body's own t=0).
const captionEl = document.getElementById("caption");
const groups = JSON.parse(document.getElementById("captions-data").textContent);
const bubbles = groups.map((g, i) => {
const b = document.createElement("span");
b.className = "bubble"; b.id = "cap-" + i;
b.textContent = g.text; b.style.opacity = "0";
captionEl.appendChild(b);
return b;
});
// GSAP xPercent/yPercent for centering (CSS transform would get
// overwritten the moment we tween y).
gsap.set(bubbles, { xPercent: -50, yPercent: -50 });
groups.forEach((g, i) => {
const el = bubbles[i];
tl.fromTo(el, { opacity: 0, y: 12 }, { opacity: 1, y: 0, duration: 0.18, ease: "power2.out" }, g.start);
const exitStart = Math.max(g.start + 0.18, g.end - 0.12);
tl.to(el, { opacity: 0, duration: 0.12, ease: "power2.in" }, exitStart);
tl.set(el, { opacity: 0 }, g.end);
});
Source SRT — slice + shift before inlining. Take
clip_NN.zh-CN.burn.srt from segmentation, parse each cue, add the
cover duration to every start/end, and inline as JSON in a
\x3Cscript id="captions-data" type="application/json"> block.
MarginV / position notes:
- Vertical (1080×1920):
bottom: 240pxkeeps captions clear of the 视频号/抖音 bottom UI overlay (likes/comments/share buttons). - Horizontal (1920×1080):
bottom: 100px,font-size: 48px,-webkit-text-stroke: 4pxis a reasonable default.
Caption length cap. If a single cue exceeds ~18 Chinese chars on
1080-wide at 56px, it wraps to 2 lines awkwardly. This is upstream
discipline — /wjs-translating-subtitles should cap cues at ~18 chars
using word-gap split + punctuation split. If you receive longer cues,
either reduce font-size to 48px or accept the wrap.
3. chapter — top-left chapter chip (4s reveal then fade)
A subtle badge identifying the segment. Enters at body start, fades after a few seconds so it doesn't compete with the rest of the composition.
HTML:
\x3Cdiv id="chapter" class="clip" data-start="{body_start}"
data-duration="{body_dur}" data-track-index="3">
\x3Cspan class="dot">\x3C/span>
\x3Cspan class="text">第一段 · 自然语言才是新代码\x3C/span>
\x3C/div>
CSS:
#chapter {
position: absolute; top: 80px; left: 60px; z-index: 9;
display: inline-flex; align-items: center; gap: 12px;
padding: 12px 20px;
background: rgba(12,13,16,0.78);
border: 1px solid rgba(199,150,85,0.4);
border-radius: 999px;
}
#chapter .dot { width: 10px; height: 10px; border-radius: 999px; background: #e8b063; }
#chapter .text {
font-size: 24px; color: #f4f4f5; letter-spacing: 0.04em; font-weight: 600;
}
GSAP:
tl.from("#chapter", { x: -40, opacity: 0, duration: 0.5, ease: "expo.out" }, body_start + 0.4);
tl.to("#chapter", { opacity: 0, duration: 0.4, ease: "power2.in" }, body_start + 4.0);
4. stack illustration — top-right vertical list card
A list of items (e.g., language hierarchy, workflow steps, levels) in a dark card at the top-right. One item can be accented in amber to highlight the relevant level/step.
Use for: showing a hierarchy or list while the speaker explains it. Card stays visible 8-50s.
HTML:
\x3Cdiv id="ill-stack" class="clip" data-start="{start}" data-duration="{dur}" data-track-index="5">
\x3Cdiv class="ill-card">
\x3Cdiv class="ill-card-label">我们写的层级\x3C/div>
\x3Cdiv class="ill-row">\x3Cspan class="ill-tag accent">自然语言\x3C/span>\x3C/div>
\x3Cdiv class="ill-row">\x3Cspan class="ill-tag">Python\x3C/span>\x3C/div>
\x3Cdiv class="ill-row">\x3Cspan class="ill-tag">C\x3C/span>\x3C/div>
\x3Cdiv class="ill-row">\x3Cspan class="ill-tag">Assembly\x3C/span>\x3C/div>
\x3C/div>
\x3C/div>
CSS: (see references/illustration_patterns.md for the full
canonical CSS — copy verbatim)
GSAP — slide in from right + stagger rows:
tl.fromTo("#ill-stack", { x: 360, opacity: 0 }, { x: 0, opacity: 1, duration: 0.6, ease: "expo.out" }, start + 0.2);
tl.from("#ill-stack .ill-row", { y: 20, opacity: 0, duration: 0.4, stagger: 0.12, ease: "power2.out" }, start + 0.4);
tl.to("#ill-stack", { x: 360, opacity: 0, duration: 0.5, ease: "power2.in" }, end - 0.5);
5. hammer illustration — center-frame big equation/text overlay
A BIG center-frame text/equation that visually "hammers" a key claim. Best for the single most quotable moment in a clip (e.g., "LLM = 编译器", "Token = 新 GDP", "AI ≠ 更快的轿子"). Visible 4–8s.
HTML:
\x3Cdiv id="ill-hammer" class="clip" data-start="{start}" data-duration="{dur}" data-track-index="6">
\x3Cdiv class="ill-h-content">
\x3Cdiv class="ill-h-eq">
\x3Cspan class="ill-h-left">LLM\x3C/span>
\x3Cspan class="ill-h-equals">=\x3C/span>
\x3Cspan class="ill-h-right">新编译器\x3C/span>
\x3C/div>
\x3Cdiv class="ill-h-foot">自然语言 → Python → 汇编\x3C/div>
\x3C/div>
\x3C/div>
GSAP — scale-pop entrance + stagger each piece + scale-fade exit:
tl.fromTo("#ill-hammer", { scale: 0.85, opacity: 0 },
{ scale: 1.0, opacity: 1, duration: 0.45, ease: "back.out(1.6)" }, start);
tl.from("#ill-hammer .ill-h-left", { x: -40, opacity: 0, duration: 0.4, ease: "expo.out" }, start + 0.2);
tl.from("#ill-hammer .ill-h-equals", { scale: 0, opacity: 0, duration: 0.4, ease: "back.out(2)" }, start + 0.4);
tl.from("#ill-hammer .ill-h-right", { x: 40, opacity: 0, duration: 0.4, ease: "expo.out" }, start + 0.6);
tl.from("#ill-hammer .ill-h-foot", { y: 20, opacity: 0, duration: 0.4, ease: "power2.out" }, start + 0.8);
tl.to("#ill-hammer", { scale: 1.05, opacity: 0, duration: 0.45, ease: "power2.in" }, end - 0.45);
(see references/illustration_patterns.md for full canonical CSS)
6. cta — end-card with channel CTA
A branded outro for the final 3 seconds. Use 王建硕 as the channel name (per global instructions) — never put a guest's name in the CTA slot.
HTML:
\x3Cdiv id="cta" class="clip" data-start="{cta_start}" data-duration="3.24" data-track-index="1">
\x3Cdiv class="cta-line-1">关注王建硕\x3C/div>
\x3Cdiv class="arrow">↓\x3C/div>
\x3Cdiv class="cta-line-2">微信公众号 · 视频号\x3C/div>
\x3Cdiv class="cta-foot">AI 炼金术 · 持续更新\x3C/div>
\x3C/div>
CSS / GSAP: see references/illustration_patterns.md.
Legacy types (for one-off overlays on a single video)
The spec.json + scaffold.py workflow also supports these older
overlay types — useful when you want to dress up ONE existing video
without going through the full post-production workflow above:
quote— full-width kinetic typography, top or bottom gradient. Best for opening hooks and key-quote callouts.slogan— alias forquotewithposition: bottomand larger type. Best for closing slogans.callout— small annotation panel in a corner. Best for chapter labels, lower-thirds, "as seen in" notes.custom— escape hatch. Claude writes the overlay's HTML/CSS/GSAP inside anoverlays/\x3Cname>.htmlfragment file. Seereferences/custom_overlay_recipes.md.
Workflow A — Post-segmentation preset (most common)
Use this when you're coming directly from /wjs-segmenting-video
and want the standard cover + caption + chapter + illustrations + CTA
treatment for each clip.
Step 1 — Generate AI covers at the right aspect
# For vertical 9:16 output (视频号 / 抖音):
python3 ~/.claude/skills/wjs-segmenting-video/scripts/make_cover.py \
--segments segments.json --out output/ --size 1024x1792 --single 1
# Verify segment 1's cover; then batch:
python3 ~/.claude/skills/wjs-segmenting-video/scripts/make_cover.py \
--segments segments.json --out output/ --size 1024x1792
Step 2 — For each clip, scaffold a HyperFrames project
hf_clip_NN/1080/ with:
index.html— the composition (from template; seereferences/post_segmentation_template.html)clip.mp4— copied fromoutput/clip_NN_slug.mp4cover.png— copied fromoutput/cover_NN_slug.pngcaptions.json— generated fromoutput/clip_NN_slug.zh-CN.burn.srtwith every cue's start/end shifted by +cover_duration (so cues align with the composition timeline, not the body's own clock)
The build script at references/build_hf_clips.py does this for all
segments in one pass. It reads segments.json + an
ILLUSTRATIONS dict (illustrations per clip, see Step 3) + the
template, and emits 5 ready-to-render projects.
Step 3 — Define illustrations per clip
For each clip, identify 1-2 hook moments and pick stack or hammer:
ILLUSTRATIONS = {
1: [
# The language hierarchy as a stack card during the opening
{"key": "stack", "pattern": "stack", "body_start": 0.3, "body_end": 9.0,
"label": "我们写的层级",
"rows": [
{"text": "自然语言", "accent": True},
{"text": "Python", "accent": False},
{"text": "C", "accent": False},
{"text": "Assembly", "accent": False},
]},
# The hammer at the most quotable moment
{"key": "hammer", "pattern": "hammer", "body_start": 10.8, "body_end": 14.6,
"left": "LLM", "equals": "=", "right": "新编译器",
"foot": "自然语言 → Python → 汇编"},
],
# ... clips 2-5
}
Timestamps are body-relative (after the cover-scene duration); the build script adds the cover offset when emitting GSAP positions.
Step 4 — Build + render
python3 references/build_hf_clips.py # scaffolds all projects
for n in 01 02 03 04 05; do
cd "hf_clip_$n/1080"
npx hyperframes lint
npx hyperframes validate
npx hyperframes render
cd ../..
done
A 2:30 clip renders in ~3 min. Output: hf_clip_NN/1080/renders/*.mp4.
Workflow B — Custom overlays on a single video (legacy spec.json)
Use this when you have ONE existing video and want to add a few ad-hoc overlays (title cards, annotations, lower-thirds).
spec.json schema
{
"source_video": "../path/to/source.mp4",
"duration": 135.4,
"size": "1920x1080",
"name": "clip_01_animated",
"overlays": [
{"id": "o1", "type": "quote", "start": 8.0, "duration": 6.0,
"position": "top", "lines": ["代码不存在错误", "只存在意图错配"],
"accent": [false, true]},
{"id": "o2", "type": "callout", "start": 30.0, "duration": 5.0,
"anchor": "top-right", "text": "FRP 概念"},
{"id": "o3", "type": "slogan", "start": 122.0, "duration": 13.4,
"lines": ["改 prompt", "不改 AI 生成的代码"], "accent": [false, true]}
]
}
| Field | Required | Notes |
|---|---|---|
source_video |
Yes | Path to source MP4. Symlinked into the project as source.mp4. |
duration |
Yes | Total composition length in seconds — match the source video. |
size |
No | WIDTHxHEIGHT (default 1920x1080). |
overlays[].type |
Yes | quote, slogan, callout, or custom. |
overlays[].start |
Yes | Start time in seconds. |
overlays[].duration |
Yes | How long the overlay is on screen. |
Scaffold + render
python3 ~/.claude/skills/wjs-overlaying-video/scripts/scaffold.py spec.json
cd \x3Cname> && npm run check && npm run render
Output checklist
Before considering a clip done:
- Frame 0 is the cover (not black) —
ffmpeg -ss 0 -vframes 1 out.mp4 - Captions are synced with audio (lint a few seconds with audio playback)
- All illustrations enter and exit at the speech moments they support
- CTA renders correctly (
关注王建硕, not a guest's name) -
npx hyperframes lint && npx hyperframes validateboth pass -
npx hyperframes inspectshows no layout overflow - Total duration matches the source clip + cover + CTA durations
Common mistakes
- Cover aspect ≠ output aspect.
1024x1536(the default make_cover.py size) is 2:3 and gets letterboxed or cropped on 9:16 output. Always pass--size 1024x1792for vertical. - Caption alignment jumps with line count. Anchor by CENTER (translate(-50%, -50%)) inside a fixed-height container so 1-line vs 2-line cues share the same visual midline. NOT anchored from bottom (causes growth-upward).
- GSAP overwrites CSS transform centering. If you set
transform: translate(-50%, -50%)in CSS and then tweeny, GSAP replaces the transform and centering breaks. Usegsap.set(el, { xPercent: -50, yPercent: -50 })instead so xPercent/yPercent compose with subsequent y/x tweens. - Burning libass subs on top of HTML/CSS captions. Pick ONE caption
system per output video. If you're using this skill's HTML/CSS
captions, do NOT also burn subs in
/wjs-segmenting-video— request the raw clip via the hand-off package. - Frame 0 is black. If your cover scene has an opacity fade-in
starting from 0, the literal first frame is black and the platform
thumbnail will be black. Place the cover statically (no opacity
tween) and verify with
ffmpeg -ss 0 -vframes 1. - Channel name in CTA = guest's name. Always use
王建硕. Guests belong in description text inside the metadata, not in the on-screen CTA. - Cover image cropped because of object-fit: cover on mismatched
aspect. Either regenerate the cover at the right aspect (see Step
- or letterbox with
object-fit: contain+ dark background.
- or letterbox with
Integration with other skills
/wjs-segmenting-video— the typical upstream. After it cuts- crops + slices SRTs, this skill picks up. The hand-off package is
clip_NN.mp4+clip_NN.zh-CN.burn.srt+segments.json.
- crops + slices SRTs, this skill picks up. The hand-off package is
/wjs-transcribing-audio+/wjs-translating-subtitles— if no SRT exists, run them first. The word-level Whisper or Volcano/豆包 ASR output is preferred for accurate cue timing.hyperframes— the underlying composition framework. This skill is a thin wrapper that encodes the proven post-production patterns; everything in thehyperframesskill applies (preview, render, transitions, audio-reactive, etc.). Read it whenever you writecustomoverlays.hyperframes-cli— the CLI commands the project uses (init,lint,validate,inspect,render).gpt-image-2-skill— the cover generator.make_cover.pyinvokes it via the codex CLI; the codex auth in~/.codex/auth.jsonis required./wjs-uploading-video— the next downstream after this skill produces an MP4. Uploads the renders to YouTube with title / description / tags from a metadata file.
Files & references
scripts/scaffold.py— Workflow B scaffolder (legacy spec.json for ad-hoc overlays)references/post_segmentation_template.html— Workflow A template: the canonicalcover + caption + chapter + illustration + CTAcomposition shape, with placeholder substitutionsreferences/build_hf_clips.py— Workflow A multi-clip builder. Readssegments.json+ per-clip illustrations dict, scaffolds and populates one project per clipreferences/illustration_patterns.md— canonical CSS / GSAP for thestackandhammerillustration patternsreferences/custom_overlay_recipes.md— reusablecustomoverlay recipes (terminal demo, layer-stack diagram, callout with arrow)references/example_spec.json— Workflow B example
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat:
/install wjs-overlaying-video - After installation, invoke the skill by name or use
/wjs-overlaying-video - Provide required inputs per the skill's parameter spec and get structured output
What is Wjs Overlaying Video?
Use when the user has one or more video clips and wants to add post-production on top — AI-generated cover as first frame, HTML/CSS captions synced to SRT, k... It is an AI Agent Skill for Claude Code / OpenClaw, with 64 downloads so far.
How do I install Wjs Overlaying Video?
Run "/install wjs-overlaying-video" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Wjs Overlaying Video free?
Yes, Wjs Overlaying Video is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Wjs Overlaying Video support?
Wjs Overlaying Video is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).
Who created Wjs Overlaying Video?
It is built and maintained by Jian Shuo Wang (@jianshuo); the current version is v0.1.0.