Wjs Reframing Video
wjs-reframing-video
Convert a video's orientation by cropping a narrow band from the source — not by physically rotating it. The crop window follows the active speaker (the face whose mouth is moving), not just the largest or most-confident face. A .crop.json sidecar records the crop plan, the per-segment speaker decisions, and the parameters used. The original input is never modified.
When to use
- Repurposing a 16:9 podcast / interview / talk for vertical short-video platforms (WeChat Channels 视频号, Douyin 抖音, Xiaohongshu 小红书, YouTube Shorts, TikTok, Reels).
- Repurposing a 9:16 phone recording for horizontal players (YouTube long-form, blog embeds).
- Repurposing 4:3 archive footage for 3:4 mobile, or vice versa.
The output aspect is the source aspect with width and height swapped — 16:9 → 9:16, not "letterboxed 16:9 in a 9:16 frame".
When NOT to use
- Multi-person Q&A where each face needs its own crop — this skill picks one crop track per video. For per-speaker split renders, use wjs-editing-multicam instead.
- Animated content / B-roll with no faces — falls back to center crop, usually wrong for the intent.
- Heavy camera motion in the source (handheld pan/zoom) — the face tracker amplifies camera shake. Stabilize first.
- Source already at target aspect — no work to do.
What this skill IS — and IS NOT
| Is | Is not |
|---|---|
| Visual active-speaker detection via MAR (mouth-aspect-ratio) variance | Audio-visual fusion (audio energy + lip motion cross-correlated) |
| Stable face tracking across frames by center-distance matching | Re-identification across long gaps / occlusions |
| Speaker-aligned segments with hysteresis to prevent flicker | Frame-by-frame switching on every flicker |
| --face-pick speaker (default): pick whoever's mouth is moving | --face-pick largest (opt-in legacy): pick the largest face |
| Hard cuts between segments, fixed crop within each segment (--motion cut, default) | Smooth panning that drifts during a speaker's turn (opt-in --motion smooth) |
| Audio stream-copy (bit-exact) | Audio reprocessing / re-encoding |
| MediaPipe Tasks FaceLandmarker (478-pt mesh) at 5 fps sampled via ffmpeg | Per-frame neural inpainting / out-painting |
| One ffmpeg crop + scale pass | Frame-by-frame Python compositor |
Falls back to "largest face" automatically when no one is talking (silence, music-only stretches).
Dependencies
pip install mediapipe opencv-python numpy
(MediaPipe lives outside the standard Python distribution; ffmpeg and ffprobe must be on PATH.)
First-run model download: MediaPipe 0.10+ uses the Tasks API, which needs a face_landmarker.task model file (~4 MB). On the first call, crop.py downloads it to ~/.claude/skills/wjs-reframing-video/models/ and caches it for subsequent runs. The script fails offline on first run.
Range limitation: The bundled landmarker is tuned for faces within ~2 m of the camera (selfie / podcast / interview distance). Wide event shots with small faces may not detect — sample a frame first to confirm.
Crop math
Source aspect = W / H. Target aspect = H / W (inverted). Compute crop window:
| Source orientation | Crop window |
|---|---|
| Horizontal (W > H) → Portrait | W_crop = H × H / W, H_crop = H (narrow vertical band) |
| Portrait (W < H) → Horizontal | W_crop = W, H_crop = W × W / H (narrow horizontal band) |
For 1920×1080 → portrait, W_crop = 608, H_crop = 1080. Final scale to 1080×1920 (upscale ~1.78×).
For 1080×1920 → landscape, W_crop = 1080, H_crop = 608. Final scale to 1920×1080.
Override the final size via --output-size (for example, --output-size 608x1080 for the 1920×1080 case) if you want native crop dimensions instead of upscaling.
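The crop-window arithmetic above can be sketched in a few lines. This is a minimal illustration, not the code in scripts/crop.py: the function name `crop_window` is hypothetical, and rounding to the nearest even integer is an assumption (it reproduces the 608 figure quoted above and keeps dimensions friendly to 4:2:0 encoders).

```python
def crop_window(src_w: int, src_h: int) -> tuple[int, int]:
    """Return (crop_w, crop_h) for a crop whose aspect is the inverse of the source's."""
    if src_w > src_h:
        # Horizontal source: keep the full height, take a narrow vertical band.
        crop_w, crop_h = src_h * src_h / src_w, src_h
    elif src_w < src_h:
        # Portrait source: keep the full width, take a narrow horizontal band.
        crop_w, crop_h = src_w, src_w * src_w / src_h
    else:
        # Square source: inverting the aspect changes nothing.
        crop_w, crop_h = src_w, src_h

    def to_even(value: float) -> int:
        # Assumption: round to the nearest even pixel count.
        return int(round(value / 2)) * 2

    return to_even(crop_w), to_even(crop_h)
```

For a 1920×1080 source this yields a 608×1080 window, and for 1080×1920 a 1080×608 window, matching the two worked cases.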
Pipeline
- Probe input dimensions, fps, duration via ffprobe.
- Decide orientation: auto from aspect (--target portrait|landscape to override).
- Sample frames at --sample-fps (default 5; high enough to catch mouth motion. Nyquist for speech is ~10 Hz, we need at least 4–5 fps).
- Detect face landmarks per sampled frame with MediaPipe Tasks FaceLandmarker (478 landmarks). For each detected face record: center, size proxy, MAR (mouth-aspect-ratio = inner-lip vertical distance / horizontal mouth-corner distance).
- Track faces across frames by center-distance matching → each face gets a stable face_id.
- Per-sample active speaker: for each face track, compute the variance of MAR over a sliding window (--mar-var-window-sec, default 1 s). The face with the highest variance is "speaking". Below --mar-var-threshold, no one is speaking → fall back to largest face.
- Hysteresis: a candidate switch only commits if the new speaker is stable for --min-segment-sec (default 1.5 s). Shorter flickers are squashed, which prevents the crop from ping-ponging on a one-frame mis-detection.
- Speaker-aligned segments → for each segment, the mean (cx, cy) of that speaker's face over the segment becomes the crop center, fixed for the full duration of the segment.
- Build an ffmpeg step-function expression (--motion cut, default) that holds each segment's crop position constant and jumps instantly at each segment boundary, giving the visual feel of a real cut between camera angles. (--motion smooth switches to a piecewise-linear pan between segment midpoints; rarely the right call for talking-head content because the camera appears to drift mid-sentence.)
- Render one ffmpeg pass: crop=W:H:x='expr':y='expr', scale=OUT_W:OUT_H. The crop filter evaluates x and y per frame natively. Audio is stream-copied.
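The speaker-selection and hysteresis steps of the pipeline can be illustrated with a small sketch. It is not taken from scripts/crop.py: the function names, the input layout (one dict per sampled frame mapping face_id to a (MAR, face size) pair), the trailing window, the use of population variance, and back-dating a committed switch to the start of the stable run are all assumptions made for the example.

```python
from statistics import pvariance


def pick_speakers(samples, sample_fps=5.0, window_sec=1.0, threshold=1.5e-4):
    """Pick one face_id per sample: highest MAR variance, else the largest face.

    samples: list with one dict per sampled frame, {face_id: (mar, face_size)}.
    Returns a list of face_ids (None where no face was detected).
    """
    win = max(2, int(round(window_sec * sample_fps)))
    picks = []
    for i, faces in enumerate(samples):
        if not faces:
            picks.append(None)
            continue
        best_id, best_var = None, 0.0
        for fid in faces:
            # MAR history of this track inside the trailing window (assumption).
            history = [s[fid][0] for s in samples[max(0, i - win + 1): i + 1] if fid in s]
            var = pvariance(history) if len(history) >= 2 else 0.0
            if var > best_var:
                best_id, best_var = fid, var
        if best_id is None or best_var < threshold:
            # Nobody is talking: fall back to the largest face.
            best_id = max(faces, key=lambda f: faces[f][1])
        picks.append(best_id)
    return picks


def apply_hysteresis(picks, sample_fps=5.0, min_segment_sec=1.5):
    """Commit a speaker switch only after the new pick has held for min_segment_sec."""
    need = max(1, int(round(min_segment_sec * sample_fps)))
    out = []
    run_start = 0  # index where the current run of identical raw picks began
    for i, p in enumerate(picks):
        if i > 0 and p != picks[i - 1]:
            run_start = i
        run_len = i - run_start + 1
        if not out:
            out.append(p)
        elif p != out[-1] and run_len >= need:
            # Stable long enough: commit, back-dating the switch to the run start.
            out[run_start:] = [p] * (i - run_start)
            out.append(p)
        else:
            # Too short so far: keep showing the committed speaker.
            out.append(out[-1])
    return out
```

Runs of equal values in the smoothed list then map directly onto speaker-aligned segments; a one-sample blip never reaches the required run length and is absorbed into the surrounding segment.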
scripts/crop.py is the implementation. Output side effects:
- <input>.crop.json: sidecar with the crop plan
- <input>_cropped.mp4: final cropped + scaled video
Sidecar schema (<input>.crop.json)
{
"_about": "wjs-reframing-video crop plan for cam_a.MOV. Active-speaker detected via MAR variance.",
"_help": {
"source_size": "[width, height] in pixels.",
"target_size": "[width, height] of the final rendered output.",
"crop_window": "[width, height] of the moving crop in source coords.",
"chunks": "Speaker-aligned segments: {t0, t1, cx, cy, speaker_id}.",
"face_pick_mode": "speaker = MAR-variance active-speaker; largest = old behavior.",
"speaker_id": "Stable face track id. null means no face / silence fallback."
},
"schema_version": 2,
"source": "cam_a.MOV",
"source_size": [1920, 1080],
"target": "portrait",
"target_size": [1080, 1920],
"crop_window": [608, 1080],
"face_pick_mode": "speaker",
"sample_fps": 5.0,
"mar_var_window_sec": 1.0,
"mar_var_threshold": 1.5e-4,
"min_segment_sec": 1.5,
"chunks": [
{"t0": 0.0, "t1": 4.2, "cx": 808, "cy": 540, "speaker_id": 0},
{"t0": 4.2, "t1": 11.6, "cx": 1182, "cy": 540, "speaker_id": 1},
{"t0": 11.6, "t1": 14.0, "cx": 808, "cy": 540, "speaker_id": 0}
],
"face_sample_count": 1234,
"track_count": 2
}
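Because the sidecar is plain JSON, the speaker timeline can be inspected without re-running detection. The sketch below sums on-screen time per speaker_id from the chunks array and formats it in the face#N: Xs on screen (Y%) style mentioned under common pitfalls. The function names are hypothetical, and how crop.py itself computes its printed summary is not documented here, so treat this as an approximation.

```python
import json
from collections import defaultdict


def speaker_screen_time(plan: dict) -> dict:
    """Sum chunk durations (t1 - t0) per speaker_id from a parsed sidecar."""
    totals = defaultdict(float)
    for chunk in plan["chunks"]:
        totals[chunk["speaker_id"]] += chunk["t1"] - chunk["t0"]
    return dict(totals)


def format_summary(plan: dict) -> list:
    """Return one human-readable line per speaker, longest screen time first."""
    totals = speaker_screen_time(plan)
    duration = sum(totals.values())
    lines = []
    for sid, secs in sorted(totals.items(), key=lambda kv: -kv[1]):
        # speaker_id is null (None) for the no-face / silence fallback.
        label = "no face" if sid is None else f"face#{sid}"
        pct = 100 * secs / duration if duration else 0.0
        lines.append(f"{label}: {secs:.1f}s on screen ({pct:.0f}%)")
    return lines


def load_plan(path: str) -> dict:
    """Read a .crop.json sidecar from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Calling `format_summary(load_plan("cam_a.crop.json"))` on a sidecar like the example above (the file name is assumed) would list face#1 first, since its single 7.4 s segment outweighs the two face#0 segments combined.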
Performance
- Detection is the slow step. On Apple Silicon at 2 fps sampling, expect ~10–20× realtime (a 30-min source detects in ~1.5–3 min). Raising --sample-fps makes detection slower but tracking more responsive.
- Render is fast: a single ffmpeg pass with hardware encode (hevc_videotoolbox on macOS). It often takes <1× the source duration for a 1080p source.
- For very long sources (>200 chunks), the ffmpeg expression gets cumbersome; the script auto-downsamples chunk midpoints to keep the expression under ~200 control points.
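To make the expression-size concern concrete, here is a rough sketch of how a step-function crop expression might be assembled from sidecar chunks. It relies only on ffmpeg's documented expression functions if() and lt() and the crop filter's t variable; the helper names, the clamping of the window inside the frame, and the truncation to integer pixels are assumptions, and the sketch does not attempt the auto-downsampling described above.

```python
def step_expr(chunks, key, crop_len, src_len):
    """Build a nested if(lt(t,...)) expression holding one crop origin per chunk.

    key is "cx" or "cy"; crop_len and src_len are the crop and source extents
    along the same axis.
    """
    def origin(chunk):
        # Center the window on the face, then clamp it inside the source frame.
        return int(min(max(chunk[key] - crop_len / 2, 0), src_len - crop_len))

    expr = str(origin(chunks[-1]))  # the last chunk holds until the end
    for chunk in reversed(chunks[:-1]):
        # Before this chunk's end time use its origin, otherwise defer to later chunks.
        expr = f"if(lt(t,{chunk['t1']}),{origin(chunk)},{expr})"
    return expr


def crop_filter(chunks, source_size, crop_size, target_size):
    """Return a -vf string of the form crop=W:H:x='expr':y='expr',scale=OUT_W:OUT_H."""
    (src_w, src_h), (crop_w, crop_h), (out_w, out_h) = source_size, crop_size, target_size
    x = step_expr(chunks, "cx", crop_w, src_w)
    y = step_expr(chunks, "cy", crop_h, src_h)
    return f"crop={crop_w}:{crop_h}:x='{x}':y='{y}',scale={out_w}:{out_h}"
```

Each chunk adds one nesting level per axis, which is why the expression grows linearly with the number of segments and becomes unwieldy for very long sources.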
Common pitfalls
- Mouth gestures aren't speech: a yawn, laugh, eating, or sucking in air all raise MAR variance, and the detector can briefly mistake these for talking. For high-stakes content, eyeball the speaker timeline in the sidecar (the script prints a face#N: Xs on screen (Y%) summary) and re-run with a different --mar-var-threshold if needed.
- Side-profile or down-tilted faces: when a face is rotated >60° from camera, MediaPipe may fail to land mouth landmarks reliably, so MAR variance flatlines and the fallback to "largest face" kicks in. If you have a long stretch of profile shots, consider --face-pick largest.
- Two faces with overlapping speech (interruption / talking over): both faces have MAR variance, but only one wins, and the losing face is treated as a listener. For accurate per-speaker tracking under crosstalk, use wjs-editing-multicam with separate cams.
- Long stretches of silence (B-roll, music): falls back to largest face. If the largest face is wrong (e.g. a listener stays still while the speaker's mic feeds music), you'll see drift. Pre-segment around music-only sections.
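- Source has burned-in lower-thirds / subtitles: for H→V, the crop keeps the full frame height but only a narrow vertical band, so bottom text stays in frame and is cut off at the sides; for V→H, text outside the horizontal band is cropped out and anything inside it is enlarged by the upscale. Strip burn-ins before running.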
- Wide-angle / fish-eye lenses: landmarks miss faces near edges. Pre-correct distortion with ffmpeg lenscorrection first.
- Upscaling artifacts: 608×1080 → 1080×1920 is a 1.78× upscale and is visible on sharp text. If you have overlays you want to keep sharp, render at native crop dims (--output-size 608x1080) and let the platform upscale.
- Output bitrate > platform limit: default is --bitrate 12M. WeChat Channels (视频号) caps at 10 Mbps; pass --bitrate 8M for that target.
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install wjs-reframing-video
- After installation, invoke the skill by name or use /wjs-reframing-video
- Provide required inputs per the skill's parameter spec and get structured output
What is Wjs Reframing Video?
Use when the user wants to convert a video between horizontal and vertical orientations while preserving the inverted aspect ratio (16:9 ↔ 9:16, 4:3 ↔ 3:4, 2... It is an AI Agent Skill for Claude Code / OpenClaw, with 46 downloads so far.
Is Wjs Reframing Video free?
Yes, Wjs Reframing Video is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Wjs Reframing Video support?
Wjs Reframing Video is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Wjs Reframing Video?
It is built and maintained by Jian Shuo Wang (@jianshuo); the current version is v0.1.0.