Face Tracking for Vertical Video: Why It's Harder Than It Looks (And How It Works)
Reformatting landscape video to vertical sounds like a simple crop operation. It isn't. The technical challenge of keeping a human face consistently centered in a 9:16 frame while the subject moves, gestures, walks, and shifts position is what separates professional-quality vertical video from the static-center-crop approach that makes content feel wrong.
Here's what actually goes into solving this problem.
The Naive Approach and Why It Fails
The easiest implementation is a fixed center crop: take the middle third of a 16:9 frame and export it as 9:16. For static, tripod-mounted interviews, this works. For anything else, it fails in ways that compound quickly.
A host who leans left to emphasize a point. A guest who turns to face their interviewer. A speaker who walks across a stage. A YouTuber who gestures widely. In each case, the face exits the crop window, and you're left with a vertical video showing someone's shoulder or an empty background while the person is clearly talking.
Viewers register this immediately, even if they don't articulate it. The clip feels amateurish. Watch time drops in the first 5 seconds.
What Good Face Tracking Actually Requires
The production-quality approach runs face detection on every frame of the video, tracks the position of the primary speaker's face, and dynamically repositions a crop window that follows the subject while staying within the bounds of the original frame.
The components:
Face detection: Identifying where faces are in each frame. Modern solutions like MediaPipe Face Detection run in real time and handle varying distances, angles, and partial occlusion reliably. For single-speaker content, you're typically tracking one face — the primary presenter.
Subject persistence: If multiple faces appear, the system needs to maintain a consistent "primary" target rather than jumping between subjects. For interview content, this usually means tracking whoever is speaking, which requires some audio-to-speaker correlation or a simpler "largest face in frame" heuristic.
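The "largest face in frame" heuristic can be sketched in a few lines (the function name and the (x, y, w, h) bounding-box format are illustrative, not a specific library's API):

```python
def pick_primary_face(faces):
    """'Largest face in frame' heuristic: given (x, y, w, h) bounding
    boxes, return the one with the biggest area, or None if the frame
    contains no detected faces."""
    if not faces:
        return None
    return max(faces, key=lambda f: f[2] * f[3])
```

A production system would combine this with temporal consistency (prefer the box closest to last frame's primary face) so the target doesn't flip when two faces are momentarily similar in size.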
Smooth camera movement: Raw face-position data is noisy. If the crop window snaps to each new face position exactly, the "camera" shakes constantly. The solution is a smoothing pass — exponential moving average or low-pass filter applied to the x,y position stream before converting to crop coordinates.
```python
def smooth_positions(positions, alpha=0.15):
    """Exponential moving average smoothing for face tracking data."""
    smoothed = []
    prev = positions[0]
    for pos in positions:
        smoothed_pos = alpha * pos + (1 - alpha) * prev
        smoothed.append(smoothed_pos)
        prev = smoothed_pos
    return smoothed
```
The alpha value controls the trade-off between responsiveness and smoothness. A lower alpha yields smoother movement but lags behind fast subject motion. Values between 0.1 and 0.2 work well for most talking-head content.
Boundary handling: The crop window can't go outside the original frame. When the subject approaches the edge, the window needs to stop at the boundary rather than following further. This creates a "camera at the limit" effect rather than a hard-edge violation.
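A minimal sketch of that clamping logic (the function name and the 608×1080-crop-from-1920-wide-frame geometry are illustrative assumptions):

```python
def clamp_crop_x(face_center_x, crop_w, frame_w):
    """Center the crop window on the face, then clamp it so it never
    extends past either edge of the original frame."""
    x = face_center_x - crop_w / 2
    return max(0, min(x, frame_w - crop_w))
```

When the subject walks toward an edge, the returned x pins at 0 or frame_w - crop_w, producing the "camera at the limit" effect rather than a window that leaves the frame.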
The FFmpeg Implementation
Once you have a stream of (time, crop_x, crop_y) values, FFmpeg's crop filter with keyframe expressions handles the actual reframing:
```shell
ffmpeg -i input.mp4 \
  -vf "crop=w=608:h=1080:x='if(lt(t,0.5),320,if(lt(t,1.0),280,360))':y=0" \
  -c:v libx264 -crf 23 output.mp4
```
For production use, you'd generate the crop expression programmatically from your face-position data rather than hardcoding values. The 'expr' syntax in FFmpeg allows time-based expressions that interpolate between keyframes.
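One way to generate that nested expression from keyframe data — a sketch, not a specific tool's API; it emits step changes between keyframes rather than interpolating:

```python
def build_crop_expr(keyframes):
    """Build a nested FFmpeg if() expression from (end_time, crop_x)
    keyframes. The last pair's x is used after the final boundary
    (its time value is ignored)."""
    *steps, (_, last_x) = keyframes
    expr = str(last_x)
    # Nest from the last boundary outward so earlier times are tested first.
    for t, x in reversed(steps):
        expr = f"if(lt(t,{t}),{x},{expr})"
    return expr
```

Feeding it the keyframes behind the command above reproduces the same expression string, which you'd then splice into the crop filter's x parameter.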
Combining crop with caption overlay in a single FFmpeg pass saves significant encoding time versus piping through multiple stages.
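A hedged sketch of such a single-pass command, chaining the crop into a subtitle burn-in (captions.srt is an illustrative filename; the subtitles filter requires an FFmpeg build with libass):

```shell
# One encode: crop to 9:16, then burn captions into the cropped frame.
ffmpeg -i input.mp4 \
  -vf "crop=608:1080:656:0,subtitles=captions.srt" \
  -c:v libx264 -crf 23 -c:a copy output.mp4
```

Because filters in a -vf chain run in order, the captions are positioned relative to the already-cropped 608×1080 frame, which is what you want for vertical output.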
Where This Gets Deployed
At ClipSpeedAI, this pipeline runs on every video processed. The face tracking stage takes 20-40% of total job processing time depending on video resolution and length. For 1080p, 60-minute videos, the face detection pass alone processes roughly 108,000 frames at 30fps.
Several optimizations reduce this to practical timeframes:

- Process at downscaled resolution (480p) for detection, then apply the results to the full-resolution crop
- Skip frames using temporal sampling — detect every 3rd frame and interpolate between detections
- Run detection asynchronously in chunks rather than sequentially
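The temporal-sampling step can be sketched as linear interpolation between sparse detections (function name and stride value are illustrative):

```python
def interpolate_detections(detections, stride=3):
    """Expand sparse face positions (one detection every `stride` frames)
    into per-frame positions via linear interpolation."""
    frames = []
    for a, b in zip(detections, detections[1:]):
        for i in range(stride):
            frames.append(a + (b - a) * i / stride)
    frames.append(detections[-1])
    return frames
```

With a stride of 3, the detector runs on a third of the frames while every frame still gets a crop position, and the interpolated values feed straight into the smoothing pass described earlier.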
The result is vertical video that feels like a camera operator was actually following the subject — the standard that viewers now expect from professional short-form content.
The Practical Impact
Creators who switch from center-crop to tracked vertical video consistently see higher average watch time on their Shorts and Reels. The improvement isn't marginal. The subjective quality difference is immediately apparent to any viewer, even without understanding the technical reason.
For interview content, podcast clips, and lecture footage — the most common sources of short-form content — proper face tracking is the difference between clips that feel produced and clips that feel like something went wrong.
ClipSpeedAI applies this face tracking pipeline automatically to every video processed, no configuration required.
DEV Community
https://dev.to/kyle_clipspeedai/face-tracking-for-vertical-video-why-its-harder-than-it-looks-and-how-it-works-3e97
