How Close Is AI to Automating Content Creation? A Working Clip Pipeline

Close enough that a mostly hands-off pipeline can now turn one long video into several finished vertical clips and even upload them. It chains FFmpeg, a local Whisper model, Claude Opus 4.7 for moment selection, YOLO and Light ASD for framing, and code-driven Remotion for editing.

The state of automated short-form

Every six months, the creator behind this pipeline rebuilds it to measure how far the models have come, and the latest version is the most convincing yet. The goal is simple: take one long video and produce finished vertical clips with minimal manual work. What is interesting is that the system is not one giant model doing everything. It is a chain of specialized tools, each handling the job it is best at, stitched together so the handoffs are automatic.

Audio first, locally

The pipeline accepts any source video and immediately extracts the audio with FFmpeg, because transcription only needs the audio track and a small audio file is quicker to work with than the full video. That audio goes into a local Whisper model running on his Mac, producing a transcript with timestamps. Keeping transcription local is a deliberate cost choice. There is no upload step and no per-minute fee, which is what makes running this repeatedly practical rather than expensive.
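
As a rough illustration of this stage, here is a minimal Python sketch. It assumes FFmpeg is on the PATH and that the open-source openai-whisper package is installed. The source does not say which Whisper build, model size, or audio settings he uses, so the "base" model, the 16 kHz mono WAV format, and all function names are assumptions made for this example.

```python
import subprocess


def build_ffmpeg_command(video_path: str, audio_path: str) -> list:
    """Build an FFmpeg command that drops the video stream and writes 16 kHz mono WAV."""
    return [
        "ffmpeg",
        "-y",                  # overwrite the output file if it already exists
        "-i", video_path,      # input video
        "-vn",                 # discard the video stream
        "-ac", "1",            # downmix to mono
        "-ar", "16000",        # resample to 16 kHz (an assumed setting)
        "-c:a", "pcm_s16le",   # 16-bit PCM audio
        audio_path,
    ]


def extract_audio(video_path: str, audio_path: str) -> None:
    """Run FFmpeg and raise if it exits with an error."""
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)


def transcribe(audio_path: str, model_name: str = "base") -> list:
    """Transcribe locally and return segments with start/end times in seconds."""
    import whisper  # lazy import; requires `pip install openai-whisper`

    model = whisper.load_model(model_name)  # model size is an assumption
    result = model.transcribe(audio_path)
    return [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in result["segments"]
    ]


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, truncating fractions."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_transcript(segments: list) -> str:
    """Flatten segments into one timestamped line each, ready for a text model."""
    return "\n".join(
        f"[{format_timestamp(s['start'])} - {format_timestamp(s['end'])}] {s['text']}"
        for s in segments
    )
```

Calling extract_audio("episode.mp4", "episode.wav") and then format_transcript(transcribe("episode.wav")) would yield the kind of timestamped text the next stage reads. The file names here are placeholders.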

Where the model makes decisions

The crucial step is selection. For an hour-long video, Claude Opus 4.7 reads the timestamped transcript and hunts for the strongest moments, scoring candidates and authoring the clips it thinks will perform. On an 89-minute Diary of a CEO episode, it read the transcript, scored moments, and produced three finished clips in roughly five to ten minutes. This is the part that used to require a human scrubbing through footage, and it is now the fastest decision in the chain.
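
The source does not describe the prompt or the format of the model's scores, so the following is only a sketch of what the selection step could look like once the model has replied. It assumes, hypothetically, that the model is asked to return a JSON list of candidates with start, end, score, and title fields. The length limits and the three-clip cap are illustrative defaults, and the call to the model itself is left out.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    start: float   # seconds into the source video
    end: float     # seconds into the source video
    score: float   # model-assigned strength, higher is better
    title: str


def parse_candidates(reply_json: str) -> list:
    """Parse the (assumed) JSON reply into Candidate objects."""
    return [
        Candidate(float(i["start"]), float(i["end"]), float(i["score"]), str(i["title"]))
        for i in json.loads(reply_json)
    ]


def pick_clips(candidates: list, max_clips: int = 3,
               min_len: float = 15.0, max_len: float = 90.0) -> list:
    """Greedily keep the highest-scoring candidates that do not overlap."""
    usable = [c for c in candidates if min_len <= c.end - c.start <= max_len]
    chosen = []
    for cand in sorted(usable, key=lambda c: c.score, reverse=True):
        overlaps = any(cand.start < c.end and c.start < cand.end for c in chosen)
        if not overlaps:
            chosen.append(cand)
        if len(chosen) == max_clips:
            break
    # Return clips in the order they appear in the source video.
    return sorted(chosen, key=lambda c: c.start)
```

Filtering by length and overlap in plain code keeps the model's job narrow: it only has to judge which moments are strong, while the deterministic rules stay easy to audit and tune.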

Framing the speaker correctly

Good vertical clips live or die on framing, so two more models step in:

- YOLO detects the faces in each clip and keeps them in frame
- Light ASD determines which detected face is actually speaking
- The clip then reframes from 16:9 to vertical, with the crop following the active speaker

Without active-speaker detection, a vertical crop of a two-person interview would constantly sit on the wrong face. That single model is what makes the switching look natural.
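
The reframing logic described above can be sketched in a few lines of Python. This is not the creator's code. It assumes each frame arrives with a list of face boxes (as a detector such as YOLO might produce) and a per-face speaking score (as an active-speaker model such as Light ASD might produce). The dictionary layout, the smoothing factor, and the function names are all hypothetical.

```python
from typing import Optional


def crop_width_for_vertical(frame_h: int) -> int:
    """Width of a 9:16 crop using the full frame height, rounded down to an even number."""
    width = frame_h * 9 // 16
    return width - (width % 2)


def pick_speaker_center(faces: list) -> Optional[float]:
    """Return the horizontal center of the face with the highest speaking score.

    Each face is assumed to look like {"box": (x1, y1, x2, y2), "speaking": float}.
    """
    if not faces:
        return None
    best = max(faces, key=lambda f: f["speaking"])
    x1, _, x2, _ = best["box"]
    return (x1 + x2) / 2


def crop_x_positions(frames: list, frame_w: int, frame_h: int,
                     smoothing: float = 0.2) -> list:
    """Compute the left edge of the vertical crop for every frame.

    The crop center eases toward the active speaker (exponential smoothing)
    so the framing moves toward a new speaker without jittering.
    """
    crop_w = crop_width_for_vertical(frame_h)
    max_x = frame_w - crop_w
    center = frame_w / 2  # start centered until a speaker is found
    xs = []
    for faces in frames:
        target = pick_speaker_center(faces)
        if target is not None:
            center += smoothing * (target - center)
        x = int(round(center - crop_w / 2))
        xs.append(min(max(x, 0), max_x))  # keep the crop inside the frame
    return xs
```

For a 1920x1080 source this gives a 606-pixel-wide crop window. The per-frame x values could then drive a crop, for example through FFmpeg's crop filter or inside the rendering step.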

Editing as code

The last production step is retention editing in Remotion, which is driven by code rather than a timeline editor. It layers captions, zooms, flashes, and meme sound effects, and because it is programmatic, the same style applies to every clip automatically. He demonstrated it across several formats (a podcast, a react video, and interviews), and the speaker switching held up well on the tougher multi-person clips.
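
Remotion compositions are written in TypeScript and work in frames rather than seconds, so something upstream has to convert transcript timings into frame numbers. The sketch below shows one way that handoff might look, staying in Python like the rest of the pipeline. It assumes word-level timings are available and that a composition reads a props object shaped like the one built here. That props shape, the 30 fps default, and the three-word grouping are assumptions, not details from the source.

```python
import json


def to_frame(seconds: float, fps: int) -> int:
    """Convert a time in seconds to the nearest frame index."""
    return round(seconds * fps)


def build_caption_props(words: list, clip_start: float,
                        fps: int = 30, words_per_caption: int = 3) -> dict:
    """Group words into short captions with frame-based timing.

    Each word is assumed to look like {"word": str, "start": float, "end": float},
    with times measured from the start of the source video.
    """
    captions = []
    for i in range(0, len(words), words_per_caption):
        group = words[i:i + words_per_caption]
        start_frame = to_frame(group[0]["start"] - clip_start, fps)
        end_frame = to_frame(group[-1]["end"] - clip_start, fps)
        captions.append({
            "text": " ".join(w["word"] for w in group),
            "from": start_frame,  # named after Remotion's Sequence props
            "durationInFrames": max(1, end_frame - start_frame),
        })
    return {"fps": fps, "captions": captions}


def write_props(props: dict, path: str) -> None:
    """Save the props as JSON so a render step can load them."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(props, handle, indent=2)
```

Because the styling lives in the composition and only this data changes per clip, every clip inherits the same look, which is the consistency benefit the programmatic approach is meant to deliver.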

Upload without the API

To close the loop, he uses a browser Surf Agent instead of the platform API. It picks the file, writes the title, sets the visibility to private, and hits save on its own. That is handy if you have a spare machine like a Mac mini logged into your account, since it avoids API setup entirely.

The honest caveat

The pipeline is fast and the output is clean, but he is careful not to overclaim. He has not actually posted any of these clips yet, so their real performance is unproven. The combination of Whisper, YOLO, Light ASD, and Remotion is clearly far ahead of where it was six months ago, and the next step is simply to post and measure. Treat this as a working blueprint, not a finished results study, and you have a genuinely useful automation map.