There are two fundamentally different ways to tell a computer how to make a video.
One says what the video should contain. The other says how to draw every frame.
This distinction, declarative versus procedural, shapes everything from how fast you can iterate to whether an AI can do the work for you. And it turns out it matters a lot more than most people think.
Two Ways to Make a Video
Let's start with a simple example. You want to animate a blue circle moving across the screen.
The procedural approach
In a procedural system, you write step-by-step instructions. “Create a circle at position (0, 0). On each frame, move it 2 pixels to the right. After 120 frames, stop.”
You control every detail. Frame rates. Easing curves. Exact pixel positions. The code reads like a recipe: do this, then this, then this.
Manim, the Python library originally created by Grant Sanderson (3Blue1Brown) to produce his famous math videos, is a great example of the procedural approach. In Manim, you define a scene as a sequence of explicit animation steps:
```python
from manim import *

class CircleAnimation(Scene):
    def construct(self):
        circle = Circle(color=BLUE)
        self.play(Create(circle))
        self.play(circle.animate.shift(RIGHT * 3))
        self.play(FadeOut(circle))
```

Every animation is an explicit command. “Play this. Then play that. Then fade out.” The programmer choreographs every movement. This gives you extraordinary control. Manim can produce incredibly beautiful, precise animations, which is exactly why 3Blue1Brown's videos look the way they do.
But that control comes at a cost: you need to think about how things move, not just what they should look like.
Here is what procedural animation looks like in practice. Notice the code execution indicator stepping through each command:
The declarative approach
In a declarative system, you describe the desired outcome. “There is a blue circle. It starts on the left and ends on the right. The transition takes 2 seconds.”
You do not write the animation loop. You do not manage frame counts. The system figures out the “how” for you. You just say what you want.
Remotion is a good example of this. Remotion lets you build videos using React, the same framework used to build web interfaces. You describe your video as a tree of components with props and timing:
```tsx
import { Composition } from "remotion";

// ListingVideo and the photo assets are defined elsewhere in the project.
const MyVideo = () => (
  <Composition
    id="ListingVideo"
    component={ListingVideo}
    durationInFrames={300}
    fps={30}
    width={1080}
    height={1920}
    defaultProps={{
      address: "123 Ocean Drive",
      price: "$1,250,000",
      photos: [photo1, photo2, photo3]
    }}
  />
);
```

Notice what is happening here. You are not telling the system how to animate anything. You are declaring what the video contains: an address, a price, some photos, a duration. The component itself decides how to render those props into a visual sequence.
This is the same mental model as building a web page with React. You describe the UI. React renders it. You describe the video. Remotion renders it.
Both animations above are built with Remotion and running live on this page. Same library, two different mental models. The declarative version is shorter, easier to modify, and far easier for an AI to generate.
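To see where the “how” actually lives, here is a minimal sketch of what a component like ListingVideo might contain. The article only shows the props it receives; the internals below are an assumption for illustration, using Remotion's useCurrentFrame and interpolate helpers to turn declared data into motion.

```tsx
import React from "react";
import { AbsoluteFill, Img, interpolate, useCurrentFrame } from "remotion";

type ListingVideoProps = {
  address: string;
  price: string;
  photos: string[];
};

// The template owns the "how": easing, timing, and layout live here,
// not in the data the AI produces.
export const ListingVideo: React.FC<ListingVideoProps> = ({ address, price, photos }) => {
  const frame = useCurrentFrame();
  // Fade the text in over the first second (30 frames at 30 fps).
  const opacity = interpolate(frame, [0, 30], [0, 1], { extrapolateRight: "clamp" });

  return (
    <AbsoluteFill style={{ backgroundColor: "#000" }}>
      <Img src={photos[0]} style={{ width: "100%" }} />
      <h1 style={{ opacity, color: "white" }}>{address}</h1>
      <h2 style={{ opacity, color: "white" }}>{price}</h2>
    </AbsoluteFill>
  );
};
```

Change the props and the same component produces a different video; nothing about the motion needs to be rewritten.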
Why This Distinction Matters for AI
Here is where it gets interesting.
If your video system is declarative, then generating a video becomes a data problem, not a creative choreography problem. An LLM does not need to understand animation timing, easing functions, or pixel-level positioning. It just needs to produce the right data structure.
Structured outputs make it even easier
There is another advantage that is easy to overlook. Declarative templates follow specific schemas. A listing video template expects a JSON object with fields like address, price, beds, and template. Each field has a defined type and set of valid values.
Modern LLMs have structured output as a first-class feature. You can give the model a JSON schema and it will produce valid, well-formed output every time. The animation might not win design awards, but the data structure is always correct: proper field names, valid types, no syntax errors.
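As a rough sketch of what that schema check looks like, here is a hypothetical props schema for a listing template, written with zod as the validator (the article only says “JSON schema,” so the library choice and exact field shapes are assumptions):

```ts
import { z } from "zod";

// Hypothetical props schema for a listing-video template. The field names
// echo the ones mentioned above; the exact shape is illustrative.
const ListingProps = z.object({
  template: z.enum(["just-listed", "open-house", "price-reduction", "just-sold"]),
  address: z.string(),
  price: z.string(),
  beds: z.number().int().min(0),
  baths: z.number().int().min(0),
  photos: z.array(z.string().url()).min(1),
});

// Checking the model's output is a millisecond-scale operation, not a render.
export function validateListingProps(raw: unknown) {
  const result = ListingProps.safeParse(raw);
  if (!result.success) {
    // The validation errors can be sent straight back to the model for a retry.
    throw new Error(result.error.issues.map((issue) => issue.message).join("; "));
  }
  return result.data; // Fully typed props, ready to hand to the template.
}
```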
Compare that to having an AI write procedural animation code. Without a compilation step or runtime checks, you have no idea whether the output is even valid until you try to run it. A missing semicolon, a misspelled function name, or a wrong parameter type breaks the whole thing, and nothing flags it until render time. The feedback loop is fundamentally different:
With a declarative approach, validating the AI's output is as fast as running a JSON schema check, milliseconds. With procedural code, you need to compile it, render a video, visually inspect the result, decide what is wrong, and send it back. That loop takes minutes, not milliseconds.
“Prompt to video” in a declarative system looks like this:
- User writes a prompt: “Create a Just Listed video for 123 Ocean Drive, $1.25M, 4 bed 3 bath”
- LLM translates the prompt into a structured props object (address, price, features, photos)
- The declarative video template renders the final video
The AI never touches animation code. It just fills in a form. The template handles everything visual.
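A sketch of those three steps is below. All three callbacks are stand-ins, not a real API: extractProps wraps whatever LLM structured-output call you use, validate is the kind of schema check shown earlier, and renderTemplate invokes the declarative template.

```ts
// Hypothetical "prompt to video" pipeline matching the three steps above.
export async function promptToVideo(
  prompt: string,
  extractProps: (prompt: string) => Promise<unknown>,
  validate: (raw: unknown) => Record<string, unknown>,
  renderTemplate: (props: Record<string, unknown>) => Promise<string>
): Promise<string> {
  const raw = await extractProps(prompt);  // 1. LLM turns the prompt into structured props
  const props = validate(raw);             // 2. Millisecond schema check, no render needed
  return renderTemplate(props);            // 3. The template turns props into the final video
}
```

Note that the only place animation knowledge exists is inside renderTemplate, which a human designed once.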
Now imagine doing the same thing with a procedural system. The AI would need to write code like “move this text to coordinates (540, 200), fade it in over 15 frames with a cubic-bezier easing, then slide the photo in from the right at frame 45.” That is dramatically harder, more error-prone, and nearly impossible to get right without visual feedback.
Three Levels of AI Video Generation
When you think about how AI can generate videos, there are really three levels of ambition.
Level 1: Prompt to template (safe and fast)
The AI picks a pre-built template and fills in the data. The visual design is locked in. The AI's job is limited to understanding user intent and mapping it to the right template with the right content.
This is the most reliable approach. Templates are designed by humans with taste and experience. The AI handles the boring part: extracting listing details, writing copy, selecting photos. The template handles the hard part: timing, transitions, typography, and visual hierarchy.
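One way to express that constraint as a single structured output is a discriminated union: the model picks a template and supplies only that template's fields. The shapes below are illustrative, not an actual production schema.

```ts
import { z } from "zod";

// Level 1 as a structured-output target: choose a template, fill in its data.
const Level1Output = z.discriminatedUnion("template", [
  z.object({
    template: z.literal("just-listed"),
    address: z.string(),
    price: z.string(),
    beds: z.number(),
    baths: z.number(),
  }),
  z.object({
    template: z.literal("open-house"),
    address: z.string(),
    date: z.string(),
    startTime: z.string(),
    endTime: z.string(),
  }),
]);

type Level1Output = z.infer<typeof Level1Output>;
```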
Level 2: Prompt to full video (powerful but fragile)
The AI generates the entire video definition, choosing layouts, animations, colors, and timing from scratch. In a declarative system like Remotion, this means the LLM would generate a complete React component.
This sounds amazing in theory. In practice, it is extremely hard to get right. The AI might choose a font size that clips on mobile. It might place text over a busy part of the photo where it becomes unreadable. It might pick transitions that feel cheap or timing that feels rushed.
Motion graphics is a craft. The difference between “professional” and “PowerPoint slideshow” is hundreds of small decisions about spacing, timing, and rhythm. LLMs are not great at those decisions yet.
Level 3: Prompt to video with a feedback loop
This is the most ambitious approach. The AI generates a video, then watches it, analyzes the result, and iterates. “The text is too small. The transition feels too fast. The logo is cut off.” Fix, re-render, repeat.
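In pseudocode-like form, the loop might look like the sketch below. All three callbacks are assumptions; the article describes the idea, not a concrete implementation.

```ts
type Critique = { acceptable: boolean; notes: string[] };

// Hypothetical Level 3 loop: generate, render, critique, revise.
async function generateWithFeedback(
  prompt: string,
  generate: (prompt: string) => Promise<string>,      // LLM writes the video definition
  render: (definition: string) => Promise<string>,    // returns a path to the rendered file
  critique: (videoPath: string) => Promise<Critique>, // vision model "watches" the result
  maxIterations = 3
): Promise<string> {
  let definition = await generate(prompt);
  for (let i = 0; i < maxIterations; i++) {
    const videoPath = await render(definition);       // minutes of render time per pass
    const review = await critique(videoPath);
    if (review.acceptable) return definition;
    // Feed the critique back and try again: "text too small", "transition too fast", ...
    definition = await generate(`${prompt}\n\nFix these issues: ${review.notes.join("; ")}`);
  }
  return definition;
}
```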
We actually tried this.
The results were... educational. Current vision models can tell you roughly what is in a video frame. They can identify objects, read text, describe scenes. But they are surprisingly bad at the things that matter for motion graphics:
- Layout precision — “Is this text too close to the edge?” Models struggle with spatial relationships at the pixel level
- Text readability — “Can you actually read this white text over this light photo?” Models often say yes when a human would squint
- Timing and rhythm — “Does this transition feel too fast?” This is almost entirely subjective and models have no reliable sense of it
- Visual hierarchy — “Where does the eye go first?” Models can describe elements but cannot evaluate visual weight
The feedback loop approach is theoretically sound. But until video understanding catches up to video generation, it is too slow and too unreliable for production use. You end up burning minutes of render time and API calls to produce something a well-designed template would have nailed in seconds.
Declarative Thinking Is Everywhere
This is not just a video problem. The declarative versus procedural tension shows up across every domain where AI meets creative output.
- Web development — React (declarative) versus jQuery (procedural). React won because declaring UI as a function of state is easier to reason about, and much easier for AI tools like Copilot and Cursor to generate
- Mobile apps — SwiftUI and Jetpack Compose (declarative) versus UIKit and Android Views (procedural). The industry moved declarative because it is simpler, and AI code generation works dramatically better with declarative frameworks
- Infrastructure — Terraform (declarative) versus shell scripts (procedural). You declare the desired state. The system figures out how to get there
- Animations — CSS transitions (declarative) versus requestAnimationFrame (procedural). React Native's Animated API uses a declarative model. Lottie animations are declarative JSON files. The pattern is clear
- Data — SQL (declarative) versus writing loops to filter arrays (procedural). You say what data you want, not how to fetch it
The pattern is always the same: declarative systems separate what from how. And that separation is exactly what makes them AI-friendly. An LLM can reliably produce a “what.” Producing a “how” requires domain expertise that models are still developing.
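The blue-circle example from the start makes the split visible in the browser, too. Below is a small sketch contrasting the Web Animations API (declare keyframes and a duration, the browser works out every frame) with a hand-rolled requestAnimationFrame loop; the element id is assumed.

```ts
const circle = document.getElementById("circle") as HTMLElement;

// Declarative: describe start, end, and duration; the browser decides the "how".
function moveDeclaratively() {
  circle.animate(
    [{ transform: "translateX(0px)" }, { transform: "translateX(300px)" }],
    { duration: 2000, easing: "ease-in-out", fill: "forwards" }
  );
}

// Procedural: the same movement, choreographed by hand, one frame at a time.
function moveProcedurally() {
  let x = 0;
  const step = () => {
    x += 2.5; // ~300px over 2 seconds at 60 fps
    circle.style.transform = `translateX(${x}px)`;
    if (x < 300) requestAnimationFrame(step);
  };
  requestAnimationFrame(step);
}
```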
When to use which
Neither approach is universally better. The right choice depends on what you are building.
| | Procedural (Manim, manual code) | Declarative Templates (Remotion) |
|---|---|---|
| Best for | Long-form, one-of-a-kind content with custom animations | Short, repeatable content that varies by data |
| Creative ceiling | Unlimited. Every frame is fully controlled | Constrained by template design. High within those constraints |
| Speed per video | Hours to days per unique piece | Seconds to minutes. Change the data, re-render |
| AI compatibility | Difficult. AI must write animation logic | Natural. AI only needs to produce data |
| Skill required | Programming + animation design + taste | Template design once, then minimal per video |
| Examples | 3Blue1Brown explainers, custom brand films, documentary graphics | Real estate listings, social media ads, event announcements |
| Scale | One at a time, artisanal | Thousands per day, automated |
An important nuance: declarative does not mean simple. A Remotion template can produce highly complex, cinematic videos. The constraint is that the complexity lives in the template design, not in the per-video creation process. That is actually the whole point: invest in the template once, then stamp out professional videos at scale.
You might wonder: could you embed procedural code inside a declarative template? Technically, yes. But at that point you are not really using a declarative approach anymore. You are just wrapping procedural logic in a declarative container. The AI still needs to understand the procedural parts, which defeats the purpose.
It is also worth noting that commercial tools like Adobe After Effects and Animate are powerful but sit outside this comparison. They require expensive licenses, manual operation, and currently have limited AI entry points. There is no good way for an LLM to drive After Effects. That may change, but it is not the reality today.
For Real Estate, Templates Win
All of this theory is interesting, but what does it mean in practice?
For real estate marketing videos specifically, the answer is clear: templates are the right abstraction. Here is why.
Real estate videos are repetitive by nature. Every listing needs the same types of videos: Just Listed, Open House, Price Reduction, Just Sold. The structure is predictable. The data changes, but the format does not.
This is exactly where declarative templates shine. A human designer creates the template once, pouring their taste and motion graphics expertise into every detail: the perfect easing curve on the price reveal, the exact moment the address fades in, the subtle parallax on the hero photo.
Then the AI fills in the blanks. New photos. New address. New price. Same professional result.
A property listing video generated from a template with AI voiceover. The visual design was crafted by a human. The data was filled in by AI.
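As a rough sketch of what “fill in the blanks and re-render” looks like programmatically, here is how it might be done with Remotion's server-side rendering packages. The composition id, entry point, and option names are illustrative, and the exact API may differ by Remotion version, so treat this as a sketch rather than copy-paste code.

```ts
import { bundle } from "@remotion/bundler";
import { renderMedia, selectComposition } from "@remotion/renderer";

// New photos, new address, new price: same template, new render.
async function renderListing(inputProps: {
  address: string;
  price: string;
  photos: string[];
}) {
  // Bundle the Remotion project once; reuse the bundle across renders.
  const serveUrl = await bundle({ entryPoint: "./src/index.ts" });

  const composition = await selectComposition({
    serveUrl,
    id: "ListingVideo", // assumed composition id
    inputProps,
  });

  await renderMedia({
    composition,
    serveUrl,
    codec: "h264",
    inputProps,
    outputLocation: `out/${inputProps.address.replace(/\s+/g, "-")}.mp4`,
  });
}
```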
Compare this to asking an AI to generate the entire video from scratch. You might get something decent on the third or fourth try. But “third or fourth try” means three or four rendering cycles, each taking time and compute. For a real estate agent who needs a video in under a minute, that is a non-starter.
What AI Cannot Do Yet
Let's be honest about where the boundaries are.
Typical video montages done by human editors involve a density of creative decisions that AI simply cannot replicate today. A skilled editor is simultaneously thinking about:
- Timing — Cutting on the beat. Holding a shot just long enough for emotional impact. Knowing when a transition should be sharp and when it should breathe
- Positioning — Text placement that accounts for safe zones, visual weight, and the viewer's eye path. Understanding that a title at the top of the frame feels different than one at the bottom
- Taste — The hardest one. Knowing that a slow dissolve feels elegant while a hard cut feels energetic. That a 200ms ease-in feels snappy while 400ms feels smooth. These are not rules you can codify. They are intuitions built from watching thousands of videos
Real-time video understanding, the ability for an AI to “watch” a rendered video and make these kinds of judgments, is still beyond current capabilities. Models can describe what they see, but they cannot evaluate whether it feels right.
This is not a temporary limitation that will be fixed in the next model release. Motion graphics is a deeply embodied skill. It draws on rhythm, spatial reasoning, and aesthetic judgment in ways that text-trained models are not equipped for.
Will AI get there eventually? Probably. But not soon enough to bet your marketing pipeline on it.
The Sweet Spot
The pragmatic answer, the one that works today, is to combine human design with AI automation.
Humans design the templates. They make the hundreds of micro-decisions about timing, positioning, and taste that create a professional result.
AI fills in the data. It extracts listing details, writes voiceover scripts, selects the right template for the moment, and maps everything to the right props.
The declarative layer, the template, is the bridge between human craft and AI speed.
Declare the what. Design the how. Let AI handle the rest.
That is the approach we took with Video Creator. Not because it was the simplest option, but because after experimenting with prompt-to-video and feedback loops, it was the one that actually worked.
Sometimes the best engineering decision is knowing what to automate and what to leave to humans.
This is the thinking behind how we built Video Creator at Bounti. If you are a real estate agent or brokerage looking for professional listing videos without the production overhead, it might be worth a look. Templates, AI voiceover, every major platform, under a minute per video.
Try it free or read more about how Video Creator works.