
# Programmable Cinematography: Building Ministudio
*My idea of world building as a state machine: orchestration, programmable generation of videos, and simulating a world with variables 😂*
## Ministudio
Late last year I attended an AI filmmaking event. Creativity was zooming all over the place: artists, photographers, and filmmakers chatting about the future of filmmaking with AI. I was having fun, but I got the most benefit from the last section of the workshop, in which an amazing lady named Mercy Mutisya (if you're reading this, thanks a lot), who writes scripts for a living, gave a talk on the technicalities of filmmaking. I was amazed by the depth, tricks, and technicality of filmmaking, and by how AI fills lots of gaps but never takes the mantle in totality.
In essence, filmmaking (or content creation) is a step-by-step process with lots of fine-tuning and double-checking. She talked about the best models for video generation and how to generate cinematically accurate videos. I loved the talk, but I did not have any immediate application for the things she taught me, just not yet.
This week I was looking for a video to explain a project I was working on, a learning management platform. I thought maybe I should generate a video instead of trying to be a content creator, which I am not. I tried Google's video generation models like Veo and OpenAI's Sora, and it was bad, really bad 😂. The videos are out of sync, character and background continuity is whacky, and the model keeps changing things. Clearly I was a bad prompt engineer, so I thought maybe I could be a good software engineer instead and generate videos with state persistence and continuity of props, characters, and backgrounds in code. I just might have succeeded.
So the problem with generative models is hallucination and inconsistency. If you ask an AI model to generate a 10-second video of a character doing stuff, it does a really great job and produces impressive work. But if you then ask for another 10-second video of the same character in a new place, the character's face, clothes, and sometimes even skin tone change. I believe this is an issue that can be solved with code, and here is my take on it.
I created an open-source Python framework called Ministudio. It's an orchestration layer that treats video generation the way Kubernetes treats deployments: a master orchestrator that keeps everything in sync and in continuity.
How it works: it rests on three pillars.
1. Identity grounding
Instead of just prompting a character, we use visual tokens: we anchor the character's details (hair, eyes, build) in every generation step. By providing a reference portrait, we force the model to respect the identity across 60+ seconds of footage.
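As a rough illustration of identity grounding, a helper like the one below could fold the identity dictionary and reference portrait into every request. The function name and payload shape are hypothetical sketches, not Ministudio's actual API:

```python
# Hypothetical sketch: repeat the character's identity traits and reference
# portrait verbatim in every generation call, so each clip sees the same
# anchor. Names here are illustrative, not Ministudio internals.

def build_grounded_prompt(action: str, identity: dict, anchor_path: str) -> dict:
    """Return a request payload that carries the identity anchor on every call."""
    traits = ", ".join(f"{k}: {v}" for k, v in sorted(identity.items()))
    return {
        "prompt": f"{action}. Character (keep identical in every frame): {traits}.",
        "reference_image": anchor_path,  # visual anchor passed with every call
    }

payload = build_grounded_prompt(
    "Emma rubs her tired eyes at a cluttered desk",
    {"hair": "short chestnut bob", "eyes": "amber"},
    "assets/references/emma_portrait.png",
)
```

The point is simply that the anchor text never varies between shots, so the model has no room to reinvent the character.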
2. Temporal context
Models forget the background the moment the camera moves. Ministudio uses a state machine that tracks the environment geometry. We also extract the last three frames of every shot and feed them back into the next generation. This creates a memory that eliminates visual jumps (the current logic still suffers from audio jumps, though).
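The state-machine idea can be sketched in a few lines: a small state object that locks the environment facts and keeps a rolling buffer of the last three frames of the previous shot, which the next generation call is conditioned on. Class and field names are my assumptions, not Ministudio internals:

```python
from collections import deque
from dataclasses import dataclass, field

# Illustrative sketch of temporal context: the environment description is
# locked once, and a deque with maxlen=3 automatically keeps only the last
# three frames committed from the previous shot.

@dataclass
class SceneState:
    environment: dict                                   # locked geometry/lighting facts
    frame_buffer: deque = field(default_factory=lambda: deque(maxlen=3))

    def commit_shot(self, frames: list) -> None:
        """After a shot renders, keep only its last frames as memory."""
        for f in frames:
            self.frame_buffer.append(f)

    def conditioning(self) -> dict:
        """What the next generation call should see to avoid visual jumps."""
        return {
            "environment": self.environment,
            "context_frames": list(self.frame_buffer),  # at most 3 frames
        }

state = SceneState({"location": "dorm", "lighting": "late night"})
state.commit_shot([f"frame_{i}.png" for i in range(240)])  # a full shot's frames
```

After committing 240 frames, only `frame_237.png` through `frame_239.png` survive in the buffer, which is exactly the memory the next shot needs.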
3. Orchestrated audio sync
Audio should not be an afterthought. We synchronize the generated video with audio, handling variable clip lengths and merging clips sequentially.
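Handling variable lengths mostly comes down to cumulative offsets. A minimal sketch of the sequential-merge bookkeeping (in practice the computed offsets would be handed to a muxer such as ffmpeg):

```python
# Each shot's narration clip has its own length, so start/end times are
# computed cumulatively rather than assumed fixed.

def layout_audio(clip_lengths: list[float]) -> list[tuple[float, float]]:
    """Return (start, end) times for clips placed back to back."""
    timeline, cursor = [], 0.0
    for length in clip_lengths:
        timeline.append((cursor, cursor + length))
        cursor += length
    return timeline

segments = layout_audio([8.0, 6.5, 7.25])
```

This yields `(0.0, 8.0)`, `(8.0, 14.5)`, `(14.5, 21.75)`: each clip starts exactly where the previous one ended, which is what keeps narration from drifting against the video.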
Example videos generated from my first trials
This is the video I needed for my own use case, and it was the first sample.
It uses Studio Ghibli aesthetics and Makoto Shinkai lighting (I have no idea about those fancy terms), all controlled via 20 lines of Python code found at our GitHub repo linked below. The script needs some work: the characters kept changing their appearance and overall details.
Here is another video, of a grandpa explaining quantum mechanics to a kid.
The continuity of both characters is really terrible, and the audio becomes a female voiceover at one point 😂.
The last one I generated, which gave me some success!!!
The character continuity and background are really fine. It needs more work on voice and tonality, but overall, as we can see, the more state we keep track of, the better the generation.
This is the latest iteration of our storytelling.
The code used to generate the videos above:
1"""
2ContextBytes: Human & Machine Harmony (Brand Story) - FLAGSHIP 60s+ EDITION
3======================================================================
4A cinematic introduction to ContextBytes with Dynamic Duration & Narrative Flow.
5Story: Emma (Student) & David (Professional) find clarity via ContextKeeper.
6"""
7
8import asyncio
9from pathlib import Path
10
11from ministudio.orchestrator import VideoOrchestrator
12from ministudio.providers.vertex_ai import VertexAIProvider
13from ministudio.config import (
14 VideoConfig, SceneConfig, ShotConfig, ShotType,
15 Character, Environment, StyleDNA, Persona, DEFAULT_PERSONA,
16 Cinematography, Camera, Color
17)
18
19# ============================================================================
20# CINEMATOGRAPHY - Master Filmmaker Presets
21# ============================================================================
22
23PREMIUM_CINE = Cinematography(
24 camera_behaviors={
25 "chaos_pan": Camera(lens="24mm", aperture="f/4", movement_style="jittery handheld pan-through-clutter"),
26 "discovery_macro": Camera(lens="100mm", aperture="f/2.8", movement_style="focus pull from screen to face"),
27 "architecture_top": Camera(lens="35mm", aperture="f/8", movement_style="high-angle crane down"),
28 "hero_infinite": Camera(lens="50mm", aperture="f/1.8", movement_style="slow push-in to subjects")
29 },
30 shot_composition_rules={
31 "rule_of_thirds": True,
32 "leading_lines": "towards the Knowledge Orb",
33 "depth_layering": "foreground bokeh, midground subjects, background architecture"
34 }
35)
36
37# ============================================================================
38# CHARACTERS - Visual Anchor Links
39# ============================================================================
40
41EMMA = Character(
42 name="Emma",
43 identity={
44 "hair": "short chestnut brown bob, hand-drawn texture with soft bangs",
45 "eyes": "large inquisitive amber eyes with detailed catchlights",
46 "face": "soft round Shinkai-style face, expressive subtle smile",
47 # Absolute consistency lock
48 "skin_tone": "fair porcelain with slight pink blush on cheeks",
49 "build": "slender, wearing a high-quality cerulean blue wool sweater",
50 "aesthetic": "painterly Ghibli protagonist, cinematic digital painting"
51 },
52 visual_anchor_path="c:/Users/USER/Music/ministudio/assets/references/emma_portrait.png",
53 current_state={
54 "clothing": "cerulean blue winter sweater, messy desk environment"},
55 voice_id="en-US-Studio-O", # Warm, welcoming female narrator
56 voice_profile={"style": "narrative", "pitch": 0.5}
57)
58
59DAVID = Character(
60 name="David",
61 identity={
62 "hair": "neatly groomed short onyx black hair",
63 "eyes": "deep intelligent dark eyes, scholarly focus",
64 "face": "focused angular features, clean-shaven",
65 # Absolute consistency lock
66 "skin_tone": "warm bronze skin with detailed hand-drawn shadows",
67 "glasses": "minimalist silver-rimmed circular glasses",
68 "aesthetic": "refined professional Ghibli style"
69 },
70 visual_anchor_path="c:/Users/USER/Music/ministudio/assets/references/david_portrait.png",
71 current_state={
72 "clothing": "charcoal grey corporate shirt, forest green scarf"}
73)
74
75Keeper = Character(
76 name="The ContextKeeper",
77 identity={
78 "form": "a levitating orb of liquid golden light, tennis-ball size",
79 "glow": "radiates #D4AF37 golden pulses and floating motes",
80 "texture": "ethereal, translucent golden core"
81 }
82)
83
84# ============================================================================
85# ENVIRONMENTS - Chaos to Wisdom
86# ============================================================================
87
88CHAOTIC_DORM = Environment(
89 location="Emma's Biology Dorm",
90 identity={
91 "architecture": "cluttered bookshelves, messy desktop, stacks of biology PDFs"},
92 current_context={
93 "lighting": "dim indoor light, blue glare from multiple computer screens",
94 "atmosphere": "claustrophobic, overwhelming information overload",
95 "time_of_day": "late night study session"
96 },
97 reference_images=[
98 "c:/Users/USER/Music/ministudio/assets/references/data_abyss_bg.png"]
99)
100
101CORPORATE_MAZE = Environment(
102 location="Modern Tech Office Lab",
103 identity={
104 "architecture": "glass walls, whiteboards filled with complex architecture diagrams"},
105 current_context={
106 "lighting": "slick fluorescent lighting, high-contrast shadows",
107 "atmosphere": "dry, technical, professional overwhelm",
108 "time_of_day": "busy afternoon"
109 },
110 reference_images=[
111 # Anchor for David's office vibes
112 "c:/Users/USER/Music/ministudio/assets/references/shinkai_stratosphere_bg.png"]
113)
114
115GITHUB_GARDEN = Environment(
116 location="The Knowledge Garden Study",
117 identity={
118 "architecture": "arched mahogany bookshelves, high ceilings, spiral stairs"},
119 current_context={
120 "lighting": "warm afternoon sun with visible dust motes (Tyndall effect)",
121 "atmosphere": "magical, painterly, deep academic peace",
122 "time_of_day": "golden hour"
123 },
124 reference_images=[
125 "c:/Users/USER/Music/ministudio/assets/references/ghibli_atelier_bg.png"]
126)
127
128STYLE_DNA = StyleDNA(
129 traits={
130 "visual_style": "Studio Ghibli hand-painted backgrounds",
131 "lighting_style": "Makoto Shinkai vibrant lens flares and glowing edges",
132 "color_palette": "Deep teals (#008080) transitioning to Master Gold (#D4AF37)",
133 "brushwork": "Painterly, thick impasto textures on clouds",
134 "detail_level": "Ultra-high, hyper-focused foregrounds"
135 },
136 references=["Spirited Away", "Your Name"]
137)
138
139
140async def create_brand_video():
141 print("Starting FLAGSHIP Production: ContextBytes Brand Story (Dynamic Flow)...")
142
143 provider = VertexAIProvider()
144 orchestrator = VideoOrchestrator(provider)
145
146 scene = SceneConfig(
147 concept="From Chaos to Human Wisdom",
148 mood="Intellectual, Cinematic, Magical",
149 characters={"Emma": EMMA, "David": DAVID, "Keeper": Keeper},
150 shots=[
151 # 1. Emma's Struggle (Demonstrates long narration splitting)
152 ShotConfig(
153 shot_type=ShotType.WS,
154 environment=CHAOTIC_DORM,
155 action="Wide jittery pan across Emma's room. Thousands of digital windows overlap in the air—PDFs, YouTube playlists, and research articles. Emma rubs her tired eyes, looking defeated by the stacks of books and open browser tabs.",
156 narration=(
157 "In a world where information moves faster than we can think, we often find ourselves lost. "
158 "Emma is a brilliant student, but even she is drowning in a sea of millions of PDFs, endless playlists, "
159 "and a thousand open tabs that lead nowhere. She's looking for wisdom, but she only finds noise."
160 ),
161 # This will be ~15s, triggering recursive splitting (8s + 7s)
162 duration_seconds=None
163 ),
164
165 # 2. The Discovery
166 ShotConfig(
167 shot_type=ShotType.CU,
168 environment=CHAOTIC_DORM,
169 action="Close-up on Emma's laptop screen. She opens ContextBytes. A warm golden pulse radiates from the center. The Keeper orb emerges from the UI, its light cleaning the digital clutter into organized spheres.",
170 narration="Meet Emma. She didn't need more data; she needed a way to make sense of it. She found ContextBytes.",
171 duration_seconds=None,
172 continuity_required=True
173 ),
174
175 # 3. The AI Teacher
176 ShotConfig(
177 shot_type=ShotType.MS,
178 environment=GITHUB_GARDEN,
179 action="Shot in the Garden Atelier. The Keeper levitates, projecting a glowing teal 3D biology model. Emma watches, her face lighting up as she finally understands. The atmosphere is peaceful.",
180 narration="Our agent, the ContextKeeper, doesn't just give answers. It guides you, explains the 'why', and organizes your path to mastery.",
181 duration_seconds=None,
182 continuity_required=True
183 ),
184
185 # 4. David's Professional Struggle
186 ShotConfig(
187 shot_type=ShotType.WS,
188 environment=CORPORATE_MAZE,
189 action="A high-angle shot of David in a sleek, cold tech office. He's dwarfed by skyscrapers of technical documentation and architectural specs. He looks stressed, trying to find clarity in the noise.",
190 narration=(
191 "And then there’s David. A professional engineer lost in the giant tech machine, "
192 "drowning in documentation, architectural specs, and complex specs that seem to have no end. "
193 "In the corporate maze, context is the first thing that we lose."
194 ),
195 duration_seconds=None # Split likely (8s + 4s)
196 ),
197
198 # 5. David's Clarity
199 ShotConfig(
200 shot_type=ShotType.WS,
201 environment=CORPORATE_MAZE,
202 action="The Keeper orb flies through David's office. Behind it, a beautiful glowing golden Knowledge Graph appears, physically connecting documents like a magical glowing architecture map.",
203 narration="ContextBytes reveals the invisible threads between documents—transforming a mountain of text into a clear, magical map of how everything flows.",
204 duration_seconds=None,
205 continuity_required=True
206 ),
207
208 # 6. Final Harmony
209 ShotConfig(
210 shot_type=ShotType.WS,
211 environment=GITHUB_GARDEN,
212 action="Emma and David stand on a balcony overlooking the Cloud Stratosphere. They look confident and inspired. The Keeper orb flies toward the camera, merging into the final brand signature.",
213 narration=(
214 "From the student's desk to the corporate boardroom, the path to mastery is now clear. "
215 "Deep simplicity. Modern intelligence. This is your Context, mastered. Welcome to ContextBytes."
216 ),
217 duration_seconds=None,
218 continuity_required=True
219 )
220 ]
221 )
222
223 config = VideoConfig(
224 persona=DEFAULT_PERSONA,
225 style_dna=STYLE_DNA,
226 cinematography=PREMIUM_CINE,
227 output_dir="./contextbytes_production",
228 aspect_ratio="16:9",
229 negative_prompt="photorealistic, 3d render, CGI, grainy, distorted face, bad anatomy, low quality"
230 )
231
232 # Run the production
233 result = await orchestrator.generate_production(
234 scene=scene,
235 base_config=config,
236 output_filename="contextbytes_flagship_dynamic.mp4"
237 )
238
239 if result["success"]:
240 print(
241 f"SUCCESS! Dynamic Flagship video saved to: {result['local_path']}")
242 else:
243 print(f" FAILED: {result.get('error')}")
244
245if __name__ == "__main__":
246 asyncio.run(create_brand_video())
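The shots above set `duration_seconds=None` and rely on the orchestrator to split long narrations into model-sized chunks (the comments mention ~15 s becoming 8 s + 7 s). A minimal sketch of that splitting logic, assuming an 8-second per-clip cap; the words-per-second estimate is my assumption, not a Ministudio constant:

```python
# Sketch of dynamic-duration splitting: estimate how long a narration will
# take to speak, then break it into chunks the video model can handle.
# Both constants below are illustrative assumptions.

WORDS_PER_SECOND = 2.5   # rough speaking rate
MAX_CLIP_SECONDS = 8.0   # assumed per-clip cap of the video model

def estimate_seconds(narration: str) -> float:
    """Crude duration estimate from word count."""
    return len(narration.split()) / WORDS_PER_SECOND

def split_duration(total: float, cap: float = MAX_CLIP_SECONDS) -> list[float]:
    """Split a duration into chunks no longer than `cap`, largest first."""
    chunks = []
    while total > cap:
        chunks.append(cap)
        total -= cap
    if total > 0:
        chunks.append(round(total, 2))
    return chunks

print(split_duration(15.0))  # -> [8.0, 7.0]
```

A 15-second narration becomes an 8 s clip followed by a 7 s clip, matching the "recursive splitting" comment in the shot list.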
## The Road Ahead
We are now entering the age of precise control: the tiny flickers in the background, or the 0.5-second lip-sync lag in narrations.
We will need deeper background masking: locking the environment and generating only the character motions and interactions with objects or props.
We also need to ensure the model's noise is consistent across the whole scene, and to build waveform orchestration: using actual audio waveforms to drive duration and intensity.
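As a starting point for waveform orchestration, the measured length of the narration audio could drive the shot duration directly instead of guessing from the text. A self-contained sketch using only the stdlib `wave` module (the file name is illustrative; a synthetic tone stands in for real narration):

```python
import math
import struct
import wave

# Duration comes straight from the waveform: frames / sample rate.
def audio_duration_seconds(path: str) -> float:
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

# Write a 1.5 s synthetic 440 Hz tone so the sketch is self-contained.
def write_tone(path: str, seconds: float = 1.5, rate: int = 16000) -> None:
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(rate)
        n = int(seconds * rate)
        samples = (int(8000 * math.sin(2 * math.pi * 440 * i / rate)) for i in range(n))
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))

write_tone("narration_stub.wav")
print(audio_duration_seconds("narration_stub.wav"))  # -> 1.5
```

The same reading of the waveform could later drive intensity too, by looking at sample amplitudes rather than just the frame count.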
Ministudio is an open-source framework; the code can be found here.
Check it out, and read the Markdown docs to contribute and add support for more models if you can.