text-to-video
HF group: Computer Vision · Status: ❌ not built · candidate for Tier 4 (defer, hardware-bound)
What it is
Text → video clip. Computationally heavy.
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Mochi-1 | 10 B | 2024-10 | Apache-2.0 | Best open-weight quality | 24 GB+ VRAM with reduced precision and offloading; considerably more at full precision. |
| HunyuanVideo | 13 B | 2024-12 | Tencent Hunyuan Community License (custom) | Top quality | 40-80 GB VRAM at full precision; FP8 variants exist. |
| CogVideoX-5B | 5 B | 2024-08 | CogVideoX License (custom; the 2B variant is Apache-2.0) | Strong, runs on consumer GPU | Best Mac-runnable. |
| LTX-Video | 2 B | 2024-11 | OpenRAIL-M | Fast (real-time on RTX 4090) | Lightest. |
| Wan 2.2 | 5 B (TI2V) / 14 B active (A14B MoE) | 2025-07 | Apache-2.0 | Strong | Newer entrant. |
Infrastructure required
Inference
- ❌ Video diffusion runtime (shares infrastructure with text-to-image).
Input
- Plain text prompt.
Output
- ❌ Video file save (10-100 MB clips). Storage budget matters.
Storage
- ❌ Weights cache (large: 2-13 B parameters for most models above, at roughly 2 bytes per parameter in 16-bit precision).
- Output: `fs.user-folder`.
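To make the storage budget concrete, here is a rough back-of-envelope sketch. The function names are hypothetical, sizes use decimal units, and the weight estimate counts only the diffusion model's parameters (text encoder and VAE weights would add to it).

```typescript
// Back-of-envelope storage estimates. All function names here are
// hypothetical; sizes use decimal units (1 GB = 1000 MB = 1e9 bytes).

// Weight size in GB: (params in billions × 1e9) × bytesPerParam / 1e9.
// The 1e9 factors cancel, so the result is just the product.
function estimateWeightsGb(paramsBillions: number, bytesPerParam: number): number {
  return paramsBillions * bytesPerParam;
}

// How many clips of a given size fit in an output budget.
function clipsThatFit(budgetGb: number, clipMb: number): number {
  return Math.floor((budgetGb * 1000) / clipMb);
}

console.log(estimateWeightsGb(13, 2)); // 13 B model at 16-bit precision
console.log(estimateWeightsGb(2, 2));  // 2 B model at 16-bit precision
console.log(clipsThatFit(10, 100));    // 100 MB clips in a 10 GB budget
```

Under these assumptions a 13 B model needs about 26 GB for weights at 16-bit precision, a 2 B model about 4 GB, and a 10 GB output budget holds 100 clips at the 100 MB upper bound.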
Interaction (IPC + SDK)
- ❌ `video.generate({ prompt })` IPC with long-running task management: generation takes 30 s to 5 min. Progress, cancel, and resume semantics required.
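One possible shape for the task lifecycle behind `video.generate` is sketched below. Every name in it (`GenerationTask`, `GenerationProgress`, `interrupt`) is hypothetical, and "resume" is interpreted as continuing an interrupted task from its last completed step, which the requirement above does not define. No actual model is run; `advance()` stands in for one denoising step.

```typescript
// Hypothetical task model; none of these names come from an existing SDK.
type TaskState = "running" | "interrupted" | "cancelled" | "done";

interface GenerationProgress {
  taskId: string;
  step: number;       // steps completed so far
  totalSteps: number; // steps needed to finish the clip
}

class GenerationTask {
  state: TaskState = "running";
  step = 0;
  readonly taskId: string;
  readonly totalSteps: number;
  private readonly onProgress: (p: GenerationProgress) => void;

  constructor(taskId: string, totalSteps: number, onProgress: (p: GenerationProgress) => void) {
    this.taskId = taskId;
    this.totalSteps = totalSteps;
    this.onProgress = onProgress;
  }

  // Stand-in for one denoising step. Returns true while more work remains.
  advance(): boolean {
    if (this.state !== "running") return false;
    this.step += 1;
    this.onProgress({ taskId: this.taskId, step: this.step, totalSteps: this.totalSteps });
    const finished = this.step >= this.totalSteps;
    if (finished) this.state = "done";
    return !finished;
  }

  // Assumed semantics: an interruption (e.g. app closed) keeps progress.
  interrupt(): void {
    if (this.state === "running") this.state = "interrupted";
  }

  // Resume continues from the last completed step rather than restarting.
  resume(): void {
    if (this.state === "interrupted") this.state = "running";
  }

  // Cancel is terminal: a cancelled task cannot be resumed.
  cancel(): void {
    if (this.state === "running" || this.state === "interrupted") this.state = "cancelled";
  }
}
```

Keeping cancel terminal and interrupt resumable is a design choice made for this sketch; it lets the UI distinguish "the user gave up" from "the app was closed mid-generation".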
Capabilities (manifest)
- `capabilities.fs.user-folder` write.
- `capabilities.models[]` for the diffusion model.
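As an illustration only, the two capabilities could be declared and checked along these lines. The manifest shape is a guess: only `fs.user-folder` and `models[]` come from the text above, while `AppManifest`, `ModelCapability`, the `task` field, and `canGenerate` are hypothetical.

```typescript
// Hypothetical manifest shape. Only `fs.user-folder` and `models[]` come
// from the text above; every other field name is an assumption.
interface ModelCapability {
  id: string;   // illustrative model identifier
  task: string; // e.g. "text-to-video"
}

interface AppManifest {
  capabilities: {
    fs?: { "user-folder"?: "read" | "write" };
    models?: ModelCapability[];
  };
}

const textToVideoManifest: AppManifest = {
  capabilities: {
    fs: { "user-folder": "write" },
    models: [{ id: "ltx-video", task: "text-to-video" }],
  },
};

// True only if the manifest both grants write access to the user folder
// and declares at least one model for the given task.
function canGenerate(manifest: AppManifest, task: string): boolean {
  const caps = manifest.capabilities;
  const canWrite = caps.fs?.["user-folder"] === "write";
  const hasModel = (caps.models ?? []).some((m) => m.task === task);
  return canWrite && hasModel;
}
```

Checking both capabilities up front would let the runtime refuse a generation request before loading any weights.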
Gaps
Whole stack. Realistically the last modality v1 should ship — RAM/VRAM cost is high enough that it doesn’t fit Locara’s “runs on a 16 GB MacBook Air” promise.