`text-to-video`

HF group: Computer Vision · Status: ❌ not built · candidate for Tier 4 (defer, hardware-bound)

What it is

Generates a video clip from a text prompt. Computationally heavy.

Open-weight models

Model	Params	Released	License	Quality	Notes
Mochi-1	10 B	2024-10	Apache-2.0	Best open-weight quality	24 GB+ VRAM at full precision.
HunyuanVideo	13 B	2024-12	Tencent Hunyuan Community License	Top quality	40-80 GB VRAM full precision; FP8 variants exist.
CogVideoX-5B	5 B	2024	CogVideoX License (the 2 B variant is Apache-2.0)	Strong, runs on consumer GPU	Best Mac-runnable.
LTX-Video	2 B	2024	OpenRAIL-M	Fast (real-time on RTX 4090)	Lightest.
Wan 2.2	5 B dense / 14 B active (MoE)	2025-07	Apache-2.0	Strong	Newer entrant.

Infrastructure required

Inference

❌ Video diffusion runtime (sharing infrastructure with text-to-image).

Input

Plain text prompt.

Output

❌ Video file save (10-100 MB clips). Storage budget matters.

Storage

❌ Weights cache (large: most candidates are 5-13 B parameters, roughly 10-26 GB of weights at 16-bit precision).
Output: fs.user-folder.
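
To make the cache sizing concrete, here is a rough sketch that turns parameter counts into weights-only disk and memory footprints. It assumes the size is simply parameters times bytes per parameter and ignores the text encoder, VAE, and activations, so real requirements will be higher. The function and constant names are illustrative only.

```typescript
// Bytes per parameter for common weight formats.
const BYTES_PER_PARAM = { fp32: 4, fp16: 2, fp8: 1 } as const;
type Precision = keyof typeof BYTES_PER_PARAM;

// Weights-only size in GB (10^9 bytes). Billions of parameters times bytes
// per parameter gives GB directly. Ignores activations and auxiliary models.
function weightsGB(paramsBillions: number, precision: Precision): number {
  return paramsBillions * BYTES_PER_PARAM[precision];
}

// True when the weights alone fit in the given budget (assumed, in GB).
function weightsFit(paramsBillions: number, precision: Precision, budgetGB: number): boolean {
  return weightsGB(paramsBillions, precision) <= budgetGB;
}

// Parameter counts as listed in the model table.
const candidates = [
  { name: "LTX-Video", paramsB: 2 },
  { name: "CogVideoX-5B", paramsB: 5 },
  { name: "Mochi-1", paramsB: 10 },
  { name: "HunyuanVideo", paramsB: 13 },
];

for (const c of candidates) {
  const fp16 = weightsGB(c.paramsB, "fp16");
  const fits = weightsFit(c.paramsB, "fp16", 16);
  console.log(`${c.name}: ${fp16} GB at fp16, fits in 16 GB: ${fits}`);
}
```

Even where the weights fit in 16 GB, the OS and the rest of the pipeline share that memory, so a passing check here is a necessary condition rather than a guarantee.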

Interaction (IPC + SDK)

❌ video.generate({ prompt }) IPC with long-running task management. Generation takes 30 s to 5 min, so progress, cancel, and resume semantics are required.
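
The task semantics above could be sketched as a small state machine. Nothing here is an existing Locara API: `VideoTaskManager`, the state names, and the default step count are all hypothetical, and the model call is replaced by a counter so the example stays self-contained.

```typescript
// Hypothetical task model: these names are illustrative, not an existing SDK.
type TaskState = "queued" | "running" | "paused" | "cancelled" | "done";

interface VideoTask {
  id: string;
  prompt: string;
  state: TaskState;
  stepsDone: number;  // denoising steps completed so far
  stepsTotal: number; // denoising steps requested
}

class VideoTaskManager {
  private tasks = new Map<string, VideoTask>();
  private nextId = 1;

  // Counterpart of video.generate({ prompt }): returns a task id immediately.
  generate(req: { prompt: string; steps?: number }): string {
    const id = `task-${this.nextId++}`;
    this.tasks.set(id, {
      id,
      prompt: req.prompt,
      state: "queued",
      stepsDone: 0,
      stepsTotal: Math.max(1, req.steps ?? 50), // assumed default step count
    });
    return id;
  }

  // Run one denoising step. A real runtime would invoke the model here.
  step(id: string): void {
    const t = this.lookup(id);
    if (t.state === "queued") t.state = "running";
    if (t.state !== "running") return; // paused, cancelled, or done: no-op
    t.stepsDone += 1;
    if (t.stepsDone >= t.stepsTotal) t.state = "done";
  }

  pause(id: string): void {
    const t = this.lookup(id);
    if (t.state === "queued" || t.state === "running") t.state = "paused";
  }

  // Resume keeps stepsDone, so work continues where it stopped.
  resume(id: string): void {
    const t = this.lookup(id);
    if (t.state === "paused") t.state = "running";
  }

  cancel(id: string): void {
    const t = this.lookup(id);
    if (t.state !== "done") t.state = "cancelled";
  }

  state(id: string): TaskState {
    return this.lookup(id).state;
  }

  // Fraction complete in [0, 1], suitable for a progress bar.
  progress(id: string): number {
    const t = this.lookup(id);
    return t.stepsDone / t.stepsTotal;
  }

  private lookup(id: string): VideoTask {
    const t = this.tasks.get(id);
    if (!t) throw new Error(`unknown task: ${id}`);
    return t;
  }
}

// Usage: start a task, advance it, pause, then resume.
const manager = new VideoTaskManager();
const taskId = manager.generate({ prompt: "a paper boat on a stream", steps: 4 });
manager.step(taskId);
manager.pause(taskId);
manager.step(taskId); // ignored while paused
manager.resume(taskId);
manager.step(taskId);
console.log(manager.state(taskId), manager.progress(taskId));
```

Returning a task id instead of awaiting the clip keeps the IPC call short, and resuming from `stepsDone` is only meaningful if the runtime can checkpoint intermediate latents, which this sketch assumes.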

Capabilities (manifest)

capabilities.fs.user-folder write.
capabilities.models[] for the diffusion model.
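
As an illustration of the two declarations above, the sketch below models a manifest and checks it for what a text-to-video app would need. The manifest schema is assumed: only `capabilities.fs.user-folder` and `capabilities.models[]` come from this page, while the `ModelEntry` fields and the model id are hypothetical.

```typescript
// Hypothetical manifest shape; fields beyond capabilities.fs.user-folder
// and capabilities.models[] are assumptions for illustration.
type FsAccess = "read" | "write";

interface ModelEntry {
  id: string;   // hypothetical model identifier
  task: string; // e.g. "text-to-video"
}

interface Manifest {
  capabilities: {
    fs?: { "user-folder"?: FsAccess };
    models?: ModelEntry[];
  };
}

// Returns the capabilities a text-to-video app would still need to declare.
function missingForTextToVideo(m: Manifest): string[] {
  const missing: string[] = [];
  if (m.capabilities.fs?.["user-folder"] !== "write") {
    missing.push("capabilities.fs.user-folder write");
  }
  const hasModel = (m.capabilities.models ?? []).some(
    (entry) => entry.task === "text-to-video",
  );
  if (!hasModel) missing.push("capabilities.models[] (text-to-video)");
  return missing;
}

const exampleManifest: Manifest = {
  capabilities: {
    fs: { "user-folder": "write" },
    models: [{ id: "example-video-model", task: "text-to-video" }],
  },
};

console.log(missingForTextToVideo(exampleManifest));
```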

Gaps

The whole stack is missing. Realistically this is the last modality v1 should ship: the RAM/VRAM cost is high enough that it doesn’t fit Locara’s “runs on a 16 GB MacBook Air” promise.
