Locara

text-to-video

HF group: Computer Vision · Status: ❌ not built · candidate for Tier 4 (defer, hardware-bound)

What it is

Text → video clip. Computationally heavy.

Open-weight models

| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Mochi-1 | 10 B | 2024-10 | Apache-2.0 | Best open-weight quality | 24 GB+ VRAM at full precision. |
| HunyuanVideo | 13 B | 2024-12 | Tencent | Top quality | 40-80 GB VRAM full precision; FP8 variants exist. |
| CogVideoX-5B | 5 B | 2024 | Apache-2.0 | Strong, runs on consumer GPU | Best Mac-runnable. |
| LTX-Video | 2 B | 2024 | OpenRAIL-M | Fast (real-time on RTX 4090) | Lightest. |
| Wan 2.2 | ~10 B | 2026 | Apache-2.0 | Strong | Newer entrant. |

Infrastructure required

Inference

  • ❌ Video diffusion runtime (sharing infrastructure with text-to-image).

Input

  • Plain text prompt.

Output

  • ❌ Video file save (10-100 MB clips). Storage budget matters.

Storage

  • ❌ Weights cache (large: 5-13 B parameters, roughly 10-26 GB for the diffusion weights alone at 16-bit precision).
  • Output: fs.user-folder.
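
The cache size follows from parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming 1 GB = 1e9 bytes and counting only the diffusion model's own weights (text encoders and VAEs add more); the names `BYTES_PER_PARAM` and `weightsSizeGB` are illustrative, not part of any Locara API:

```typescript
// Bytes needed per parameter for common weight formats.
const BYTES_PER_PARAM = { fp32: 4, fp16: 2, fp8: 1 } as const;
type Precision = keyof typeof BYTES_PER_PARAM;

// Approximate on-disk size of the weights in GB (1 GB = 1e9 bytes).
// One billion parameters at N bytes each is N GB, so the math is a product.
function weightsSizeGB(paramsBillions: number, precision: Precision): number {
  return paramsBillions * BYTES_PER_PARAM[precision];
}

// Hypothetical helper: does a model's weight file fit a given cache budget?
function fitsCacheBudget(
  paramsBillions: number,
  precision: Precision,
  budgetGB: number,
): boolean {
  return weightsSizeGB(paramsBillions, precision) <= budgetGB;
}
```

Under these assumptions a 13 B model needs about 26 GB at fp16 and about 13 GB at fp8, which is why quantized variants matter for the cache budget.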

Interaction (IPC + SDK)

  • video.generate({ prompt }) IPC with long-running task management: generation takes 30 s to 5 min, so progress, cancel, and resume semantics are required.
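
One way to reason about the progress, cancel, and resume semantics is as a small state machine that the IPC layer drives. The sketch below is only an assumption about how such a task could be modeled: the type names, states, and events are hypothetical and not a confirmed Locara design.

```typescript
// Hypothetical task model for video.generate; none of these names are
// confirmed Locara APIs.
type TaskState = "queued" | "running" | "paused" | "done" | "cancelled" | "failed";

interface VideoTask {
  id: string;
  prompt: string;
  state: TaskState;
  completedSteps: number; // denoising steps finished so far
  totalSteps: number;
}

type TaskEvent =
  | { kind: "start" }
  | { kind: "progress"; completedSteps: number }
  | { kind: "pause" }
  | { kind: "resume" }
  | { kind: "cancel" }
  | { kind: "fail" };

function createTask(id: string, prompt: string, totalSteps: number): VideoTask {
  return { id, prompt, state: "queued", completedSteps: 0, totalSteps };
}

// Pure transition function: returns the next task snapshot for an event.
function applyEvent(task: VideoTask, event: TaskEvent): VideoTask {
  // Terminal states ignore all further events.
  if (task.state === "done" || task.state === "cancelled" || task.state === "failed") {
    return task;
  }
  switch (event.kind) {
    case "start":
      return task.state === "queued" ? { ...task, state: "running" } : task;
    case "progress": {
      if (task.state !== "running") return task;
      // Clamp so progress never moves backwards or beyond the total.
      const completedSteps = Math.min(
        task.totalSteps,
        Math.max(task.completedSteps, event.completedSteps),
      );
      const state: TaskState = completedSteps === task.totalSteps ? "done" : "running";
      return { ...task, completedSteps, state };
    }
    case "pause":
      return task.state === "running" ? { ...task, state: "paused" } : task;
    case "resume":
      // Resume continues from the last recorded step rather than restarting.
      return task.state === "paused" ? { ...task, state: "running" } : task;
    case "cancel":
      return { ...task, state: "cancelled" };
    case "fail":
      return { ...task, state: "failed" };
  }
}

// Fraction in [0, 1] suitable for a progress bar.
function progressFraction(task: VideoTask): number {
  return task.totalSteps === 0 ? 0 : task.completedSteps / task.totalSteps;
}
```

Keeping the transition function pure means the same logic can run on either side of the IPC boundary, and a cancelled or failed task can never be revived by a late progress message.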

Capabilities (manifest)

  • capabilities.fs.user-folder write.
  • capabilities.models[] for the diffusion model.
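
To make the two entries concrete, here is a sketch of what a manifest declaring them might look like, written as a TypeScript object with a small check. Only `capabilities.fs.user-folder` and `capabilities.models[]` come from the list above; the field shapes, the `task` key, and the model id are assumptions for illustration.

```typescript
// Hypothetical manifest shape; everything beyond the two capability paths
// named in the text is an assumption.
interface AppManifest {
  capabilities: {
    fs?: { "user-folder"?: "read" | "write" };
    models?: { id: string; task: string }[];
  };
}

const manifest: AppManifest = {
  capabilities: {
    fs: { "user-folder": "write" },
    // Placeholder id, not a real registry entry.
    models: [{ id: "example-video-model", task: "text-to-video" }],
  },
};

// Lists which of the capabilities text-to-video needs are absent.
function missingForTextToVideo(m: AppManifest): string[] {
  const missing: string[] = [];
  if (m.capabilities.fs?.["user-folder"] !== "write") {
    missing.push("capabilities.fs.user-folder write");
  }
  const hasModel = (m.capabilities.models ?? []).some(
    (entry) => entry.task === "text-to-video",
  );
  if (!hasModel) {
    missing.push("capabilities.models[] text-to-video entry");
  }
  return missing;
}
```

A check like this could run at install time so that an app requesting video generation fails early when either capability is undeclared.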

Gaps

Whole stack. Realistically the last modality v1 should ship: the RAM/VRAM cost is high enough that it doesn’t fit Locara’s “runs on a 16 GB MacBook Air” promise.
