Locara

image-text-to-video

HF group: Multimodal · Status: ❌ not built

What it is

Image + text prompt → video. Animates a still image according to an instruction (e.g. “make this person wave”). Distinct from image-to-video, which takes no text instruction.

Open-weight models

Model               | Params | Released | License    | Quality                   | Notes
CogVideoX-5B-I2V    | 5 B    | 2024     | Apache-2.0 | Best Mac-runnable I2V     | Follows text prompt for motion better than SVD.
CogVideoX1.5-5B-I2V | 5 B    | 2025     | Apache-2.0 | 10 s clips, any resolution | Higher fidelity than 1.0.
Wan 2.2 I2V         | ~10 B  | 2026     | Apache-2.0 | Strong                    | Cloud-GPU friendly.
HunyuanVideo I2V    | 13 B   | 2025     | Tencent    | Top quality               | 24 GB+ VRAM.

Infrastructure required

Inference

  • ❌ Diffusion runtime + video diffusion specifics.

Input

  • ❌ Image input pipeline.
  • Text prompt.

Output

  • ❌ Video file save (10–100 MB clips). Buffer the clip locally, then move it atomically into fs.user-folder.
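
The buffer-then-move step could look roughly like the sketch below. It assumes a Node-style runtime; moveIntoUserFolder and the temp directories are hypothetical names for illustration, not part of any existing Locara API. A rename is atomic only within one filesystem, so the sketch falls back to copying into a .partial file beside the destination and renaming that when the staging area is on a different volume.

```typescript
import {
  copyFileSync, existsSync, mkdirSync, mkdtempSync,
  readFileSync, renameSync, unlinkSync, writeFileSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { basename, join } from "node:path";

// Hypothetical helper: move a fully written staged clip into the user folder
// so that readers never observe a half-written file at the final path.
function moveIntoUserFolder(stagedPath: string, userFolder: string): string {
  mkdirSync(userFolder, { recursive: true });
  const finalPath = join(userFolder, basename(stagedPath));
  try {
    // Same filesystem: a single rename is atomic.
    renameSync(stagedPath, finalPath);
  } catch (err) {
    // EXDEV means the staging area is on another volume; anything else is a real error.
    if ((err as NodeJS.ErrnoException).code !== "EXDEV") throw err;
    const partialPath = `${finalPath}.partial`;
    copyFileSync(stagedPath, partialPath); // slow copy happens under a temporary name
    renameSync(partialPath, finalPath);    // atomic publish within the destination volume
    unlinkSync(stagedPath);
  }
  return finalPath;
}

// Usage with throwaway directories standing in for the local buffer and fs.user-folder.
const staging = mkdtempSync(join(tmpdir(), "locara-staging-"));
const userFolder = mkdtempSync(join(tmpdir(), "locara-user-folder-"));
const staged = join(staging, "clip.mp4");
writeFileSync(staged, new Uint8Array([0, 1, 2, 3])); // placeholder bytes, not a real video
const saved = moveIntoUserFolder(staged, userFolder);
console.log(existsSync(saved), existsSync(staged), readFileSync(saved).length);
```

Staging outside the user folder keeps partial output out of the user's view if generation fails or is cancelled midway.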

Storage

  • ❌ Weights cache.
  • Output: fs.user-folder.

Interaction (IPC + SDK)

  • video.animate({ image, prompt }) IPC with long-running task progress (generation takes 30 s – 5 min).
  • ❌ Cancel + resume semantics.
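
One possible shape for the call, sketched under assumptions: only video.animate({ image, prompt }) comes from the spec above. The options object, the progress payload, the use of AbortSignal for cancellation, and the fake backend loop are all hypothetical. Resume is deliberately left out because its semantics are not yet defined.

```typescript
type AnimateRequest = { image: string; prompt: string };
type AnimateProgress = { step: number; totalSteps: number }; // assumed payload
type AnimateResult = { outputPath: string };                 // assumed result

interface AnimateOptions {
  signal?: AbortSignal;                          // assumed cancel mechanism
  onProgress?: (p: AnimateProgress) => void;     // assumed progress callback
}

// Stand-in for the real IPC backend: a short loop instead of minutes of diffusion steps.
async function animate(req: AnimateRequest, opts: AnimateOptions = {}): Promise<AnimateResult> {
  const totalSteps = 5;
  for (let step = 1; step <= totalSteps; step++) {
    // Check for cancellation between steps so a long task can stop early.
    if (opts.signal?.aborted) throw new Error("cancelled");
    await new Promise((resolve) => setTimeout(resolve, 1));
    opts.onProgress?.({ step, totalSteps });
  }
  return { outputPath: "clip.mp4" };
}

const video = { animate };

// Usage: report progress, then cancel after the second step.
async function runDemo(): Promise<{ seen: number[]; cancelled: boolean }> {
  const controller = new AbortController();
  const seen: number[] = [];
  let cancelled = false;
  try {
    await video.animate(
      { image: "still.png", prompt: "make this person wave" },
      {
        signal: controller.signal,
        onProgress: ({ step }) => {
          seen.push(step);
          if (step === 2) controller.abort();
        },
      },
    );
  } catch {
    cancelled = true;
  }
  return { seen, cancelled };
}

runDemo().then((r) => console.log(r.seen, r.cancelled));
```

Checking the signal between steps keeps cancellation latency bounded by one step rather than by the whole 30 s – 5 min generation.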

Capabilities (manifest)

  • capabilities.fs.user-selected for input.
  • capabilities.fs.user-folder for output.
  • capabilities.models[] for the I2V model.
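
As an illustration, the three capabilities above might be declared as follows. The manifest shape is an assumption: the key names mirror the bullets, but the boolean values, the model ID format, and the missingCapabilities helper are hypothetical.

```typescript
// Hypothetical manifest shape; only the capability names come from the list above.
interface Manifest {
  capabilities: {
    fs: { "user-selected"?: boolean; "user-folder"?: boolean };
    models: string[];
  };
}

const manifest: Manifest = {
  capabilities: {
    fs: { "user-selected": true, "user-folder": true }, // image input, video output
    models: ["CogVideoX-5B-I2V"],                       // assumed ID format
  },
};

// Hypothetical check: list the capabilities this task needs that the manifest lacks.
function missingCapabilities(m: Manifest): string[] {
  const missing: string[] = [];
  if (!m.capabilities.fs["user-selected"]) missing.push("fs.user-selected");
  if (!m.capabilities.fs["user-folder"]) missing.push("fs.user-folder");
  if (m.capabilities.models.length === 0) missing.push("models[]");
  return missing;
}

console.log(missingCapabilities(manifest)); // prints []
```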

Gaps

Same as text-to-video: diffusion runtime, image input pipeline, video output IPC, long-running task progress IPC.

See also