Locara

video-text-to-text (video Q&A / description)

HF group: Multimodal · Status: ❌ not built

What it is

Takes a video, optionally with a text question, and returns text. Example prompts: “What happens in this clip?”, “When does the person sit down?”.

Open-weight models

Model | Params | Released | License | Quality | Notes
Qwen2.5-VL-7B | 7 B | 2025 | Apache-2.0 | Strong on short clips | 1-FPS sampling typical.
MiniCPM-V 8B | 8 B | 2025 | Apache-2.0 | Lightweight video | Edge-friendly.
LLaVA-NeXT-Video-7B | 7 B | 2024 | Apache-2.0 | Reasonable | Older.
Qwen3.5-Omni | ~30 B / 3 B active | 2026-03 | Apache-2.0 | 400+ s of 720p video at 1 FPS | Same model as the omni below.

Infrastructure required

Inference

Input

  • Video input pipeline: frame sampler (typically 1 FPS), audio splitter for full-video understanding.
  • File picker via fs.user-selected.
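
As a rough illustration of the frame sampler, the sketch below computes the timestamps at which frames would be pulled from a clip. The function name sampleTimestamps and the maxFrames cap are hypothetical and not part of any Locara API described here; decoding the frames themselves is out of scope.

```typescript
// Hypothetical helper (an assumption, not an existing Locara function).
// Returns the timestamps, in seconds, at which frames would be sampled.
function sampleTimestamps(
  durationSec: number,
  fps = 1,              // 1 FPS is the typical rate mentioned above
  maxFrames = Infinity  // assumed cap for models with a limited frame budget
): number[] {
  if (!(durationSec > 0) || !(fps > 0) || !(maxFrames >= 1)) return [];

  // Frames needed at the requested rate, starting at t = 0.
  const natural = Math.ceil(durationSec * fps);
  const count = Math.min(natural, Math.floor(maxFrames));

  // Under the cap, keep the requested rate; over it, spread frames evenly.
  const step = count === natural ? 1 / fps : durationSec / count;
  return Array.from({ length: count }, (_, i) => i * step);
}
```

For example, sampleTimestamps(10) yields ten timestamps from 0 to 9, and sampleTimestamps(400, 1, 100) spreads 100 frames four seconds apart.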

Output

  • Streaming token Channel (text answer).
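
A minimal sketch of consuming a streamed answer, assuming the token Channel can be read as an AsyncIterable<string>. That shape is an assumption, and collectAnswer and fakeTokens are hypothetical names used only for illustration.

```typescript
// Assumption: the streaming token Channel is consumable as AsyncIterable<string>.
// Accumulates tokens into the full answer and optionally reports each token
// as it arrives (for example, to update a UI incrementally).
async function collectAnswer(
  tokens: AsyncIterable<string>,
  onToken?: (token: string) => void
): Promise<string> {
  let answer = "";
  for await (const token of tokens) {
    answer += token;
    onToken?.(token);
  }
  return answer;
}

// Stand-in token source for illustration only; a real stream would come
// from the model runtime.
async function* fakeTokens(): AsyncGenerator<string> {
  yield "The person ";
  yield "sits down ";
  yield "near the end.";
}
```

Calling collectAnswer(fakeTokens(), t => render(t)) would invoke the callback once per token and resolve with the concatenated text; render is a placeholder for whatever the app does with partial output.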

Storage

  • ❌ Weights cache.
  • Per-session: extracted frames + audio embeddings cached on disk.

Interaction (IPC + SDK)

  • video.ask({ path, question }) IPC.
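
The request shape can be sketched as a small type plus a normalizer. Only the path and question field names come from the call above; the field types, the default question text, and the normalizeVideoAsk helper are assumptions for illustration.

```typescript
// Request shape inferred from video.ask({ path, question }).
// The question is treated as optional, matching the task description.
interface VideoAskRequest {
  path: string;
  question?: string;
}

// Assumed fallback prompt when no question is supplied (description mode).
const DEFAULT_QUESTION = "Describe what happens in this video.";

// Hypothetical helper: validates the path and fills in the default question.
function normalizeVideoAsk(req: VideoAskRequest): Required<VideoAskRequest> {
  const path = req.path.trim();
  if (path.length === 0) {
    throw new Error("video.ask: path must not be empty");
  }
  const question = req.question?.trim() || DEFAULT_QUESTION;
  return { path, question };
}
```

Normalizing on the caller side keeps the IPC payload uniform, so the handler can treat plain description and question answering as the same request.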

Capabilities (manifest)

  • capabilities.fs.user-selected for input.
  • capabilities.models[] for the VLM.

Gaps

  • Video input pipeline (frame sampler, audio splitter).
  • VLM inference for video models.

See also