Locara

any-to-any (multimodal-omni)

HF group: Multimodal · Status: ⏳ skeleton crate

What it is

Any-input → text + audio. The “GPT-4o” shape: a single model that natively handles text, image, audio, and video on both sides, as distinct from a pipeline of specialists.

Highest-leverage single integration. Successfully wiring Qwen3.5-Omni gets us decent coverage for seven modalities at once: text-to-text, text-to-text-thinking, image-text-to-text, video-text-to-text, audio-text-to-text, voice-to-voice, and this one.

Open-weight models

| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Qwen2.5-Omni-7B | 7 B | 2025 | Apache-2.0 | First open omni | Audio in via mtmd; audio out gated. |
| Qwen3-Omni-30B-A3B | 30 B / 3 B active | 2026-Q1 | Apache-2.0 | Real-time stream w/ speech out | Thinker-Talker MoE architecture. |
| Qwen3.5-Omni | ~30 B | 2026-03 | Apache-2.0 | 113 langs ASR, 36 langs speech-out | Hybrid-Attention MoE. Best open omni today. Native turn-taking intent recognition. |

Infrastructure required

Inference

  • crates/locara-qwen-omni is a skeleton. Audio input via llama.cpp mtmd is planned but not wired. Audio output is blocked because llama.cpp’s mtmd does not support it yet (tracked in PR ggml-org/llama.cpp#13784).
  • 🟡 Realistically the closest path is to wire Qwen3.5-Omni through a moshi_mlx-style subprocess (proven pattern in locara-moshi).
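
As a rough illustration of the subprocess pattern mentioned above, the sketch below spawns a helper process and exchanges one line per request over stdin/stdout. The `Helper` type, its methods, and the line-delimited protocol are assumptions made for this example; they are not the actual locara-moshi interface, and a real helper would likely stream audio and tokens rather than single lines.

```rust
use std::io::{BufRead, BufReader, Write};
use std::process::{Child, ChildStdin, ChildStdout, Command, Stdio};

/// Hypothetical handle to a long-lived helper process (e.g. a Python stack).
pub struct Helper {
    child: Child,
    stdin: ChildStdin,
    stdout: BufReader<ChildStdout>,
}

impl Helper {
    /// Spawn the helper with piped stdin/stdout so the host can talk to it.
    pub fn spawn(mut cmd: Command) -> std::io::Result<Self> {
        let mut child = cmd.stdin(Stdio::piped()).stdout(Stdio::piped()).spawn()?;
        let stdin = child.stdin.take().expect("stdin was piped");
        let stdout = BufReader::new(child.stdout.take().expect("stdout was piped"));
        Ok(Self { child, stdin, stdout })
    }

    /// Send one request line and block until one reply line comes back.
    /// Assumed protocol: exactly one reply line per request line.
    pub fn request(&mut self, line: &str) -> std::io::Result<String> {
        writeln!(self.stdin, "{line}")?;
        self.stdin.flush()?;
        let mut reply = String::new();
        self.stdout.read_line(&mut reply)?;
        Ok(reply.trim_end().to_string())
    }

    /// Close stdin so the helper sees EOF, then wait for it to exit.
    pub fn shutdown(mut self) -> std::io::Result<()> {
        drop(self.stdin);
        self.child.wait()?;
        Ok(())
    }
}
```

For a quick local check on a Unix-like system, `Helper::spawn(Command::new("cat"))` acts as an echo helper: each `request` returns the line that was sent.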

Input

  • Union of every modality’s input: text, mic + system-audio, image (file/camera), video (file).
  • ✅ Audio capture (locara-microphone, locara-screencapture-audio).
  • ❌ Image + video input pipelines.

Output

  • 🟡 Multiplexed: audio playback queue (shared with voice-to-voice) + streaming token Channel + image/video file save.
  • ❌ A new OmniBackend trait with an input-shape enum (more general than today’s VoiceBackend).
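
One possible shape for such a trait is sketched below. Only the name `OmniBackend` and the idea of an input-shape enum come from the notes above; every variant, method, and error type here is hypothetical, and `EchoBackend` is a toy stand-in used only to show how an unsupported input would be rejected.

```rust
/// Hypothetical input shapes an omni session could accept.
#[derive(Debug, Clone, PartialEq)]
pub enum OmniInput {
    Text(String),
    /// Mono PCM samples plus their sample rate.
    Audio { samples: Vec<f32>, sample_rate_hz: u32 },
    /// Encoded image bytes (for example PNG or JPEG).
    Image(Vec<u8>),
    /// One encoded video frame with its presentation timestamp.
    VideoFrame { data: Vec<u8>, timestamp_ms: u64 },
}

/// Coarse input category, used to advertise what a backend supports.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InputKind { Text, Audio, Image, Video }

impl OmniInput {
    pub fn kind(&self) -> InputKind {
        match self {
            OmniInput::Text(_) => InputKind::Text,
            OmniInput::Audio { .. } => InputKind::Audio,
            OmniInput::Image(_) => InputKind::Image,
            OmniInput::VideoFrame { .. } => InputKind::Video,
        }
    }
}

/// Hypothetical output events: streamed text tokens and audio chunks.
#[derive(Debug, Clone, PartialEq)]
pub enum OmniOutput {
    Token(String),
    AudioChunk(Vec<f32>),
}

#[derive(Debug, PartialEq)]
pub enum OmniError { Unsupported(InputKind) }

pub trait OmniBackend {
    /// Input kinds this backend can consume.
    fn supported_inputs(&self) -> &[InputKind];
    /// Feed one input and collect whatever output it produced.
    fn push(&mut self, input: OmniInput) -> Result<Vec<OmniOutput>, OmniError>;
}

/// Toy backend: accepts text only and echoes each word back as a token.
pub struct EchoBackend;

impl OmniBackend for EchoBackend {
    fn supported_inputs(&self) -> &[InputKind] { &[InputKind::Text] }

    fn push(&mut self, input: OmniInput) -> Result<Vec<OmniOutput>, OmniError> {
        match input {
            OmniInput::Text(s) => Ok(s
                .split_whitespace()
                .map(|w| OmniOutput::Token(w.to_string()))
                .collect()),
            other => Err(OmniError::Unsupported(other.kind())),
        }
    }
}
```

Keeping the input as a single enum means a voice-only backend and a full omni backend can share one trait and differ only in what `supported_inputs` reports.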

Storage

  • ❌ Weights cache (large — 30B MoE).
  • Per-session state holds the cross-modal context.
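
To put a number on "large", a back-of-the-envelope estimate is parameters × bits per weight ÷ 8. The sketch below assumes a 30 B parameter count (from the table above) and treats the quantization widths as illustrative; it ignores metadata, any separate audio or vision encoders, and runtime buffers. For an MoE model the cache must hold all experts, not just the active ones.

```rust
/// Approximate on-disk size of the weights, in bytes.
/// `bits_per_weight` is an assumed average (e.g. 4 for a 4-bit quantization).
pub fn weight_bytes(params: u64, bits_per_weight: u32) -> u64 {
    params * bits_per_weight as u64 / 8
}

/// Same estimate expressed in decimal gigabytes (1 GB = 10^9 bytes).
pub fn weight_gb(params: u64, bits_per_weight: u32) -> f64 {
    weight_bytes(params, bits_per_weight) as f64 / 1e9
}
```

With these assumptions, `weight_gb(30_000_000_000, 4)` is 15.0 and `weight_gb(30_000_000_000, 16)` is 60.0.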

Interaction (IPC + SDK)

  • omni.session_* IPC family (wider than voice.session_* — accepts image/video frames as input).
  • Picker UI exposes “what can this model do” without listing 20 capability flags. The any-to-any modality declaration is the right grain.
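
A minimal sketch of how the two IPC families might be told apart is shown below. The operation suffixes (such as `start`) and the rule that voice sessions take audio frames only are assumptions for illustration; the notes above only establish that `omni.session_*` is wider than `voice.session_*` and accepts image and video frames.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Family { Voice, Omni }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FrameKind { Audio, Image, Video }

/// Split a method name like "omni.session_start" into its family and
/// operation. Returns None for anything outside the session_* families.
pub fn parse_session_method(method: &str) -> Option<(Family, &str)> {
    let (family, rest) = method.split_once('.')?;
    let op = rest.strip_prefix("session_")?;
    if op.is_empty() {
        return None;
    }
    let family = match family {
        "voice" => Family::Voice,
        "omni" => Family::Omni,
        _ => return None,
    };
    Some((family, op))
}

/// Whether a session family accepts a given input frame kind.
/// Assumption: voice sessions take audio only; omni sessions take everything.
pub fn accepts(family: Family, frame: FrameKind) -> bool {
    match family {
        Family::Voice => frame == FrameKind::Audio,
        Family::Omni => true,
    }
}
```

A dispatcher could call `parse_session_method` first and then use `accepts` to reject, say, a video frame sent to a voice session before it reaches the backend.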

Capabilities (manifest)

  • All inputs declared: device.microphone, device.camera (cool-down semantics needed), fs.user-selected.
  • All outputs: device.speaker (cool-down), file-save to fs.user-folder.
  • models[] for the omni model.
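
The declarations above can be sketched as a mapping from the sources and sinks an app uses to the capability strings it must list. The capability names are taken from the bullets above; the `Source` and `Sink` enums and the function are hypothetical, and cool-down semantics and the `models[]` entry are left out.

```rust
use std::collections::BTreeSet;

/// Hypothetical input sources an omni app might use.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Source { Microphone, Camera, UserSelectedFile }

/// Hypothetical output sinks an omni app might use.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sink { Speaker, FileSave }

/// Collect the capability strings a manifest would need to declare.
/// A BTreeSet removes duplicates and keeps the result in a stable order.
pub fn required_capabilities(sources: &[Source], sinks: &[Sink]) -> BTreeSet<&'static str> {
    let mut caps = BTreeSet::new();
    for source in sources {
        caps.insert(match source {
            Source::Microphone => "device.microphone",
            Source::Camera => "device.camera",
            Source::UserSelectedFile => "fs.user-selected",
        });
    }
    for sink in sinks {
        caps.insert(match sink {
            Sink::Speaker => "device.speaker",
            Sink::FileSave => "fs.user-folder",
        });
    }
    caps
}
```

An app that only does voice in and voice out would end up with just `device.microphone` and `device.speaker`, which is the kind of narrowing a picker UI could surface.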

Gaps

Most of it. Realistically, the closest path is to wire Qwen3.5-Omni through the moshi_mlx-style subprocess pattern, using the upstream Qwen demo (full Python stack inside the helper). The alternative is to wait for audio-output support in llama.cpp’s mtmd.

See also