`any-to-any` (multimodal-omni)

HF group: Multimodal · Status: ⏳ skeleton crate

What it is

Any input → text + audio. The “GPT-4o” shape: a single model that natively handles text, image, audio, and video on both sides, as opposed to a pipeline of specialists.

Highest-leverage single integration. Successfully wiring Qwen3.5-Omni gets us decent coverage for seven modalities at once: text-to-text, text-to-text-thinking, image-text-to-text, video-text-to-text, audio-text-to-text, voice-to-voice, and any-to-any itself.

Open-weight models

Model	Params	Released	License	Quality	Notes
Qwen2.5-Omni-7B	7 B	2025	Apache-2.0	First open omni	Audio in via `mtmd`; audio out gated.
Qwen3-Omni-30B-A3B	30 B / 3 B active	2025-09	Apache-2.0	Real-time stream w/ speech out	Thinker-Talker MoE architecture.
Qwen3.5-Omni	~30 B	2026-03	Apache-2.0	113 langs ASR, 36 langs speech-out	Hybrid-Attention MoE. Best open omni today. Native turn-taking intent recognition.

Infrastructure required

Inference

⏳ crates/locara-qwen-omni is a skeleton. Audio input via llama.cpp mtmd is planned but not wired. Audio output is blocked because llama.cpp’s mtmd does not support it (tracked in PR ggml-org/llama.cpp#13784).
🟡 Realistically, the closest path is to run Qwen3.5-Omni through a moshi_mlx-style subprocess (a pattern already proven in locara-moshi).
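A minimal sketch of what the subprocess pattern could look like on the Rust side, assuming the helper speaks one request line in, one reply line out over stdin/stdout. The `Helper` type, the `exchange` function, and the line-delimited framing are hypothetical and are not taken from locara-moshi; only `std::process` and `std::io` calls are real.

```rust
use std::io::{self, BufRead, BufReader, Write};
use std::process::{Child, ChildStdin, ChildStdout, Command, Stdio};

/// Hypothetical handle to a long-lived Python helper process.
pub struct Helper {
    child: Child,
    stdin: ChildStdin,
    stdout: BufReader<ChildStdout>,
}

/// Write one request line, then block until one reply line arrives.
/// Generic over the streams so the framing can be checked without a real process.
pub fn exchange<W: Write, R: BufRead>(
    to_helper: &mut W,
    from_helper: &mut R,
    request: &str,
) -> io::Result<String> {
    to_helper.write_all(request.as_bytes())?;
    to_helper.write_all(b"\n")?;
    to_helper.flush()?;

    let mut reply = String::new();
    let bytes_read = from_helper.read_line(&mut reply)?;
    if bytes_read == 0 {
        // Zero bytes means the helper closed its stdout (crashed or exited).
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "helper closed its stdout",
        ));
    }
    Ok(reply.trim_end().to_string())
}

impl Helper {
    /// Spawn the helper with piped stdin/stdout. `program` and `args` are
    /// caller-supplied; no particular Python entry point is assumed here.
    pub fn spawn(program: &str, args: &[&str]) -> io::Result<Self> {
        let mut child = Command::new(program)
            .args(args)
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        let stdin = child.stdin.take().expect("stdin was piped");
        let stdout = BufReader::new(child.stdout.take().expect("stdout was piped"));
        Ok(Self { child, stdin, stdout })
    }

    /// Send one request and wait for its reply.
    pub fn request(&mut self, line: &str) -> io::Result<String> {
        exchange(&mut self.stdin, &mut self.stdout, line)
    }

    /// Close stdin so the helper sees EOF, then wait for it to exit.
    pub fn shutdown(mut self) -> io::Result<()> {
        drop(self.stdin);
        self.child.wait()?;
        Ok(())
    }
}
```

Keeping `exchange` generic means the framing logic can be unit-tested against in-memory buffers, while `Helper` only adds process lifetime management. Streaming audio would likely need a binary or length-prefixed framing instead of text lines; that choice is left open here.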

Input

Union of every modality’s input: text, mic + system-audio, image (file/camera), video (file).
✅ Audio capture (locara-microphone, locara-screencapture-audio).
❌ Image + video input pipelines.

Output

🟡 Multiplexed: audio playback queue (shared with voice-to-voice) + streaming token Channel + image/video file save.
❌ A new OmniBackend trait with an input-shape enum (more general than today’s VoiceBackend).
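One way the OmniBackend trait and its input-shape enum might be laid out is sketched below. Everything except the name `OmniBackend` is an assumption: the variant names, the `OmniOutput` event enum, the `EchoBackend` stand-in, and the use of `std::sync::mpsc` in place of whatever streaming Channel type the app actually uses.

```rust
use std::path::PathBuf;
use std::sync::mpsc::{channel, Sender};

/// Hypothetical input-shape enum: one variant per input modality.
#[derive(Debug, Clone, PartialEq)]
pub enum OmniInput {
    Text(String),
    AudioPcm { sample_rate_hz: u32, samples: Vec<f32> },
    Image(PathBuf),
    Video(PathBuf),
}

/// Hypothetical multiplexed output event: tokens, audio chunks, and saved
/// files all travel over the same channel and are routed by the receiver.
#[derive(Debug, Clone, PartialEq)]
pub enum OmniOutput {
    Token(String),
    AudioPcm { sample_rate_hz: u32, samples: Vec<f32> },
    SavedFile(PathBuf),
    Done,
}

pub trait OmniBackend {
    /// Whether this backend can consume the given input shape.
    fn accepts(&self, input: &OmniInput) -> bool;

    /// Feed one input and stream any resulting outputs to `out`.
    fn push(&mut self, input: OmniInput, out: &Sender<OmniOutput>) -> Result<(), String>;
}

/// Stand-in backend for illustration only: echoes text as one token per
/// word and rejects every other input shape.
pub struct EchoBackend;

impl OmniBackend for EchoBackend {
    fn accepts(&self, input: &OmniInput) -> bool {
        matches!(input, OmniInput::Text(_))
    }

    fn push(&mut self, input: OmniInput, out: &Sender<OmniOutput>) -> Result<(), String> {
        match input {
            OmniInput::Text(text) => {
                for word in text.split_whitespace() {
                    out.send(OmniOutput::Token(word.to_string()))
                        .map_err(|e| e.to_string())?;
                }
                out.send(OmniOutput::Done).map_err(|e| e.to_string())
            }
            other => Err(format!("unsupported input shape: {:?}", other)),
        }
    }
}

/// Push a single input and collect everything the backend emitted.
pub fn run_once<B: OmniBackend>(
    backend: &mut B,
    input: OmniInput,
) -> Result<Vec<OmniOutput>, String> {
    let (tx, rx) = channel();
    backend.push(input, &tx)?;
    drop(tx); // close the channel so the iterator below terminates
    Ok(rx.iter().collect())
}
```

A single output enum keeps the multiplexing decision on the receiving side: the consumer can forward `Token` events to the UI, `AudioPcm` chunks to the shared playback queue, and `SavedFile` paths to the file-save flow.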

Storage

❌ Weights cache (large — 30B MoE).
Per-session state holds the cross-modal context.

Interaction (IPC + SDK)

❌ omni.session_* IPC family (wider than voice.session_* — accepts image/video frames as input).
The picker UI should convey “what can this model do” without listing 20 capability flags. A single any-to-any modality declaration is the right granularity.
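To illustrate the picker idea, the sketch below expands a single declared modality into the tasks it covers, using the seven names listed earlier in this section. The `Task` enum, `covers`, and `picker_label` are hypothetical, and treating an any-to-any declaration as implying the other six is an assumption rather than an existing rule.

```rust
/// The seven task names used in this document; the enum itself is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Task {
    TextToText,
    TextToTextThinking,
    ImageTextToText,
    VideoTextToText,
    AudioTextToText,
    VoiceToVoice,
    AnyToAny,
}

impl Task {
    /// Identifier as written in the docs.
    pub fn id(self) -> &'static str {
        match self {
            Task::TextToText => "text-to-text",
            Task::TextToTextThinking => "text-to-text-thinking",
            Task::ImageTextToText => "image-text-to-text",
            Task::VideoTextToText => "video-text-to-text",
            Task::AudioTextToText => "audio-text-to-text",
            Task::VoiceToVoice => "voice-to-voice",
            Task::AnyToAny => "any-to-any",
        }
    }

    /// Tasks implied by declaring this one. Assumption: any-to-any implies
    /// all seven; every other task implies only itself.
    pub fn covers(self) -> Vec<Task> {
        match self {
            Task::AnyToAny => vec![
                Task::TextToText,
                Task::TextToTextThinking,
                Task::ImageTextToText,
                Task::VideoTextToText,
                Task::AudioTextToText,
                Task::VoiceToVoice,
                Task::AnyToAny,
            ],
            other => vec![other],
        }
    }
}

/// One-line summary a picker could show for a model's declared modality.
pub fn picker_label(declared: Task) -> String {
    let ids: Vec<&str> = declared.covers().iter().map(|t| t.id()).collect();
    format!("{} ({} tasks): {}", declared.id(), ids.len(), ids.join(", "))
}
```

With this shape, the manifest carries one declaration and the UI derives the detail on demand, instead of each model shipping its own list of flags.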

Capabilities (manifest)

All inputs declared: device.microphone, device.camera (cool-down semantics needed), fs.user-selected.
All outputs: device.speaker (cool-down), file-save to fs.user-folder.
models[] for the omni model.
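The declarations above could be checked mechanically. The sketch below assumes a manifest that reduces to a flat list of capability strings plus a models list; the `Manifest` struct, `OMNI_REQUIRED`, and `missing_capabilities` are hypothetical, and cool-down semantics are not modelled. Only the capability identifiers come from this section.

```rust
/// Hypothetical, simplified view of an app manifest.
pub struct Manifest {
    pub capabilities: Vec<String>,
    pub models: Vec<String>,
}

/// Capabilities this section says an omni app must declare
/// (three inputs first, then two outputs).
pub const OMNI_REQUIRED: [&str; 5] = [
    "device.microphone",
    "device.camera",
    "fs.user-selected",
    "device.speaker",
    "fs.user-folder",
];

/// Return the required capabilities that the manifest does not declare,
/// in the order they appear in `OMNI_REQUIRED`.
pub fn missing_capabilities(manifest: &Manifest) -> Vec<&'static str> {
    OMNI_REQUIRED
        .iter()
        .copied()
        .filter(|required| {
            !manifest
                .capabilities
                .iter()
                .any(|declared| declared.as_str() == *required)
        })
        .collect()
}

/// A manifest is usable for omni only if nothing is missing and at least
/// one model is listed.
pub fn is_omni_ready(manifest: &Manifest) -> bool {
    missing_capabilities(manifest).is_empty() && !manifest.models.is_empty()
}
```

A check like this could run at install time so that a missing declaration surfaces before a session tries to open the camera or speaker.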

Gaps

Most of it. The realistic near-term path is to run Qwen3.5-Omni through the moshi_mlx-style subprocess pattern, using the upstream Qwen demo with the full Python stack inside the helper. The alternative is to wait for audio-output support in llama.cpp’s mtmd.
