any-to-any (multimodal-omni)
HF group: Multimodal · Status: ⏳ skeleton crate
What it is
Any-input → text + audio. The “GPT-4o” shape: single model that natively handles text, image, audio, video on both sides. Distinct from a pipeline of specialists.
Highest-leverage single integration. Successfully wiring Qwen3.5-Omni gets us decent coverage for seven modalities at once: `text-to-text`, `text-to-text-thinking`, `image-text-to-text`, `video-text-to-text`, `audio-text-to-text`, `voice-to-voice`, and this one.
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Qwen2.5-Omni-7B | 7 B | 2025 | Apache-2.0 | First open omni | Audio in via mtmd; audio out gated. |
| Qwen3-Omni-30B-A3B | 30 B / 3 B active | 2025-09 | Apache-2.0 | Real-time stream w/ speech out | Thinker-Talker MoE architecture. |
| Qwen3.5-Omni | ~30 B | 2026-03 | Apache-2.0 | 113 langs ASR, 36 langs speech-out | Hybrid-Attention MoE. Best open omni today. Native turn-taking intent recognition. |
Infrastructure required
Inference
- ⏳ `crates/locara-qwen-omni` skeleton. Audio input via llama.cpp `mtmd` is planned but not wired. Audio output is blocked on llama.cpp’s `mtmd` not supporting it (PR ggml-org/llama.cpp#13784 tracks).
- 🟡 Realistically the closest path is to wire Qwen3.5-Omni through a `moshi_mlx`-style subprocess (proven pattern in `locara-moshi`).
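
The subprocess route could look roughly like the sketch below: the Rust host spawns a helper process and exchanges newline-delimited messages over its stdin and stdout. This is not taken from `locara-moshi`; the `Helper` type, the `round_trip` function, and the one-line-in, one-line-out protocol are all assumptions made for illustration.

```rust
use std::io::{self, BufRead, BufReader, Write};
use std::process::{Child, ChildStdin, ChildStdout, Command, Stdio};

/// Send one request line and read one reply line.
/// Generic over the streams so it works with a child's pipes or in-memory buffers.
fn round_trip<W: Write, R: BufRead>(
    to_helper: &mut W,
    from_helper: &mut R,
    request: &str,
) -> io::Result<String> {
    to_helper.write_all(request.as_bytes())?;
    to_helper.write_all(b"\n")?;
    to_helper.flush()?;
    let mut reply = String::new();
    if from_helper.read_line(&mut reply)? == 0 {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "helper closed stdout"));
    }
    // Strip the trailing newline (and any other trailing whitespace).
    Ok(reply.trim_end().to_string())
}

/// Hypothetical wrapper around a long-lived helper process.
struct Helper {
    child: Child,
    stdin: ChildStdin,
    stdout: BufReader<ChildStdout>,
}

impl Helper {
    /// Spawn `program` with piped stdin/stdout; the pipes live as long as the session.
    fn spawn(program: &str, args: &[&str]) -> io::Result<Self> {
        let mut child = Command::new(program)
            .args(args)
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()?;
        let stdin = child.stdin.take()
            .ok_or_else(|| io::Error::new(io::ErrorKind::BrokenPipe, "no stdin"))?;
        let stdout = child.stdout.take()
            .ok_or_else(|| io::Error::new(io::ErrorKind::BrokenPipe, "no stdout"))?;
        Ok(Self { child, stdin, stdout: BufReader::new(stdout) })
    }

    fn request(&mut self, line: &str) -> io::Result<String> {
        round_trip(&mut self.stdin, &mut self.stdout, line)
    }

    /// Close stdin so the helper sees EOF, then wait for it to exit.
    fn shutdown(mut self) -> io::Result<()> {
        drop(self.stdin);
        self.child.wait().map(|_| ())
    }
}
```

A caller would do something like `Helper::spawn("python3", &["omni_helper.py"])`, where the script name is a placeholder. A real helper streaming audio would likely need a binary or length-prefixed framing rather than text lines.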
Input
- Union of every modality’s input: text, mic + system-audio, image (file/camera), video (file).
- ✅ Audio capture (`locara-microphone`, `locara-screencapture-audio`).
- ❌ Image + video input pipelines.
Output
- 🟡 Multiplexed: audio playback queue (shared with voice-to-voice) + streaming token Channel + image/video file save.
- ❌ A new `OmniBackend` trait with an input-shape enum (more general than today’s `VoiceBackend`).
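
One possible shape for that trait is sketched below. Only the names `OmniBackend` and `VoiceBackend` come from this page; the enum variants, method signatures, the `EchoBackend` toy, and the `route` function are assumptions for illustration. For brevity the output side covers only tokens and audio, not file saves.

```rust
use std::path::PathBuf;
use std::sync::mpsc::Sender;

/// Input shapes an omni session might accept (hypothetical variants).
#[derive(Debug, Clone, PartialEq)]
enum OmniInput {
    Text(String),
    AudioPcm { sample_rate_hz: u32, samples: Vec<f32> },
    ImageFile(PathBuf),
    VideoFile(PathBuf),
}

/// Outputs a backend might emit (hypothetical variants).
#[derive(Debug, Clone, PartialEq)]
enum OmniOutput {
    Token(String),
    AudioPcm { sample_rate_hz: u32, samples: Vec<f32> },
}

trait OmniBackend {
    /// Whether this backend can take the given input shape.
    fn accepts(&self, input: &OmniInput) -> bool;
    /// Feed one input and collect whatever outputs it produces.
    fn push(&mut self, input: OmniInput) -> Result<Vec<OmniOutput>, String>;
}

/// Toy backend: accepts text only and echoes it back word by word.
struct EchoBackend;

impl OmniBackend for EchoBackend {
    fn accepts(&self, input: &OmniInput) -> bool {
        matches!(input, OmniInput::Text(_))
    }

    fn push(&mut self, input: OmniInput) -> Result<Vec<OmniOutput>, String> {
        match input {
            OmniInput::Text(t) => Ok(t
                .split_whitespace()
                .map(|w| OmniOutput::Token(w.to_string()))
                .collect()),
            other => Err(format!("unsupported input: {:?}", other)),
        }
    }
}

/// Multiplex outputs onto a token channel and an audio queue.
/// Returns how many outputs were delivered to a live receiver.
fn route(outputs: Vec<OmniOutput>, tokens: &Sender<String>, audio: &Sender<Vec<f32>>) -> usize {
    let mut delivered = 0;
    for out in outputs {
        let ok = match out {
            OmniOutput::Token(t) => tokens.send(t).is_ok(),
            OmniOutput::AudioPcm { samples, .. } => audio.send(samples).is_ok(),
        };
        if ok {
            delivered += 1;
        }
    }
    delivered
}
```

Keeping `accepts` separate from `push` would let a caller reject an unsupported frame before paying for any encoding work, which is one reason to prefer an enum over one method per modality.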
Storage
- ❌ Weights cache (large — 30B MoE).
- Per-session state holds the cross-modal context.
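
To put "large" in rough numbers, the sketch below estimates raw weight size from parameter count and bits per weight. The 30 B figure is from the table above; the bit widths are illustrative assumptions, and the estimate ignores quantization metadata, tokenizer files, and any separate audio or vision encoders.

```rust
/// Approximate bytes needed to store `params` weights at `bits_per_weight`.
fn weight_bytes(params: u64, bits_per_weight: u64) -> u64 {
    params * bits_per_weight / 8
}

/// Convert bytes to decimal gigabytes (1 GB = 1e9 bytes).
fn to_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}

/// Estimates for a 30 B-parameter model at a few common bit widths.
fn estimates_gb() -> Vec<(u64, f64)> {
    [16u64, 8, 4]
        .iter()
        .map(|&bits| (bits, to_gb(weight_bytes(30_000_000_000, bits))))
        .collect()
}
```

Under these assumptions the cache needs on the order of 60 GB at 16 bits, 30 GB at 8 bits, and 15 GB at 4 bits per weight. Note that an MoE model still stores all experts on disk even though only a fraction is active per token.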
Interaction (IPC + SDK)
- ❌ `omni.session_*` IPC family (wider than `voice.session_*`: accepts image/video frames as input).
- Picker UI exposes “what can this model do” without listing 20 capability flags. The `any-to-any` modality declaration is the right grain.
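
As a sketch of why a single modality declaration can replace a long flag list, the picker could expand `any-to-any` into the modalities it covers. The `Modality` enum and the coverage rule below are assumptions; the seven entries mirror the list in "What it is".

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Modality {
    TextToText,
    TextToTextThinking,
    ImageTextToText,
    VideoTextToText,
    AudioTextToText,
    VoiceToVoice,
    AnyToAny,
}

impl Modality {
    /// Modalities a model declaring `self` is assumed to serve.
    /// Only `AnyToAny` expands; every other modality covers just itself.
    fn covers(self) -> Vec<Modality> {
        use Modality::*;
        match self {
            AnyToAny => vec![
                TextToText,
                TextToTextThinking,
                ImageTextToText,
                VideoTextToText,
                AudioTextToText,
                VoiceToVoice,
                AnyToAny,
            ],
            other => vec![other],
        }
    }
}

/// Whether the picker may offer a model with `declared` for a task needing `wanted`.
fn picker_can_offer(declared: Modality, wanted: Modality) -> bool {
    declared.covers().contains(&wanted)
}
```

With this rule a model manifest declares one value, and the UI derives the rest instead of reading per-capability flags.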
Capabilities (manifest)
- All inputs declared: `device.microphone`, `device.camera` (cool-down semantics needed), `fs.user-selected`.
- All outputs: `device.speaker` (cool-down), file-save to `fs.user-folder`.
- `models[]` for the omni model.
Gaps
Most of it. Realistically, the closest path is to wire
Qwen3.5-Omni through the `moshi_mlx`-style subprocess pattern,
running the upstream Qwen demo (full Python stack) inside the helper.
The alternative is to wait for audio-output support in llama.cpp’s `mtmd`.
See also
- `voice-to-voice` (superset)
- `image-text-to-text`
- `video-text-to-text`
- `audio-text-to-text`
- Crates: `locara-qwen-omni`, `locara-moshi` (pattern reference)
- Index: `../modalities-and-models-survey.md`