audio-text-to-text
HF group: Multimodal · Status: ❌ not built
What it is
Audio + text prompt → text answer. Examples: “transcribe this with timestamps and label speakers”, “summarize this meeting recording”, “did the speaker mention X?”.
Distinct from speech-to-text (pure transcription) because the text prompt steers what gets returned: a single end-to-end model interprets the audio and answers a question about it.
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Qwen2.5-Omni-7B (audio in only) | 7 B | 2025 | Apache-2.0 | Strong audio Q&A | llama.cpp’s mtmd supports audio input. |
| Qwen3-Omni-30B-A3B | 30 B / 3 B active | 2025-09 | Apache-2.0 | Best open audio-text-to-text | Real-time streaming, also handles voice-to-voice. |
| MOSS-Audio | ~8 B | 2026-04 | Apache-2.0 | Speech / sound / music / time-aware reasoning | OpenMOSS, fresh. |
| Whisper + LLM (composed) | varies | n/a | MIT + LLM | Decent | Pipeline: ASR transcript → LLM. Works today. |
Infrastructure required
Inference
- ❌ VLM-style multimodal LLM inference path (audio chunks fused with text tokens). llama.cpp’s `mtmd` covers some.
- ✅ Pipeline fallback works today: `locara-whisper` → `locara-llama` composed in app code.
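A minimal sketch of the pipeline fallback, assuming an ASR stage and a text LLM stage behind two small traits. `Transcriber`, `ChatModel`, `answer_about_audio`, and the stub types are hypothetical names for illustration; they are not the actual APIs of `locara-whisper` or `locara-llama`.

```rust
/// Hypothetical stand-in for the ASR stage (for example, something backed by locara-whisper).
trait Transcriber {
    fn transcribe(&self, pcm: &[f32], sample_rate: u32) -> Result<String, String>;
}

/// Hypothetical stand-in for the text LLM stage (for example, something backed by locara-llama).
trait ChatModel {
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Wrap the transcript and the user's question into a single text prompt.
fn build_prompt(transcript: &str, question: &str) -> String {
    format!(
        "Transcript of the recording:\n{}\n\nQuestion: {}\nAnswer:",
        transcript, question
    )
}

/// Compose ASR, then LLM. The LLM only ever sees the transcript text,
/// never the audio itself.
fn answer_about_audio<T: Transcriber, M: ChatModel>(
    asr: &T,
    llm: &M,
    pcm: &[f32],
    sample_rate: u32,
    question: &str,
) -> Result<String, String> {
    let transcript = asr.transcribe(pcm, sample_rate)?;
    llm.complete(&build_prompt(&transcript, question))
}

/// Stub ASR that returns a fixed transcript and rejects empty input.
struct FakeAsr;

impl Transcriber for FakeAsr {
    fn transcribe(&self, pcm: &[f32], _sample_rate: u32) -> Result<String, String> {
        if pcm.is_empty() {
            return Err("empty audio".to_string());
        }
        Ok("we ship on friday".to_string())
    }
}

/// Stub LLM that echoes its prompt, so the composed prompt can be inspected.
struct EchoLlm;

impl ChatModel for EchoLlm {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("echo: {}", prompt))
    }
}
```

Because the audio is reduced to text before the second stage runs, anything the transcript does not carry (tone, music, background sounds) is unavailable to the LLM in this composition.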
Input
- ✅ Audio capture / file load (shared with speech-to-text: `locara-microphone`, file picker).
- Plain text prompt.
Output
- ✅ Streaming token Channel (same shape as text-to-text).
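As a rough illustration of the streaming shape, the sketch below pushes tokens through a standard-library channel and ends with an explicit end-of-stream event. `StreamEvent`, `stream_answer`, and `collect_answer` are hypothetical names; the real Channel type used for text-to-text is not shown in this document and may differ.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical event type: one variant per token, plus an end marker.
enum StreamEvent {
    Token(String),
    Done,
}

/// Emit the given tokens from a background thread, then signal completion.
fn stream_answer(tokens: Vec<String>) -> mpsc::Receiver<StreamEvent> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for token in tokens {
            // Stop early if the consumer has dropped the receiver.
            if tx.send(StreamEvent::Token(token)).is_err() {
                return;
            }
        }
        let _ = tx.send(StreamEvent::Done);
    });
    rx
}

/// Drain the stream and concatenate tokens into the full answer.
fn collect_answer(rx: mpsc::Receiver<StreamEvent>) -> String {
    let mut out = String::new();
    for event in rx {
        match event {
            StreamEvent::Token(t) => out.push_str(&t),
            StreamEvent::Done => break,
        }
    }
    out
}
```

The consumer side is identical whether the tokens came from a text-only model or from an audio-conditioned one, which is the sense in which the output shape is shared.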
Storage
- ✅ Weights via `locara-models::Cache`.
- ❌ Audio-aware session state for multi-turn audio Q&A (not built; pipeline fallback drops audio after each turn).
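One possible shape for audio-aware session state, sketched under the assumption that clips are kept for the lifetime of the session and that turns refer to them by index. Every type and method here (`AudioSession`, `AudioClip`, `Turn`, `clips_in_context`) is hypothetical; nothing like it exists yet.

```rust
/// Decoded audio retained for the whole session (hypothetical type).
struct AudioClip {
    sample_rate: u32,
    pcm: Vec<f32>,
}

impl AudioClip {
    fn duration_secs(&self) -> f32 {
        self.pcm.len() as f32 / self.sample_rate as f32
    }
}

#[derive(Debug, PartialEq)]
enum Role {
    User,
    Assistant,
}

struct Turn {
    role: Role,
    text: String,
    /// Indices into `AudioSession::clips` that this turn refers to.
    clip_ids: Vec<usize>,
}

#[derive(Default)]
struct AudioSession {
    clips: Vec<AudioClip>,
    turns: Vec<Turn>,
}

impl AudioSession {
    /// Store a clip and return its id for use in later turns.
    fn attach_clip(&mut self, sample_rate: u32, pcm: Vec<f32>) -> usize {
        self.clips.push(AudioClip { sample_rate, pcm });
        self.clips.len() - 1
    }

    /// Record a user turn; rejects references to clips that were never attached.
    fn push_user(&mut self, text: &str, clip_ids: &[usize]) -> Result<(), String> {
        if let Some(bad) = clip_ids.iter().find(|&&id| id >= self.clips.len()) {
            return Err(format!("unknown clip id {}", bad));
        }
        self.turns.push(Turn {
            role: Role::User,
            text: text.to_string(),
            clip_ids: clip_ids.to_vec(),
        });
        Ok(())
    }

    fn push_assistant(&mut self, text: &str) {
        self.turns.push(Turn {
            role: Role::Assistant,
            text: text.to_string(),
            clip_ids: Vec::new(),
        });
    }

    /// Clips referenced by any turn so far, sorted and deduplicated.
    /// A follow-up question without new audio still has these in context.
    fn clips_in_context(&self) -> Vec<usize> {
        let mut ids: Vec<usize> = self
            .turns
            .iter()
            .flat_map(|t| t.clip_ids.iter().copied())
            .collect();
        ids.sort_unstable();
        ids.dedup();
        ids
    }
}
```

The point of the sketch is the contrast with the pipeline fallback: a follow-up such as "who spoke first?" can still resolve against the original clip, because the session keeps the audio rather than only the transcript.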
Interaction (IPC + SDK)
- ❌ `audio.qa` IPC, OR extend `llm.chat` to accept audio chunks.
- Today: apps compose `transcribe.from_pcm` + `llm.chat` themselves.
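To make the "extend `llm.chat`" option concrete, here is a sketch of a chat message whose content is a list of parts, with audio as one more part type. The types (`ContentPart`, `ChatMessage`, `Backend`) and the routing rule are assumptions for illustration, not the existing `llm.chat` schema.

```rust
/// One piece of a chat message. `Audio` is the hypothetical new variant.
#[derive(Debug, Clone, PartialEq)]
enum ContentPart {
    Text(String),
    Audio { sample_rate: u32, pcm: Vec<f32> },
}

#[derive(Debug, Clone, PartialEq)]
struct ChatMessage {
    role: String,
    parts: Vec<ContentPart>,
}

impl ChatMessage {
    /// A text-only message is just a message with a single text part.
    fn user_text(text: &str) -> Self {
        ChatMessage {
            role: "user".to_string(),
            parts: vec![ContentPart::Text(text.to_string())],
        }
    }

    /// Audio first, then the question that steers the answer.
    fn user_audio_question(sample_rate: u32, pcm: Vec<f32>, question: &str) -> Self {
        ChatMessage {
            role: "user".to_string(),
            parts: vec![
                ContentPart::Audio { sample_rate, pcm },
                ContentPart::Text(question.to_string()),
            ],
        }
    }

    fn has_audio(&self) -> bool {
        self.parts
            .iter()
            .any(|p| matches!(p, ContentPart::Audio { .. }))
    }
}

#[derive(Debug, PartialEq)]
enum Backend {
    TextLlm,
    OmniModel,
}

/// Route to an audio-capable model only when some message carries audio.
fn pick_backend(messages: &[ChatMessage]) -> Backend {
    if messages.iter().any(|m| m.has_audio()) {
        Backend::OmniModel
    } else {
        Backend::TextLlm
    }
}
```

Under this design, existing text-only callers would keep working unchanged, whereas a separate `audio.qa` call would keep the chat schema untouched at the cost of a second entry point.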
Capabilities (manifest)
`capabilities.device.microphone` (live) or `fs.user-selected` (file). `capabilities.models[]` for the omni model.
Gaps
First-class support requires the same audio-input plumbing planned for any-to-any. The two will probably ship together (a Qwen3.5-Omni subprocess covers both).
See also
- speech-to-text
- any-to-any (superset)
- Crates: `locara-microphone`, `locara-whisper`, `locara-llama`
- Index: ../modalities-and-models-survey.md