`audio-text-to-text`

HF group: Multimodal · Status: ❌ not built

What it is

Audio + text prompt → text answer. Examples: “transcribe this with timestamps and label speakers”, “summarize this meeting recording”, “did the speaker mention X?”.

Distinct from speech-to-text (pure transcription) because the text prompt steers what gets returned — a single end-to-end model interprets audio and answers a question about it.

Open-weight models

Model	Params	Released	License	Quality	Notes
Qwen2.5-Omni-7B (audio in only)	7 B	2025	Apache-2.0	Strong audio Q&A	llama.cpp’s `mtmd` supports audio input.
Qwen3-Omni-30B-A3B	30 B / 3 B active	2026-Q1	Apache-2.0	Best open audio-text-to-text	Real-time streaming, also handles voice-to-voice.
MOSS-Audio	~8 B	2026-04	Apache-2.0	Speech / sound / music / time-aware reasoning	OpenMOSS, fresh.
Whisper + LLM (composed)	varies	n/a	MIT + LLM	Decent	Pipeline: ASR transcript → LLM. Works today.

Infrastructure required

Inference

❌ VLM-style multimodal LLM inference path (audio chunks fused with text tokens). llama.cpp’s mtmd covers some.
✅ Pipeline fallback works today: locara-whisper → locara-llama composed in app code.

Input

✅ Audio capture / file load (shared with speech-to-text — locara-microphone, file picker).
Plain text prompt.

Output

✅ Streaming token Channel (same shape as text-to-text).

Storage

✅ Weights via locara-models::Cache.
❌ Audio-aware session state for multi-turn audio Q&A (not built; pipeline fallback drops audio after each turn).

Interaction (IPC + SDK)

❌ audio.qa IPC, OR extend llm.chat to accept audio chunks.
Today: apps compose transcribe.from_pcm + llm.chat themselves.

Capabilities (manifest)

capabilities.device.microphone (live) or fs.user-selected (file).
capabilities.models[] for the omni model.

Gaps

First-class support requires the same audio-input plumbing planned for any-to-any. Probably ships together (Qwen3.5-Omni subprocess covers both).

audio-text-to-text