Locara

audio-text-to-text

HF group: Multimodal · Status: ❌ not built

What it is

Audio + text prompt → text answer. Examples: “transcribe this with timestamps and label speakers”, “summarize this meeting recording”, “did the speaker mention X?”.

Distinct from speech-to-text (pure transcription) because the text prompt steers what gets returned — a single end-to-end model interprets audio and answers a question about it.

Open-weight models

ModelParamsReleasedLicenseQualityNotes
Qwen2.5-Omni-7B (audio in only)7 B2025Apache-2.0Strong audio Q&Allama.cpp’s mtmd supports audio input.
Qwen3-Omni-30B-A3B30 B / 3 B active2026-Q1Apache-2.0Best open audio-text-to-textReal-time streaming, also handles voice-to-voice.
MOSS-Audio~8 B2026-04Apache-2.0Speech / sound / music / time-aware reasoningOpenMOSS, fresh.
Whisper + LLM (composed)variesn/aMIT + LLMDecentPipeline: ASR transcript → LLM. Works today.

Infrastructure required

Inference

  • ❌ VLM-style multimodal LLM inference path (audio chunks fused with text tokens). llama.cpp’s mtmd covers some.
  • ✅ Pipeline fallback works today: locara-whisperlocara-llama composed in app code.

Input

  • ✅ Audio capture / file load (shared with speech-to-textlocara-microphone, file picker).
  • Plain text prompt.

Output

  • ✅ Streaming token Channel (same shape as text-to-text).

Storage

  • ✅ Weights via locara-models::Cache.
  • ❌ Audio-aware session state for multi-turn audio Q&A (not built; pipeline fallback drops audio after each turn).

Interaction (IPC + SDK)

  • audio.qa IPC, OR extend llm.chat to accept audio chunks.
  • Today: apps compose transcribe.from_pcm + llm.chat themselves.

Capabilities (manifest)

  • capabilities.device.microphone (live) or fs.user-selected (file).
  • capabilities.models[] for the omni model.

Gaps

First-class support requires the same audio-input plumbing planned for any-to-any. Probably ships together (Qwen3.5-Omni subprocess covers both).

See also