Local voice-to-voice and unified-modality SLMs (research note, 2026-Q1)
What this is: State of the art for on-device voice agents, focused on what Locara should add next after its existing text-LLM + embeddings + Whisper + Apple Vision OCR stack.
Status: Reference document. Pair with spec/04-modalities.md (the modality schema) and the eventual locara-voice crate.
Most relevant to Locara: Pipeline-by-default in v1 with Apple-API fallbacks; unified-omni models (Moshi-MLX, NVIDIA persona*, Qwen-Omni) reachable via the same SDK shape once a concrete backend lands.
Implementation status (2026-05-03 update)
The integration scaffolding is in place; what’s missing is a concrete VoiceBackend implementation for any specific model. Concretely:
- Trait (`crates/locara-core/src/voice.rs`): `VoiceBackend` defines `start_session` / `push_audio` / `event_stream` / `stop_session`. Events: `Audio { samples, sample_rate }`, `AssistantPartial`, `UserPartial`, `TurnEnd`, `Error`. Tested.
- Stub (`StubOmniVoiceBackend` in the same file): refuses every call with a structured pointer to this note + BACKLOG. The runtime’s `voice.session_*` IPC dispatches here when no concrete backend is wired, so apps using `voice.session({omni: ...})` get a clear failure mode rather than a generic “not implemented”.
- IPC (`crates/locara-runtime/src/tauri_plugin.rs`): four commands: `voice.session_start`, `voice.session_push`, `voice.session_events`, `voice.session_stop`. Capability-checked. Wired into the registered command list. Reflected in `apps/voice`’s `permissions/locara.toml`.
- SDK (`packages/sdk/src/voice.ts`): `voice.session({stt, llm, tts})` for the pipeline form (works today, composes `transcribe.stream` + `llm.chatStream` + a TTS sink); `voice.session({omni: {model, ...}})` for the omni form (calls the IPC; with no backend wired, surfaces the stub’s error as a `VoiceEvent.error`).
- App (`apps/voice/`): runs the pipeline form with sentence-level streaming TTS so the perceived latency approximates a true voice-to-voice model.
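The sentence-level streaming mentioned for `apps/voice/` can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: `SentenceChunker` and its boundary rule (a `.`, `!` or `?` followed by whitespace) are assumptions, and a real implementation would need to handle abbreviations such as "e.g." that this rule splits incorrectly.

```typescript
// Hypothetical helper: buffers streamed LLM tokens and releases whole sentences,
// so a TTS sink can start speaking before the full reply has been generated.
class SentenceChunker {
  private buffer = "";

  // Feed one streamed token; returns any sentences completed by it.
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    // Assumed boundary rule: ., ! or ? followed by whitespace.
    const boundary = /[.!?]\s+/;
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + 1; // keep the punctuation mark
      sentences.push(this.buffer.slice(0, end).trim());
      this.buffer = this.buffer.slice(match.index + match[0].length);
      match = boundary.exec(this.buffer);
    }
    return sentences;
  }

  // Call at end of turn to release whatever text is left.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }
}

const chunker = new SentenceChunker();
const spoken: string[] = [];
for (const token of ["Hel", "lo there. ", "How are", " you? I am", " fine"]) {
  spoken.push(...chunker.push(token)); // each entry would be handed to the TTS sink
}
spoken.push(...chunker.flush());
```

The first sentence becomes available as soon as its closing punctuation arrives, which is what lets the pipeline start audio output while the LLM is still generating.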
To plug in a concrete model — e.g. nvidia/personaplex-7b-v1, kyutai/moshiko, Qwen/Qwen2.5-Omni-7B — write a new crate (locara-personaplex, locara-moshi, etc.) that:
- Loads the model weights via Candle, MLX-rust, or whatever the model’s native runtime is.
- Implements `VoiceBackend` from `locara-core`.
- Apps register it on `LocaraState` via `with_voice_backend(...)` at startup.
No SDK or app changes required when this lands — the surface is already stable.
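To make the plug-in contract concrete, here is a rough TypeScript mirror of the flow described above: a backend interface, a stub that refuses every call, and a runtime state that dispatches to the stub until a backend is registered. The real contract is the Rust `VoiceBackend` trait; every name below (`VoiceBackendLike`, `StubBackend`, `EchoBackend`, `RuntimeState`, `withVoiceBackend`) is hypothetical, and the event stream is collapsed into returned arrays to keep the sketch synchronous.

```typescript
// Illustrative TypeScript mirror of the Rust contract; not the real API.
type BackendEvent =
  | { type: "audio"; samples: Float32Array; sampleRate: number }
  | { type: "assistant-partial"; text: string }
  | { type: "user-partial"; text: string }
  | { type: "turn-end" }
  | { type: "error"; message: string };

interface VoiceBackendLike {
  startSession(model: string): BackendEvent[];
  pushAudio(samples: Float32Array): BackendEvent[];
  stopSession(): void;
}

// Default backend: every call yields a structured error instead of audio.
class StubBackend implements VoiceBackendLike {
  startSession(model: string): BackendEvent[] {
    return [{ type: "error", message: "no voice backend wired for " + model }];
  }
  pushAudio(_samples: Float32Array): BackendEvent[] {
    return [{ type: "error", message: "no voice backend wired" }];
  }
  stopSession(): void {}
}

// Toy backend that echoes input audio, standing in for a real model crate.
class EchoBackend implements VoiceBackendLike {
  startSession(_model: string): BackendEvent[] {
    return [];
  }
  pushAudio(samples: Float32Array): BackendEvent[] {
    // 24000 is an illustrative sample rate, not a requirement of the trait.
    return [{ type: "audio", samples, sampleRate: 24000 }, { type: "turn-end" }];
  }
  stopSession(): void {}
}

// Hypothetical runtime state: falls back to the stub until a backend is registered.
class RuntimeState {
  private backend: VoiceBackendLike = new StubBackend();
  withVoiceBackend(backend: VoiceBackendLike): RuntimeState {
    this.backend = backend;
    return this;
  }
  sessionStart(model: string): BackendEvent[] {
    return this.backend.startSession(model);
  }
  sessionPush(samples: Float32Array): BackendEvent[] {
    return this.backend.pushAudio(samples);
  }
}

const unwired = new RuntimeState().sessionStart("moshi-7b-mlx-int4");
const wired = new RuntimeState()
  .withVoiceBackend(new EchoBackend())
  .sessionPush(new Float32Array([0.1, -0.1]));
```

The point of the shape is that callers never branch on whether a backend exists: an unwired runtime answers with an `error` event through the same channel a real backend would use for audio.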
1. What “voice-to-voice” actually means
Two architectures compete in this space, and the gap between them defines almost every design tradeoff.
End-to-end (E2E) speech models ingest audio tokens directly and emit audio tokens directly, with the language model reasoning over a multi-stream representation that includes both text and acoustic codebooks. There is no transcript on the critical path. Moshi, Mini-Omni, Mini-Omni2, GLM-4-Voice, Step-Audio, and Qwen2.5-Omni / Qwen3-Omni are the public examples. The breakthrough that makes them viable is treating speech as a sequence of neural codec tokens (Mimi for Moshi, SNAC for Mini-Omni) interleaved with text tokens — the LM predicts both streams in parallel, so first-audio-out can happen before a full text response is generated.
Pipeline architectures (ASR → LLM → TTS) are the conservative choice — three independent models hand text between them. Quality of each component is independently tunable, latency stacks (~150 ms VAD + ~100–300 ms ASR + ~LLM-TTFT + ~80–200 ms TTS first-chunk = typically 600–1200 ms first-audio), and total RAM is the sum of three models. Every shipping local voice assistant in early 2026 (sherpa-onnx demos, Open WebUI voice, sokuji-tsuyaku, Granola’s local mode) is still pipeline-based — E2E remains research-grade for production deployment.
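The stacking of pipeline latency can be checked with a small budget calculation. The VAD, ASR and TTS ranges below come from the paragraph above; the LLM time-to-first-token range (270–550 ms) is an assumption, back-solved so that the total matches the quoted 600–1200 ms, and will differ by model and hardware.

```typescript
interface StageLatency {
  stage: string;
  minMs: number;
  maxMs: number;
}

const pipeline: StageLatency[] = [
  { stage: "VAD endpointing", minMs: 150, maxMs: 150 },
  { stage: "ASR finalization", minMs: 100, maxMs: 300 },
  { stage: "LLM time-to-first-token", minMs: 270, maxMs: 550 }, // assumed range
  { stage: "TTS first chunk", minMs: 80, maxMs: 200 },
];

// Stages run back to back, so best and worst cases are plain sums.
function firstAudioLatency(stages: StageLatency[]): { minMs: number; maxMs: number } {
  let minMs = 0;
  let maxMs = 0;
  for (const s of stages) {
    minMs += s.minMs;
    maxMs += s.maxMs;
  }
  return { minMs, maxMs };
}

const budget = firstAudioLatency(pipeline); // { minMs: 600, maxMs: 1200 }
```

Because the stages are sequential, shaving any single stage helps only linearly; the E2E designs avoid the sum altogether by removing the hand-offs.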
The honest summary as of Q1 2026: E2E exists and works, but production-grade open weights are limited to Moshi (Kyutai) for true full-duplex, and Qwen2.5-Omni / Qwen3-Omni for half-duplex. Mini-Omni is a research artifact; GLM-4-Voice and Step-Audio are usable but Chinese-leaning. Nearly every local voice app still ships a pipeline.
2. Specific local-runnable models
End-to-end / unified
| Model | Params | Quant | RAM (real) | Local runtime | Status Q1 2026 |
|---|---|---|---|---|---|
| Moshi (Moshiko/Moshika) | 7B | int4 / int8 / bf16 | ~6 GB int4, ~14 GB bf16 | PyTorch, MLX (int4/int8/bf16), Rust/Candle | Code MIT / Apache-2.0, weights CC-BY 4.0; ~200 ms practical latency; full-duplex; English-only |
| Kyutai TTS / Pocket TTS | 100M (Pocket) / 1B+ (full) | fp16 | <1 GB (Pocket) | Rust/Candle, MLX | Pocket TTS released Jan 2026, CPU real-time |
| Mini-Omni / Mini-Omni2 | 0.5B base + audio heads | fp16 only | ~3 GB | PyTorch reference | Research-grade; no GGUF; no real maintained Apple Silicon path |
| GLM-4-Voice | 9B (LM) + tokenizer + decoder | fp16/int4 | ~12 GB int4 | PyTorch | Strong CN/EN; Apache-2.0; no GGUF audio output |
| Step-Audio | 130B teacher / smaller distilled variants | int4 | varies | PyTorch | Mostly cloud; smaller variants exist but weights heavy |
| Qwen2.5-Omni | 3B and 7B | Q4_K_M (4.7 GB), Q8_0 (8.1 GB) on 7B | 5–10 GB | llama.cpp via llama-mtmd-cli (audio IN only), PyTorch (full) | GGUF supports audio + vision input; audio output is NOT supported in llama.cpp as of early 2026 |
| Qwen3-Omni | 30B-A3B (MoE, ~3B active) | varies | larger | PyTorch | End-to-end omni; same llama.cpp limitation expected to persist |
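As a sanity check on the RAM column, weight memory can be estimated from parameter count and bits per weight. The sketch below covers weights only; reading the gap between the estimate and the table's "real" figure as codec, KV cache and runtime buffers is an assumption, not a measurement.

```typescript
// Weights-only estimate in GB (decimal): params (billions) * bits per weight / 8.
function weightGigabytes(paramsBillions: number, bitsPerWeight: number): number {
  return (paramsBillions * bitsPerWeight) / 8;
}

const moshiBf16 = weightGigabytes(7, 16); // 14 GB, matches the table's bf16 figure
const moshiInt8 = weightGigabytes(7, 8); // 7 GB
const moshiInt4 = weightGigabytes(7, 4); // 3.5 GB

// The table lists ~6 GB for int4, so roughly 2.5 GB is not weights.
const int4NonWeightGb = 6 - moshiInt4;
```

Mixed-precision GGUF quantizations such as Q4_K_M use more than 4 bits per weight on average, which is why their file sizes sit above this simple estimate.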
Pipeline components
STT (Whisper family via whisper.cpp, all Core ML / ANE accelerated)
| Variant | Disk | RAM | Notes |
|---|---|---|---|
| `tiny` / `tiny.en` | 75 MB | ~1 GB | Real-time on RPi-class hardware |
| `base` / `base.en` | 142 MB | ~1 GB | Best speed/quality for live captions |
| `small` / `small.en` | 466 MB | ~2 GB | The sweet spot for most apps |
| `medium` / `medium.en` | 1.5 GB | ~5 GB | Diminishing returns vs. `small.en` for English |
| `large-v3` / `large-v3-turbo` | 1.5–3 GB | 6–10 GB | Best quality; turbo variants designed for streaming |
Whisper.cpp + Core ML encoder gives roughly 3× speedup vs CPU on Apple Silicon (ANE). Locara already ships this.
Newer ASR options worth tracking
- Apple `SpeechAnalyzer` (macOS Tahoe / iOS 26, June 2025): replaces `SFSpeechRecognizer`, fully on-device, modular (`SpeechTranscriber` + `SpeechDetector`), tuned for long-form. Independent benchmarks (Yap, MacRumors test with a 34-min file): ~45 s vs MacWhisper Large-V3-Turbo at ~101 s, i.e. ~55% less wall-clock time (about 2.2× faster) with comparable quality. Requires macOS 26+ (Tahoe).
- Moshi STT (extracted Mimi + ASR head): streaming, sub-100 ms partial latency. New in 2025; not as battle-tested as Whisper but real-time-first by design.
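The benchmark figures quoted for `SpeechAnalyzer` can be read three ways, and it helps to keep them apart. The arithmetic below uses only the numbers given above (34-minute file, ~45 s vs ~101 s); the variable names are illustrative.

```typescript
const audioSeconds = 34 * 60; // 34-minute test file
const speechAnalyzerSeconds = 45;
const whisperTurboSeconds = 101;

// Fraction of wall-clock time saved: the "~55% faster" headline figure.
const timeSavedFraction = (whisperTurboSeconds - speechAnalyzerSeconds) / whisperTurboSeconds;

// Throughput ratio: how many times faster SpeechAnalyzer finished.
const speedup = whisperTurboSeconds / speechAnalyzerSeconds;

// Real-time factors: seconds of audio processed per second of compute.
const speechAnalyzerRtf = audioSeconds / speechAnalyzerSeconds;
const whisperTurboRtf = audioSeconds / whisperTurboSeconds;

const summary =
  Math.round(timeSavedFraction * 100) + "% less time, " +
  speedup.toFixed(2) + "x faster, " +
  Math.round(speechAnalyzerRtf) + "x vs " +
  Math.round(whisperTurboRtf) + "x real time";
```

Both engines are far faster than real time on this batch workload, so the difference matters mostly for long recordings rather than live conversation.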
TTS — local options
| Model | Params / size | RAM | Quality | Speed / notes |
|---|---|---|---|---|
| Piper | <50 MB per voice | <500 MB | Good (parametric) | Real-time on CPU; edge-friendly |
| Kokoro-82M | 82M | ~500 MB | Very good | <0.3 s for short utterance; ~36× real-time on free Colab GPU; ~30–45 s for 1500 words on M1 Air 8 GB |
| F5-TTS | ~330M | ~2 GB | Excellent (zero-shot voice clone) | Slower than Kokoro |
| XTTS-v2 | ~750M | ~3 GB | Excellent (multilingual + clone) | Coqui project archived; community fork |
| Orpheus 3B | 3B | ~6 GB int4 | Best emotional range (laugh, cry, whisper) | Heaviest of the bunch; needs GPU/ANE for real-time |
| Apple `AVSpeechSynthesizer` | (system) | 0 (system-managed) | Good with neural voices; great with Personal Voice | Real-time, free, system audio routing handled |
| Kyutai Pocket TTS | 100M | <1 GB | Good | CPU real-time |
llama.cpp / GGUF compatibility — concrete answer
llama.cpp supports audio input for Qwen2.5-Omni via llama-mtmd-cli and llama-server (PR #13784 merged mid-2025). It does not support audio output / speech generation for any unified-modality model — the codec-decoder side has not been ported. So Locara cannot get a single-binary llama.cpp pipeline that does end-to-end voice. Pipeline is forced, regardless of which omni-model you pick, if llama.cpp is the only inference engine.
3. Latency and quality trade-offs
The Moshi paper (arXiv 2410.00037) is the canonical reference on E2E latency and explains the win clearly: a 7B language model outputs discrete acoustic codec tokens (Mimi, 12.5 Hz, ~1100 bps) in parallel with text tokens via an “Inner Monologue” stream. Theoretical latency is ~160 ms; measured ~200 ms. Two enabling moves:
- Streaming neural audio codec: Mimi is causal and works on 80 ms frames (12.5 Hz), so the first audio tokens can be played before the full response is decided.
- Multi-stream parallel decoding — text and audio share one transformer; speaker turns are not explicit (no “now my turn” handoff), so the model handles barge-in / interruption natively.
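The headline Moshi numbers follow from the codec parameters. The sketch below reproduces them; the frame rate and bitrate are quoted above, while the 8 codebooks of 2048 entries and the one-frame acoustic delay are taken from the Moshi paper as I understand it and should be checked against arXiv 2410.00037 before being relied on.

```typescript
const frameRateHz = 12.5; // Mimi frame rate
const codebooks = 8; // assumed per the Moshi paper
const bitsPerCode = 11; // log2(2048) for 2048-entry codebooks (assumed)

// One frame of audio covers 1000 / 12.5 = 80 ms.
const frameMs = 1000 / frameRateHz;

// Bitrate: frames per second * codebooks * bits per code = 1100 bps.
const bitrateBps = frameRateHz * codebooks * bitsPerCode;

// Theoretical latency: one frame to encode plus one frame of acoustic delay.
const acousticDelayMs = frameMs;
const theoreticalLatencyMs = frameMs + acousticDelayMs; // 160 ms
```

The ~200 ms measured figure is this 160 ms floor plus compute time per step, which is why the frame rate, not model size alone, sets the lower bound on responsiveness.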
Mini-Omni (arXiv 2408.16725) takes a different shortcut — the “Any Model Can Talk” framework adds audio output heads to a pretrained text LM, training only the new heads while preserving language quality. They use SNAC (8 codebooks, hundreds of tokens/sec) with text-instructed delayed parallel generation to avoid drowning the LM in long codebook sequences. Their measured TTFT for audio is ~300 ms — slower than Moshi but with a much smaller base model.
Pipeline latency is bounded below by VAD endpointing (~150 ms minimum to confirm end-of-utterance) plus ASR finalization plus LLM TTFT plus TTS first-chunk. Even with whisper.cpp small.en + Llama-3.2-1B + Kokoro, you’re looking at ~600–900 ms end-to-end on M-series Macs. The thing pipeline gets that E2E does not: independently swappable, independently quantizable components with mature tooling and clear failure modes.
Practical recommendation backed by what’s shipping: pipeline today, with an upgrade path to Moshi-MLX once it stabilizes.
4. Native macOS APIs — what to actually use
The two relevant Apple frameworks both run fully on-device, both are free (nothing for Locara to bundle or host; any locale models are fetched by the OS), and both work in Tauri via a Swift sidecar or objc2 Rust bindings.
| API | Use case | Trade-off vs third-party |
|---|---|---|
| `SFSpeechRecognizer` (legacy) | Short utterances, command/control | Whisper.cpp wins on long-form quality and is more configurable |
| `SpeechAnalyzer` + `SpeechTranscriber` (macOS 26+) | Long-form transcription, lectures, meetings | Often beats whisper-large-v3-turbo on speed; comparable quality. Locale-by-locale model download managed by the OS (zero cost to Locara). Strong default for English / supported locales. |
| `AVSpeechSynthesizer` | TTS for any text | Free, system-routed, supports Personal Voice (with `requestPersonalVoiceAuthorization`). Lower expressive range than Kokoro/Orpheus. Integrates with VoiceOver. |
| `SpeechDetector` | VAD / endpointing | Replaces hand-rolled energy thresholds; pairs naturally with `SpeechTranscriber` |
The right architecture for Locara’s voice-to-voice modality on macOS is therefore: default to SpeechAnalyzer for STT and AVSpeechSynthesizer for TTS, with opt-in upgrades to whisper.cpp and Kokoro/Piper/Moshi-MLX. This mirrors how Locara already uses Apple Vision for OCR by default with a fallback to GLM-OCR / RapidOCR.
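One way to express this default-plus-override rule is a small selection function. This is a sketch under assumptions: `defaultVoiceStack` is a hypothetical helper, the model ids `apple-speech-analyzer`, `apple-avspeech` and `kokoro-82m` are borrowed from the SDK example in this note, `whisper-small.en` is an illustrative id, and the version threshold reflects `SpeechAnalyzer` shipping with macOS Tahoe (version 26).

```typescript
interface VoiceStack {
  stt: string;
  tts: string;
}

// Picks Apple APIs where available and lets apps opt in to other models.
function defaultVoiceStack(macosMajor: number, overrides: Partial<VoiceStack> = {}): VoiceStack {
  const hasSpeechAnalyzer = macosMajor >= 26; // SpeechAnalyzer arrived with macOS Tahoe
  const base: VoiceStack = {
    stt: hasSpeechAnalyzer ? "apple-speech-analyzer" : "whisper-small.en",
    tts: "apple-avspeech", // AVSpeechSynthesizer exists on older releases too
  };
  return {
    stt: overrides.stt ?? base.stt,
    tts: overrides.tts ?? base.tts,
  };
}

const tahoe = defaultVoiceStack(26);
const older = defaultVoiceStack(15);
const customTts = defaultVoiceStack(26, { tts: "kokoro-82m" });
```

Keeping the override as a partial object means an app that only cares about voice quality can swap the TTS without also taking responsibility for the STT choice.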
5. Wiring design for the manifest
Should voice-to-voice be a top-level modality?
Yes — and the existing spec already lists it (spec/04-modalities.md). The expansion needs updating to reflect Q1 2026 reality:
```
voice-to-voice → device.microphone
               + device.speaker        (NEW; see below)
               + audio.record + audio.play SDK
               + one of:
                   (a) STT model + LLM model + TTS model     (pipeline default)
                   (b) unified omni model (e.g., moshi-7b)   (E2E option)
                   (c) Apple SpeechAnalyzer + AVSpeech       (zero-model fallback)
               + voice.* SDK module
```
Critically: keep speech-to-text and text-to-speech as separate first-class modalities. Apps that only need one (a transcription tool, a screen reader) shouldn’t pull the whole voice-to-voice expansion. Apps that need full duplex declare voice-to-voice. The expansion is a convenience bundle, not a replacement for the granular modalities.
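The "convenience bundle" idea can be illustrated as a lookup from declared modalities to the grants and SDK surfaces they imply. Only the `voice-to-voice` entry is taken from the expansion above; the entries for the two granular modalities and the function name `expandModalities` are assumptions made for the example.

```typescript
type Modality = "speech-to-text" | "text-to-speech" | "voice-to-voice";

// Grants and SDK surfaces implied by each modality (granular entries are assumed).
const expansions: Record<Modality, string[]> = {
  "speech-to-text": ["device.microphone", "audio.record"],
  "text-to-speech": ["device.speaker", "audio.play"],
  "voice-to-voice": [
    "device.microphone",
    "device.speaker",
    "audio.record",
    "audio.play",
    "voice.*",
  ],
};

// Union of everything the declared modalities imply, de-duplicated and sorted.
function expandModalities(declared: Modality[]): string[] {
  const result: string[] = [];
  for (const modality of declared) {
    for (const item of expansions[modality]) {
      if (result.indexOf(item) === -1) result.push(item);
    }
  }
  return result.sort();
}

const transcriptionTool = expandModalities(["speech-to-text"]);
const voiceAgent = expandModalities(["voice-to-voice"]);
```

A transcription-only app ends up with microphone access and nothing else, which is the reason for keeping the granular modalities first-class.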
Capability grant: do we need device.speaker?
The W3C Audio Output Devices API and the corresponding Permissions-Policy: speaker-selection directive (W3C Candidate Recommendation Draft, 2025-10-09) treat speaker access as a permissioned feature with a real threat model: a malicious app could blast loud audio out of the laptop’s speakers when the user is wearing headphones, or route audio through unintended output devices. The W3C spec specifically calls out the “library laptop with USB headset” scenario.
For Locara this maps to:
- `device.speaker: true` grants the right to play audio through the default output device. macOS does not have a TCC permission for audio playback per se, but the principle of declaring intent is consistent with Locara’s other capabilities.
- `device.speaker.select: true` is separately required to enumerate or select non-default audio output devices (extending the W3C model). Probably defer to v2.
The threat model: even without select, an app with speaker: true can be a nuisance (random audio bursts, unwanted TTS). Mitigations:
- Runtime audio output is gated through the Locara plugin, which can rate-limit, fade, and respect a global “audio-allowed” toggle.
- Per the cool-down rules in `spec/03-capabilities.md`, adding `device.speaker` to an existing app on update triggers a 7-day cool-down + re-consent.
- The `voice-to-voice` and `text-to-speech` modality expansions should auto-grant `device.speaker`, so most apps never see this capability directly.
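A runtime gate along the lines of the first mitigation might look like the sketch below. It is an illustration only: `SpeakerGate`, its decision strings and the minimum-gap rate limit are hypothetical, and fading is left out for brevity.

```typescript
type PlayDecision = "ok" | "not-granted" | "muted" | "rate-limited";

// Hypothetical gate every playback request would pass through.
class SpeakerGate {
  audioAllowed = true; // global "audio-allowed" toggle controlled by the user
  private lastPlayMs = -Infinity;
  private granted: boolean;
  private minGapMs: number;

  constructor(granted: boolean, minGapMs: number) {
    this.granted = granted; // whether the app holds device.speaker
    this.minGapMs = minGapMs; // minimum spacing between playback starts
  }

  requestPlay(nowMs: number): PlayDecision {
    if (!this.granted) return "not-granted";
    if (!this.audioAllowed) return "muted";
    if (nowMs - this.lastPlayMs < this.minGapMs) return "rate-limited";
    this.lastPlayMs = nowMs;
    return "ok";
  }
}

const gate = new SpeakerGate(true, 500);
const decisions: PlayDecision[] = [
  gate.requestPlay(0), // first request passes
  gate.requestPlay(100), // too soon after the previous one
  gate.requestPlay(700), // far enough apart again
];
gate.audioAllowed = false;
decisions.push(gate.requestPlay(2000)); // user has muted audio globally
const ungranted = new SpeakerGate(false, 500).requestPlay(0);
```

Passing the clock in as an argument keeps the gate deterministic and easy to test; a real runtime would supply a monotonic timestamp.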
SDK surface
The existing SDK in spec/05-sdk.md follows a strict pattern: one module per modality, with both one-shot and streaming variants (e.g., transcribe.fromFile + transcribe.stream). The voice modality should follow the same pattern, not invent a new “agent runner” abstraction.
```typescript
import { voice } from '@locara/sdk'

// Pipeline form (default expansion): explicit, debuggable
const session = voice.session({
  stt: { model: 'apple-speech-analyzer' }, // or 'whisper-large-v3-turbo'
  llm: { model: 'qwen2.5-3b-instruct-q4', system: '...' },
  tts: { model: 'apple-avspeech' }, // or 'kokoro-82m'
})

await session.start() // requests mic + speaker grants

for await (const ev of session.events()) {
  // ev: { type: 'partial-transcript' | 'final-transcript' | 'llm-token' |
  //             'audio-chunk' | 'turn-end' | 'barge-in' | 'error' }
}

// E2E form (when an omni model is selected): same shape, fewer events.
// Named differently so both forms can live in one file.
const omniSession = voice.session({ omni: { model: 'moshi-7b-mlx-int4' } })
```
Reasons to prefer voice.session({...}) over agent.runVoice(...) or voice.converse(...):
- Symmetry with `transcribe.live`, `llm.chatStream`, `db.transaction`: Locara’s SDK is already module-shaped, not agent-shaped.
- Inspectability: the manifest pins which models will be used; the SDK call should reflect those names so static analysis (`spec/03-capabilities.md`) can verify that referenced models are declared.
- The same call shape works for pipeline and E2E: apps don’t have to rewrite their code to upgrade from “Apple defaults” to “Moshi-MLX” once it’s available.
A Float32Array async-iterable for raw audio is the right low-level primitive, but most apps shouldn’t see audio bytes. They should see semantic events (transcripts, model output, turn boundaries). Expose raw audio as session.rawInput() / session.rawOutput() for advanced cases (recording, custom UI visualization).
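To show what consuming semantic events rather than audio bytes could look like, here is a sketch that folds a stream of events into a conversation transcript. The event type names come from the SDK example in this note; the payload fields (`text`, `token`, `samples`, `message`) and the `applyEvent` reducer are assumptions for illustration.

```typescript
type SessionEvent =
  | { type: "partial-transcript"; text: string }
  | { type: "final-transcript"; text: string }
  | { type: "llm-token"; token: string }
  | { type: "audio-chunk"; samples: Float32Array }
  | { type: "turn-end" }
  | { type: "barge-in" }
  | { type: "error"; message: string };

interface Transcript {
  user: string[];
  assistant: string[];
  pending: string; // assistant text generated so far in the current turn
}

// Pure reducer: the UI state depends only on semantic events, never on audio.
function applyEvent(t: Transcript, ev: SessionEvent): Transcript {
  switch (ev.type) {
    case "final-transcript":
      return { ...t, user: t.user.concat(ev.text) };
    case "llm-token":
      return { ...t, pending: t.pending + ev.token };
    case "turn-end":
    case "barge-in":
      // Commit whatever the assistant managed to say before the turn closed.
      return t.pending.length > 0
        ? { ...t, assistant: t.assistant.concat(t.pending), pending: "" }
        : t;
    case "partial-transcript":
    case "audio-chunk":
    case "error":
      return t; // not part of the committed transcript in this sketch
  }
}

const events: SessionEvent[] = [
  { type: "partial-transcript", text: "hel" },
  { type: "final-transcript", text: "hello" },
  { type: "llm-token", token: "Hi" },
  { type: "llm-token", token: " there" },
  { type: "turn-end" },
];

let transcript: Transcript = { user: [], assistant: [], pending: "" };
for (const ev of events) {
  transcript = applyEvent(transcript, ev);
}
```

The same reducer would work unchanged for the omni form, which is expected to emit fewer event types, because unknown-to-the-UI events simply pass through.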
6. Public local voice-agent demos to mirror
- sherpa-onnx-go-macos (k2-fsa) — real-time voice assistant in Go using sherpa-onnx (STT + TTS + VAD), Whisper, Kokoro, Ollama. Pipeline pattern, fully local. Worth reading the audio plumbing.
- Carlos Mbendera’s Sherpa-Onnx Swift integration (Medium, 2025) — shows the Swift binding pattern for Apple Silicon; useful for a Tauri Swift sidecar.
- Granola voice mode — closed-source but architecturally a pipeline; their UX for partial transcripts + barge-in is widely-copied.
- MacWhisper — pipeline only, but the gold standard for “whisper-on-Mac” UX.
- Argmax WhisperKit + SpeechAnalyzer comparison — Argmax’s blog explicitly benchmarks the two; the takeaway is that SpeechAnalyzer wins on fresh-install latency but WhisperKit wins on configurability.
- Yap (CLI, MacRumors-tested): minimal example of using `SpeechAnalyzer` for batch transcription; useful as a reference for the SpeechAnalyzer Swift sidecar Locara would write.
No public demo yet uses Moshi end-to-end on Apple Silicon as a daily-driver voice agent — Moshi-MLX exists, but the ecosystem hasn’t shipped a polished consumer-facing app on top of it.
Specific Locara learnings
- Ship `voice-to-voice` as a pipeline-by-default modality in v1, with the expansion picking Apple `SpeechAnalyzer` (STT) + Llama-3.2-3B-Instruct-q4 (LLM) + `AVSpeechSynthesizer` (TTS) on macOS 26+, falling back to whisper.cpp + Kokoro on older macOS. This gives a working zero-extra-download voice agent on every supported Mac, then lets opinionated apps override.
- Do not block on Moshi/Qwen-Omni for v1. llama.cpp supports omni-model audio input but not audio output as of Q1 2026 (see PR ggml-org/llama.cpp#13784 + Issue #12673). The fastest route to “real” E2E is a separate `locara-moshi` crate using MLX-Rust or Candle, added in a later milestone. Keep the modality manifest stable so apps don’t have to rewrite when E2E lands.
- Add `device.speaker` as a new capability in `spec/03-capabilities.md`, mapped to a Locara runtime gate (no native macOS TCC equivalent, but the W3C `speaker-selection` model gives a clean threat-model story). Both `text-to-speech` and `voice-to-voice` modality expansions auto-grant it. Adding `device.speaker` on update triggers the existing 7-day cool-down rule.
- Use the existing `voice.session({...})` shape, not an `agent.runVoice` wrapper. The same call signature must accept either `{stt, llm, tts}` (pipeline) or `{omni}` (E2E), so apps don’t fork code paths when upgrading. Emit semantic events (`partial-transcript`, `audio-chunk`, `turn-end`, `barge-in`) instead of raw audio frames; expose raw frames via `rawInput()` / `rawOutput()` for advanced cases.
- Default STT should be `SpeechAnalyzer` on macOS 26+, whisper.cpp `small.en` + Core ML elsewhere. SpeechAnalyzer needed ~55% less wall-clock time than whisper-large-v3-turbo with comparable quality (per Yap/MacRumors benchmarks), requires zero model download for the user, and leaves locale models to Apple. Keep whisper.cpp as the override for cross-version consistency, multilingual support beyond Apple’s locales, and verifiable open-source provenance.
- Default TTS should be `AVSpeechSynthesizer` (with optional Personal Voice). Free, system-routed, accessible, no model download. Reserve Kokoro/Piper/Orpheus for apps that need controlled voice quality or voice cloning, and gate those behind explicit model declarations in the manifest, since they cost RAM and disk.
- For interruption / barge-in: bake it in. Pipeline implementations can do it via VAD ducking; E2E models (Moshi) handle it natively. Either way, apps should not have to wire it themselves: `voice.session` should emit `barge-in` events and pause TTS automatically. This is the primary UX differentiator vs. naive pipelines.
- Plan a `locara-voice` crate that owns the audio I/O, VAD, and turn-taking state machine, keeping it out of `locara-llama` and `locara-whisper` so the same state machine is reused whether the LLM is text-only or omni. The crate’s job is “raw audio in/out + turn boundaries”; the modality expansion wires it to whatever model triple (or single omni model) the manifest names.
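The turn-taking state machine that `locara-voice` would own, including automatic barge-in, could be shaped roughly like this. It is a sketch under assumptions: the state and input names and the `stepTurn` function are hypothetical, and the eventual crate would be written in Rust rather than TypeScript.

```typescript
type TurnState = "idle" | "listening" | "thinking" | "speaking";
type TurnInput = "speech-start" | "speech-end" | "reply-audio" | "playback-done";

interface Transition {
  next: TurnState;
  emit: string[]; // semantic events surfaced to the app
}

// Model-agnostic: inputs come from VAD and playback, not from a specific backend.
function stepTurn(state: TurnState, input: TurnInput): Transition {
  if (input === "speech-start") {
    // User speech while a reply is pending or playing counts as a barge-in.
    const interrupting = state === "thinking" || state === "speaking";
    return { next: "listening", emit: interrupting ? ["barge-in"] : [] };
  }
  if (state === "listening" && input === "speech-end") {
    return { next: "thinking", emit: [] };
  }
  if (state === "thinking" && input === "reply-audio") {
    return { next: "speaking", emit: [] };
  }
  if (state === "speaking" && input === "playback-done") {
    return { next: "idle", emit: ["turn-end"] };
  }
  return { next: state, emit: [] }; // ignore inputs that do not apply
}

// The user interrupts the first reply, then lets the second one finish.
const inputs: TurnInput[] = [
  "speech-start", "speech-end", "reply-audio",
  "speech-start", "speech-end", "reply-audio", "playback-done",
];
let turnState: TurnState = "idle";
const emitted: string[] = [];
for (const input of inputs) {
  const transition = stepTurn(turnState, input);
  turnState = transition.next;
  emitted.push(...transition.emit);
}
```

Because the function only sees speech and playback signals, the same transitions apply whether the reply audio comes from a TTS stage or from an omni model, which is the reuse argument made in the last bullet.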
References
- Moshi: a speech-text foundation model for real-time dialogue (arXiv 2410.00037)
- Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming (arXiv 2408.16725)
- Mini-Omni2: Towards Open-source GPT-4o Model with Vision, Speech and Duplex (arXiv 2410.11190)
- kyutai-labs/moshi GitHub repo
- Kyutai TTS / Pocket TTS (Kyutai, Jan 2026)
- QwenLM/Qwen2.5-Omni GitHub
- QwenLM/Qwen3-Omni GitHub
- llama.cpp PR #13784 — mtmd: support Qwen 2.5 Omni (input audio+vision, no audio output)
- llama.cpp Issue #12673 — Feature Request: Qwen2.5-Omni
- ggml-org/Qwen2.5-Omni-7B-GGUF
- whisper.cpp (ggml-org)
- Apple SpeechAnalyzer documentation
- WWDC25 — Bring advanced speech-to-text to your app with SpeechAnalyzer
- MacRumors — Apple’s New Transcription APIs Blow Past Whisper in Speed Tests
- Apple SFSpeechRecognizer documentation
- Apple AVSpeechSynthesizer documentation
- WWDC23 — Extend Speech Synthesis with personal and custom voices (Personal Voice)
- W3C Audio Output Devices API (CR Draft, 2025-10-09)
- MDN — Permissions-Policy: speaker-selection
- k2-fsa/sherpa-onnx GitHub
- agalue/sherpa-voice-assistant — local Go voice assistant
- Local TTS Guide 2026 — LocalClaw
- 12 Best Open-Source TTS Models Compared (Inferless)