Local voice-to-voice and unified-modality SLMs (research note, 2026-Q1)

What this is: State of the art for on-device voice agents, focused on what Locara should add next after its existing text-LLM + embeddings + Whisper + Apple Vision OCR stack.

Status: Reference document. Pair with spec/04-modalities.md (the modality schema) and the eventual locara-voice crate.

Most relevant to Locara: Pipeline-by-default in v1 with Apple-API fallbacks; unified-omni models (Moshi-MLX, NVIDIA persona*, Qwen-Omni) reachable via the same SDK shape once a concrete backend lands.

Implementation status (2026-05-03 update)

The integration scaffolding is in place; what’s missing is a concrete VoiceBackend implementation for any specific model. Concretely:

Trait (crates/locara-core/src/voice.rs): VoiceBackend defines start_session / push_audio / event_stream / stop_session. Events: Audio { samples, sample_rate }, AssistantPartial, UserPartial, TurnEnd, Error. Tested.
Stub (StubOmniVoiceBackend in the same file): refuses every call with a structured pointer to this note + BACKLOG. The runtime’s voice.session_* IPC dispatches here when no concrete backend is wired, so apps using voice.session({omni: ...}) get a clear failure mode rather than a generic “not implemented”.
IPC (crates/locara-runtime/src/tauri_plugin.rs): four commands — voice.session_start, voice.session_push, voice.session_events, voice.session_stop. Capability-checked. Wired into the registered command list. Reflected in apps/voice’s permissions/locara.toml.
SDK (packages/sdk/src/voice.ts): voice.session({stt, llm, tts}) for the pipeline form (works today, composes transcribe.stream + llm.chatStream + a TTS sink); voice.session({omni: {model, ...}}) for the omni form (calls the IPC; with no backend wired, surfaces the stub’s error as a VoiceEvent.error).
App (apps/voice/): runs the pipeline form with sentence-level streaming TTS so the perceived latency approximates a true voice-to-voice model.
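
The sentence-level streaming that apps/voice relies on can be sketched as a small buffer that releases complete sentences to the TTS sink while the LLM is still generating. This is an illustration only, not the app's actual code: `SentenceChunker` and its boundary rule (`.`, `!` or `?` followed by whitespace) are assumptions.

```typescript
// Hypothetical helper: buffers streamed LLM text and releases whole sentences,
// so TTS can start speaking before the full reply has been generated.
class SentenceChunker {
  private buffer = "";

  // Feed one streamed token; returns any sentences completed by it.
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    // Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    // Abbreviations such as "e.g. " will split early.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.buffer.slice(start, end).trim());
      start = end;
    }
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  // Call at the end of the LLM stream to release the trailing fragment, if any.
  flush(): string | null {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? rest : null;
  }
}
```

Each returned sentence would be handed to the TTS sink immediately, so the first audio depends on the length of the first sentence rather than on the length of the whole reply.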

To plug in a concrete model — e.g. nvidia/personaplex-7b-v1, kyutai/moshiko, Qwen/Qwen2.5-Omni-7B — write a new crate (locara-personaplex, locara-moshi, etc.) that:

Loads the model weights via Candle, MLX-rust, or whatever the model’s native runtime is.
Implements VoiceBackend from locara-core.
Is registered by apps on LocaraState via with_voice_backend(...) at startup.

No SDK or app changes required when this lands — the surface is already stable.
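
How an app might fold the event stream into UI state, including the stub's structured error, can be sketched as a pure reducer. The event names mirror the trait's variants (Audio, AssistantPartial, UserPartial, TurnEnd, Error), but the TypeScript field names, the `SessionView` shape, and the assumption that each partial carries the full text so far are illustrative, not the SDK's published types.

```typescript
// Assumed TypeScript mirror of the VoiceBackend events listed above.
type VoiceEvent =
  | { type: "audio"; samples: Float32Array; sampleRate: number }
  | { type: "user-partial"; text: string }
  | { type: "assistant-partial"; text: string }
  | { type: "turn-end" }
  | { type: "error"; message: string };

// Hypothetical UI-facing state derived from the event stream.
interface SessionView {
  userText: string;
  assistantText: string;
  audioSeconds: number;
  turns: number;
  failed: string | null;
}

const emptyView: SessionView = {
  userText: "",
  assistantText: "",
  audioSeconds: 0,
  turns: 0,
  failed: null,
};

// Pure reducer: one event in, next view out.
function applyEvent(view: SessionView, ev: VoiceEvent): SessionView {
  switch (ev.type) {
    case "audio":
      // Track how much assistant audio has been received so far.
      return { ...view, audioSeconds: view.audioSeconds + ev.samples.length / ev.sampleRate };
    case "user-partial":
      // Assumption: each partial replaces the previous one.
      return { ...view, userText: ev.text };
    case "assistant-partial":
      return { ...view, assistantText: ev.text };
    case "turn-end":
      return { ...view, turns: view.turns + 1 };
    case "error":
      // With no backend wired, the stub's refusal would arrive here.
      return { ...view, failed: ev.message };
  }
}
```

When `failed` becomes non-null on an omni session, an app could reopen the session in the pipeline form without changing the rest of its event handling.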

1. What “voice-to-voice” actually means

Two architectures compete in this space, and the gap between them defines almost every design tradeoff.

End-to-end (E2E) speech models ingest audio tokens directly and emit audio tokens directly, with the language model reasoning over a multi-stream representation that includes both text and acoustic codebooks. There is no transcript on the critical path. Moshi, Mini-Omni, Mini-Omni2, GLM-4-Voice, Step-Audio, and Qwen2.5-Omni / Qwen3-Omni are the public examples. The breakthrough that makes them viable is treating speech as a sequence of neural codec tokens (Mimi for Moshi, SNAC for Mini-Omni) interleaved with text tokens — the LM predicts both streams in parallel, so first-audio-out can happen before a full text response is generated.

Pipeline architectures (ASR → LLM → TTS) are the conservative choice — three independent models hand text between them. Quality of each component is independently tunable, latency stacks (~150 ms VAD + ~100–300 ms ASR + ~LLM-TTFT + ~80–200 ms TTS first-chunk = typically 600–1200 ms first-audio), and total RAM is the sum of three models. Every shipping local voice assistant in early 2026 (sherpa-onnx demos, Open WebUI voice, sokuji-tsuyaku, Granola’s local mode) is still pipeline-based — E2E remains research-grade for production deployment.
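
The stacked-latency figures above can be checked with simple range arithmetic. The sketch below uses only the numbers quoted in the paragraph and derives the LLM time-to-first-token they imply; the type and variable names are illustrative.

```typescript
// A latency range in milliseconds.
interface RangeMs { min: number; max: number }

function addRanges(a: RangeMs, b: RangeMs): RangeMs {
  return { min: a.min + b.min, max: a.max + b.max };
}

// Stage figures quoted in the paragraph above.
const vadEndpoint: RangeMs = { min: 150, max: 150 };
const asrFinal: RangeMs = { min: 100, max: 300 };
const ttsFirstChunk: RangeMs = { min: 80, max: 200 };
const typicalFirstAudio: RangeMs = { min: 600, max: 1200 };

// Everything except the LLM: 330 to 650 ms.
const nonLlm = [vadEndpoint, asrFinal, ttsFirstChunk].reduce((acc, r) => addRanges(acc, r));

// LLM time-to-first-token implied by the typical total: 270 to 550 ms.
const impliedLlmTtft: RangeMs = {
  min: typicalFirstAudio.min - nonLlm.min,
  max: typicalFirstAudio.max - nonLlm.max,
};
```

In other words, roughly half of the typical first-audio latency is fixed overhead around the LLM, which is the part an E2E model removes.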

The honest summary as of Q1 2026: E2E exists and works, but production-grade open weights are limited to Moshi (Kyutai) for true full-duplex, and Qwen2.5-Omni / Qwen3-Omni for half-duplex. Mini-Omni is a research artifact; GLM-4-Voice and Step-Audio are usable but Chinese-leaning. Pipeline is what nearly every local voice app still ships.

2. Specific local-runnable models

End-to-end / unified

Model	Params	Quant	RAM (real)	Local runtime	Status Q1 2026
Moshi (Moshiko/Moshika)	7B	int4 / int8 / bf16	~6 GB int4, ~14 GB bf16	PyTorch, MLX (int4/int8/bf16), Rust/Candle	Code MIT / Apache-2.0, weights CC-BY 4.0; ~200 ms practical latency; full-duplex; English-only
Kyutai TTS / Pocket TTS	100M (Pocket) / 1B+ (full)	fp16	<1 GB (Pocket)	Rust/Candle, MLX	Pocket TTS released Jan 2026, CPU real-time
Mini-Omni / Mini-Omni2	0.5B base + audio heads	fp16 only	~3 GB	PyTorch reference	Research-grade; no GGUF; no real maintained Apple Silicon path
GLM-4-Voice	9B (LM) + tokenizer + decoder	fp16/int4	~12 GB int4	PyTorch	Strong CN/EN; Apache-2.0; no GGUF audio output
Step-Audio	130B teacher / smaller distilled variants	int4	varies	PyTorch	Mostly cloud; smaller variants exist but weights heavy
Qwen2.5-Omni	3B and 7B	Q4_K_M (4.7 GB), Q8_0 (8.1 GB) on 7B	5–10 GB	llama.cpp via `llama-mtmd-cli` (audio IN only), PyTorch (full)	GGUF supports audio + vision input; audio output is NOT supported in llama.cpp as of early 2026
Qwen3-Omni	30B-A3B (MoE)	varies	larger	PyTorch	End-to-end omni; same llama.cpp limitation expected to persist

Pipeline components

STT (Whisper family via whisper.cpp, all Core ML / ANE accelerated)

Variant	Disk	RAM	Notes
`tiny` / `tiny.en`	75 MB	~1 GB	Real-time on RPi-class hardware
`base` / `base.en`	142 MB	~1 GB	Best speed/quality for live captions
`small` / `small.en`	466 MB	~2 GB	The sweet spot for most apps
`medium` / `medium.en`	1.5 GB	~5 GB	Diminishing returns vs. small.en for English
`large-v3` / `large-v3-turbo`	1.5–3 GB	6–10 GB	Best quality; turbo variants designed for streaming

Whisper.cpp + Core ML encoder gives roughly 3× speedup vs CPU on Apple Silicon (ANE). Locara already ships this.

Newer ASR options worth tracking

Apple SpeechAnalyzer (macOS Tahoe / iOS 26, June 2025): replaces SFSpeechRecognizer, fully on-device, modular (SpeechTranscriber + SpeechDetector), tuned for long-form. Independent benchmarks (Yap, MacRumors test with a 34-min file): ~45 s vs ~101 s for MacWhisper Large-V3-Turbo, i.e. about 55% less wall-clock time (roughly 2.2× faster) with comparable quality. Requires macOS 26+.
Moshi STT (extracted Mimi + ASR head): streaming, sub-100 ms partial latency. New in 2025; not as battle-tested as Whisper but real-time-first by design.

TTS — local options

Model	Params	RAM	Quality	Speed
Piper	<50 MB per voice	<500 MB	Good (parametric)	Real-time on CPU; edge-friendly
Kokoro-82M	82M	~500 MB	Very good	<0.3 s for short utterance; ~36× real-time on free Colab GPU; ~30–45 s for 1500 words on M1 Air 8 GB
F5-TTS	~330M	~2 GB	Excellent (zero-shot voice clone)	Slower than Kokoro
XTTS-v2	~750M	~3 GB	Excellent (multilingual + clone)	Coqui project archived; community fork
Orpheus 3B	3B	~6 GB int4	Best emotional range (laugh, cry, whisper)	Heaviest of the bunch; needs GPU/ANE for real-time
Apple `AVSpeechSynthesizer`	(system)	0 (system-managed)	Good with neural voices; great with Personal Voice	Real-time, free, system audio routing handled
Kyutai Pocket TTS	100M	<1 GB	Good	CPU real-time

llama.cpp / GGUF compatibility — concrete answer

llama.cpp supports audio input for Qwen2.5-Omni via llama-mtmd-cli and llama-server (PR #13784 merged mid-2025). It does not support audio output / speech generation for any unified-modality model — the codec-decoder side has not been ported. So Locara cannot get a single-binary llama.cpp pipeline that does end-to-end voice. Pipeline is forced, regardless of which omni-model you pick, if llama.cpp is the only inference engine.

3. Latency and quality trade-offs

The Moshi paper (arXiv 2410.00037) is the canonical reference on E2E latency and explains the win clearly: a 7B language model outputs discrete acoustic codec tokens (Mimi, 12.5 Hz, ~1100 bps) in parallel with text tokens via an “Inner Monologue” stream. Theoretical latency is ~160 ms; measured ~200 ms. Two enabling moves:

Streaming neural audio codec: Mimi encodes and decodes in 80 ms frames (12.5 Hz), so the first audio frame can be played before the full response is decided.
Multi-stream parallel decoding — text and audio share one transformer; speaker turns are not explicit (no “now my turn” handoff), so the model handles barge-in / interruption natively.
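
The codec figures can be checked with a little arithmetic. The 8 codebooks of 2048 entries each, and the one-frame acoustic delay, are taken from the Moshi paper rather than from this note; the function names are illustrative.

```typescript
// Bitrate of a residual-vector-quantized codec: each frame emits one index
// per codebook, and each index costs log2(codebookSize) bits.
function codecBitrateBps(frameRateHz: number, codebooks: number, codebookSize: number): number {
  return frameRateHz * codebooks * Math.log2(codebookSize);
}

// Duration of one codec frame in milliseconds.
function frameMs(frameRateHz: number): number {
  return 1000 / frameRateHz;
}

// Mimi: 12.5 frames/s, 8 codebooks, 2048 entries (11 bits) each.
const mimiBitrateBps = codecBitrateBps(12.5, 8, 2048); // 12.5 * 8 * 11 = 1100 bps
const mimiFrameMs = frameMs(12.5);                     // 80 ms

// Theoretical floor: one frame to encode plus one frame of acoustic delay.
const theoreticalLatencyMs = mimiFrameMs + mimiFrameMs; // 160 ms
```

The gap between the 160 ms floor and the measured ~200 ms is compute time on the test hardware.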

Mini-Omni (arXiv 2408.16725) takes a different shortcut — the “Any Model Can Talk” framework adds audio output heads to a pretrained text LM, training only the new heads while preserving language quality. They use SNAC (8 codebooks, hundreds of tokens/sec) with text-instructed delayed parallel generation to avoid drowning the LM in long codebook sequences. Their measured TTFT for audio is ~300 ms — slower than Moshi but with a much smaller base model.

Pipeline latency is bounded below by VAD endpointing (~150 ms minimum to confirm end-of-utterance) plus ASR finalization plus LLM TTFT plus TTS first-chunk. Even with whisper.cpp small.en + Llama-3.2-1B + Kokoro, you’re looking at ~600–900 ms end-to-end on M-series Macs. What the pipeline offers that E2E does not: independently swappable, independently quantizable components with mature tooling and clear failure modes.

Practical recommendation backed by what’s shipping: pipeline today, with an upgrade path to Moshi-MLX once it stabilizes.

4. Native macOS APIs — what to actually use

The two relevant Apple frameworks both run fully on-device, both are free (no model download for Locara to bundle or manage; the OS fetches any locale assets itself), and both work in Tauri via a Swift sidecar or objc2 Rust bindings.

API	Use case	Trade-off vs third-party
`SFSpeechRecognizer` (legacy)	Short utterances, command/control	Whisper.cpp wins on long-form quality and is more configurable
`SpeechAnalyzer` + `SpeechTranscriber` (macOS 26+)	Long-form transcription, lectures, meetings	Often beats whisper-large-v3-turbo on speed; comparable quality. Locale-by-locale model download managed by the OS (zero cost to Locara). Strong default for English / supported locales.
`AVSpeechSynthesizer`	TTS for any text	Free, system-routed, supports Personal Voice (with `requestPersonalVoiceAuthorization`). Lower expressive range than Kokoro/Orpheus. Integrates with VoiceOver.
`SpeechDetector`	VAD / endpointing	Replaces hand-rolled energy thresholds; pairs naturally with `SpeechTranscriber`

The right architecture for Locara’s voice-to-voice modality on macOS is therefore: default to SpeechAnalyzer for STT and AVSpeechSynthesizer for TTS, with opt-in upgrades to whisper.cpp and Kokoro/Piper/Moshi-MLX. This mirrors how Locara already uses Apple Vision for OCR by default with a fallback to GLM-OCR / RapidOCR.
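
The default-plus-override rule can be sketched as a small selection function. This is an assumption-laden illustration: `pickVoiceDefaults` is hypothetical, the version threshold reflects that SpeechAnalyzer shipped with macOS Tahoe (version 26), the older-macOS branch uses the whisper.cpp + Kokoro fallback named later in this note, and `whisper-small.en` is a made-up model id.

```typescript
interface VoiceDefaults { stt: string; tts: string }

// First macOS major version that ships SpeechAnalyzer (Tahoe is version 26).
const SPEECH_ANALYZER_MIN_MACOS = 26;

// Picks the zero-download Apple defaults where available, then applies any
// explicit per-app overrides on top.
function pickVoiceDefaults(
  macosMajor: number,
  overrides: Partial<VoiceDefaults> = {},
): VoiceDefaults {
  const base: VoiceDefaults =
    macosMajor >= SPEECH_ANALYZER_MIN_MACOS
      ? { stt: "apple-speech-analyzer", tts: "apple-avspeech" }
      : { stt: "whisper-small.en", tts: "kokoro-82m" }; // hypothetical ids
  return {
    stt: overrides.stt ?? base.stt,
    tts: overrides.tts ?? base.tts,
  };
}
```

An app that wants Kokoro's voice on a current Mac would call `pickVoiceDefaults(26, { tts: "kokoro-82m" })` and keep the Apple STT default.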

5. Wiring design for the manifest

Should `voice-to-voice` be a top-level modality?

Yes — and the existing spec already lists it (spec/04-modalities.md). The expansion needs updating to reflect Q1 2026 reality:

voice-to-voice  →  device.microphone
                +  device.speaker            (NEW — see below)
                +  audio.record + audio.play SDK
                +  one of:
                    (a) STT model + LLM model + TTS model     (pipeline default)
                    (b) unified omni model (e.g., moshi-7b)   (E2E option)
                    (c) Apple SpeechAnalyzer + AVSpeech       (zero-model fallback)
                +  voice.* SDK module

Critically: keep speech-to-text and text-to-speech as separate first-class modalities. Apps that only need one (a transcription tool, a screen reader) shouldn’t pull the whole voice-to-voice expansion. Apps that need full duplex declare voice-to-voice. The expansion is a convenience bundle, not a replacement for the granular modalities.
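
The "convenience bundle" idea can be sketched as a lookup that expands declared modalities into a deduplicated capability set. The voice-to-voice row follows the expansion above; the speech-to-text and text-to-speech rows, and the function name, are assumptions for illustration.

```typescript
type Modality = "voice-to-voice" | "speech-to-text" | "text-to-speech";

// Device capabilities implied by each modality.
// Only the voice-to-voice row is taken from the expansion above;
// the two granular rows are assumed.
const EXPANSION: Record<Modality, string[]> = {
  "voice-to-voice": ["device.microphone", "device.speaker"],
  "speech-to-text": ["device.microphone"],
  "text-to-speech": ["device.speaker"],
};

// Union of capabilities for everything the manifest declares, sorted and
// without duplicates.
function expandModalities(declared: Modality[]): string[] {
  const all: string[] = [];
  for (const m of declared) {
    for (const cap of EXPANSION[m]) {
      if (all.indexOf(cap) === -1) all.push(cap);
    }
  }
  return all.sort();
}
```

A transcription-only app declaring `speech-to-text` ends up with the microphone grant alone, which is the point of keeping the granular modalities first-class.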

Capability grant: do we need `device.speaker`?

The W3C Audio Output Devices API and the corresponding Permissions-Policy: speaker-selection directive (W3C Candidate Recommendation Draft, 2025-10-09) treat speaker access as a permissioned feature with a real threat model: a malicious app could blast loud audio out of the laptop’s speakers when the user is wearing headphones, or route audio through unintended output devices. The W3C spec specifically calls out the “library laptop with USB headset” scenario.

For Locara this maps to:

device.speaker: true — grants the right to play audio through the default output device. macOS does not have a TCC permission for audio playback per se, but the principle of declaring intent is consistent with Locara’s other capabilities.
device.speaker.select: true — separately required to enumerate or select non-default audio output devices (extending the W3C model). Probably defer to v2.

The threat model: even without select, an app with speaker: true can be a nuisance (random audio bursts, unwanted TTS). Mitigations:

Runtime audio output is gated through the Locara plugin, which can rate-limit, fade, and respect a global “audio-allowed” toggle.
Per the cool-down rules in spec/03-capabilities.md, adding device.speaker to an existing app on update triggers a 7-day cool-down plus re-consent.
The voice-to-voice and text-to-speech modality expansions should auto-grant device.speaker, so most apps never see this capability directly.
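
The update-time check behind the cool-down mitigation can be sketched as a capability diff. The 7-day figure comes from the rule cited above; treating every newly added capability the same way, and the names `reviewUpdate` and `UpdateDecision`, are assumptions.

```typescript
const COOL_DOWN_DAYS = 7; // from the cool-down rule cited above
const DAY_MS = 24 * 60 * 60 * 1000;

// Capabilities present in the update but absent from the installed version.
function addedCapabilities(installed: string[], update: string[]): string[] {
  return update.filter((cap) => installed.indexOf(cap) === -1);
}

interface UpdateDecision {
  needsReconsent: boolean;
  added: string[];
  activeAfter: Date | null; // when the new grants may take effect
}

// Assumption: any added capability triggers cool-down plus re-consent.
function reviewUpdate(installed: string[], update: string[], now: Date): UpdateDecision {
  const added = addedCapabilities(installed, update);
  if (added.length === 0) {
    return { needsReconsent: false, added, activeAfter: null };
  }
  return {
    needsReconsent: true,
    added,
    activeAfter: new Date(now.getTime() + COOL_DOWN_DAYS * DAY_MS),
  };
}
```

An app that already holds `device.microphone` and adds `device.speaker` in an update would be held for seven days and asked for consent again; an update with an unchanged capability list passes straight through.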

SDK surface

The existing SDK in spec/05-sdk.md follows a strict pattern: one module per modality, with both one-shot and streaming variants (e.g., transcribe.fromFile + transcribe.stream). The voice modality should follow the same pattern, not invent a new “agent runner” abstraction.

import { voice } from '@locara/sdk'

// Pipeline form (default expansion): explicit, debuggable
const session = voice.session({
  stt: { model: 'apple-speech-analyzer' },        // or 'whisper-large-v3-turbo'
  llm: { model: 'qwen2.5-3b-instruct-q4', system: '...' },
  tts: { model: 'apple-avspeech' },               // or 'kokoro-82m'
})

await session.start()                              // requests mic + speaker grants
for await (const ev of session.events()) {
  // ev: { type: 'partial-transcript' | 'final-transcript' | 'llm-token' |
  //         'audio-chunk' | 'turn-end' | 'barge-in' | 'error' }
}

// E2E form (when an omni model is selected): same shape, fewer events.
// Named differently so both declarations can coexist in one module.
const omniSession = voice.session({ omni: { model: 'moshi-7b-mlx-int4' } })

Reasons to prefer voice.session({...}) over agent.runVoice(...) or voice.converse(...):

Symmetry with transcribe.stream, llm.chatStream, db.transaction: Locara’s SDK is already module-shaped, not agent-shaped.
Inspectability: the manifest pins which models will be used; the SDK call should reflect those names so static analysis (spec/03-capabilities.md) can verify that referenced models are declared.
The same call shape works for pipeline and E2E — apps don’t have to rewrite their code to upgrade from “Apple defaults” to “Moshi-MLX” once it’s available.

A Float32Array async-iterable for raw audio is the right low-level primitive, but most apps shouldn’t see audio bytes. They should see semantic events (transcripts, model output, turn boundaries). Expose raw audio as session.rawInput() / session.rawOutput() for advanced cases (recording, custom UI visualization).
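
As an example of the "custom UI visualization" case, a level meter only needs a per-chunk RMS value. The sketch below works on a plain Float32Array, as a chunk from session.rawInput() is assumed to be; the function names and the -60 dB floor are illustrative choices.

```typescript
// Root-mean-square level of one audio chunk; 0..1 for samples in [-1, 1].
function rmsLevel(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Convert to dBFS for display; silence is clamped to a floor.
function toDbfs(rms: number, floorDb: number = -60): number {
  if (rms <= 0) return floorDb;
  return Math.max(floorDb, 20 * Math.log10(rms));
}
```

A UI would call `toDbfs(rmsLevel(chunk))` once per chunk and animate toward the result, leaving the semantic event stream untouched.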

6. Public local voice-agent demos to mirror

sherpa-onnx-go-macos (k2-fsa) — real-time voice assistant in Go using sherpa-onnx (STT + TTS + VAD), Whisper, Kokoro, Ollama. Pipeline pattern, fully local. Worth reading the audio plumbing.
Carlos Mbendera’s Sherpa-Onnx Swift integration (Medium, 2025) — shows the Swift binding pattern for Apple Silicon; useful for a Tauri Swift sidecar.
Granola voice mode: closed-source but architecturally a pipeline; their UX for partial transcripts + barge-in is widely copied.
MacWhisper — pipeline only, but the gold standard for “whisper-on-Mac” UX.
Argmax WhisperKit + SpeechAnalyzer comparison — Argmax’s blog explicitly benchmarks the two; the takeaway is that SpeechAnalyzer wins on fresh-install latency but WhisperKit wins on configurability.
Yap (CLI, MacRumors-tested): minimal example of using SpeechAnalyzer for batch transcription — useful as a reference for the SpeechAnalyzer Swift sidecar Locara would write.

No public demo yet uses Moshi end-to-end on Apple Silicon as a daily-driver voice agent — Moshi-MLX exists, but the ecosystem hasn’t shipped a polished consumer-facing app on top of it.

Specific Locara learnings

Ship voice-to-voice as a pipeline-by-default modality in v1, with the expansion picking Apple SpeechAnalyzer (STT) + Llama-3.2-3B-Instruct-q4 (LLM) + AVSpeechSynthesizer (TTS) on macOS 26+, falling back to whisper.cpp + Kokoro on older macOS. This gives a working zero-extra-download voice agent on every supported Mac, then lets opinionated apps override.
Do not block on Moshi/Qwen-Omni for v1. llama.cpp supports omni-model audio input but not audio output as of Q1 2026 (see PR ggml-org/llama.cpp#13784 + Issue #12673). The fastest route to “real” E2E is a separate locara-moshi crate using MLX-Rust or Candle, added in a later milestone. Keep the modality manifest stable so apps don’t have to rewrite when E2E lands.
Add device.speaker as a new capability in spec/03-capabilities.md, mapped to a Locara runtime gate (no native macOS TCC equivalent, but the W3C speaker-selection model gives a clean threat-model story). Both text-to-speech and voice-to-voice modality expansions auto-grant it. Adding device.speaker on update triggers the existing 7-day cool-down rule.
Use the existing voice.session({...}) shape, not an agent.runVoice wrapper. Same call signature must accept either {stt, llm, tts} (pipeline) or {omni} (E2E), so apps don’t fork code paths when upgrading. Emit semantic events (partial-transcript, audio-chunk, turn-end, barge-in) instead of raw audio frames; expose raw frames via rawInput() / rawOutput() for advanced cases.
Default STT should be SpeechAnalyzer on macOS 26+, whisper.cpp small.en + Core ML elsewhere. SpeechAnalyzer takes about 55% less time than whisper-large-v3-turbo with comparable quality (per Yap/MacRumors benchmarks), needs no app-managed model download, and Apple manages locale models. Keep whisper.cpp as the override for cross-version consistency, multilingual support beyond Apple’s locales, and verifiable open-source provenance.
Default TTS should be AVSpeechSynthesizer (with optional Personal Voice). Free, system-routed, accessible, no model download. Reserve Kokoro/Piper/Orpheus for apps that need controlled voice quality or voice cloning — gate those behind explicit model declarations in the manifest, since they cost RAM and disk.
For interruption / barge-in: bake it in. Pipeline implementations can do it via VAD ducking; E2E models (Moshi) handle it natively. Either way, apps should not have to wire it themselves — the voice.session should emit barge-in events and pause TTS automatically. This is the primary UX differentiator vs. naive pipelines.
Plan a locara-voice crate that owns the audio I/O, VAD, and turn-taking state machine — keeping it out of locara-llama and locara-whisper so the same state machine is reused whether the LLM is text-only or omni. The crate’s job is “raw audio in/out + turn boundaries”; the modality expansion wires it to whatever model triple (or single omni model) the manifest names.
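
The turn-taking state machine that locara-voice would own, including automatic barge-in handling, can be sketched as a pure transition function. It is written in TypeScript here for consistency with the SDK example, although the crate itself would be Rust; the state names, input names, and effect names are all assumptions.

```typescript
type TurnState = "idle" | "listening" | "thinking" | "speaking";

type TurnInput =
  | { kind: "speech-start" } // VAD detected user speech
  | { kind: "speech-end" }   // VAD confirmed end of utterance
  | { kind: "reply-audio" }  // first assistant audio chunk is ready
  | { kind: "reply-done" };  // assistant finished speaking

type TurnEffect = "barge-in" | "pause-tts" | "turn-end";

interface Step { state: TurnState; effects: TurnEffect[] }

// One transition: current state plus input gives next state and side effects.
function step(state: TurnState, input: TurnInput): Step {
  switch (input.kind) {
    case "speech-start":
      // User speech over assistant audio is a barge-in: stop talking, listen.
      if (state === "speaking") {
        return { state: "listening", effects: ["barge-in", "pause-tts"] };
      }
      return { state: "listening", effects: [] };
    case "speech-end":
      return state === "listening" ? { state: "thinking", effects: [] } : { state, effects: [] };
    case "reply-audio":
      return state === "thinking" ? { state: "speaking", effects: [] } : { state, effects: [] };
    case "reply-done":
      return state === "speaking" ? { state: "idle", effects: ["turn-end"] } : { state, effects: [] };
  }
}
```

Because the function knows nothing about which model produced the reply, the same transitions would apply to a pipeline triple and to a single omni model; only the source of the `reply-audio` input changes.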