`voice-to-voice` (full duplex)

HF group: Audio · Status: 🟡 partial (Moshi shipping, quality items in BACKLOG)

What it is

Live mic in, audio + text out, full-duplex, sub-second latency. Served either by a true end-to-end audio language model (Moshi-class) or by an STT→LLM→TTS pipeline.

Open-weight models

Model	Params	Released	License	Quality	Notes
Kyutai Moshi (moshiko / moshika)	7 B	2024-09	CC-BY-4.0 (weights)	First open full-duplex audio LM	~200 ms first-audio latency. Hardware-bound on M1/M2.
NVIDIA PersonaPlex 7B	7 B	2024-12	NVIDIA Open Model	Persona-conditioned	Skeleton crate, integration TBD.
Qwen3.5-Omni	~30 B / 3 B active	2026-03	Apache-2.0	36-language speech-out, native turn-taking	Subsumes voice-to-voice as part of `any-to-any`.

Infrastructure required

Inference

🟡 End-to-end backend via locara-moshi (subprocess pattern; drives the moshi_mlx Python helper with 80 ms frames at 24 kHz).
✅ Pipeline backend via locara-voice-pipeline (composes Whisper STT + Llama + macOS say).
⏳ locara-personaplex skeleton (NVIDIA backend, not wired).
✅ Backend dispatcher (MultiVoiceBackend) lets apps pick the variant; routes by moshi- / personaplex- / voice-pipeline- model-id prefix.
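
The prefix routing described for MultiVoiceBackend can be sketched as below. This is a minimal illustration, not the crate's code: `BackendKind` and `route_by_prefix` are hypothetical names, and only the three prefixes come from the list above.

```rust
/// Which backend a model id should be routed to (hypothetical enum).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum BackendKind {
    Moshi,
    PersonaPlex,
    VoicePipeline,
}

/// Route by model-id prefix. Returns None when no backend claims the id.
pub fn route_by_prefix(model_id: &str) -> Option<BackendKind> {
    // The prefixes are the ones named in the dispatcher description.
    const ROUTES: [(&str, BackendKind); 3] = [
        ("moshi-", BackendKind::Moshi),
        ("personaplex-", BackendKind::PersonaPlex),
        ("voice-pipeline-", BackendKind::VoicePipeline),
    ];
    ROUTES
        .iter()
        .find(|(prefix, _)| model_id.starts_with(*prefix))
        .map(|(_, kind)| *kind)
}
```

Under this sketch, `route_by_prefix("moshi-7b@kyutai-moshiko")` yields `Some(BackendKind::Moshi)`, and an id with none of the three prefixes yields `None`, so the caller decides how to report an unknown model.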

Input

✅ Mic capture via locara-microphone (cross-platform Float32 PCM frame stream).
✅ System-audio capture via locara-screencapture-audio (macOS ScreenCaptureKit).
✅ Linear resample to 24 kHz in locara-moshi (note: naive — BACKLOG item to switch to native 24 kHz capture or rubato polyphase).
❌ VAD for clean end-of-utterance / barge-in detection (Silero VAD pending — see voice-activity-detection).
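
For reference, a linear resampler of the kind the BACKLOG note calls naive might look like the sketch below. It is an assumption about the general technique, not a copy of locara-moshi: `resample_linear` is a hypothetical name, and the function is stateless, so it ignores continuity across chunk boundaries. It applies no low-pass filter before downsampling, which is why a polyphase resampler or native 24 kHz capture is the better fix.

```rust
/// Linear-interpolation resampler (illustrative, stateless, no anti-alias filter).
pub fn resample_linear(input: &[f32], src_rate: u32, dst_rate: u32) -> Vec<f32> {
    if input.is_empty() || src_rate == 0 || dst_rate == 0 {
        return Vec::new();
    }
    if src_rate == dst_rate {
        return input.to_vec();
    }
    // Output length scales with the rate ratio (rounded down).
    let out_len = (input.len() as u64 * dst_rate as u64 / src_rate as u64) as usize;
    // Distance in input samples between consecutive output samples.
    let step = src_rate as f64 / dst_rate as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let idx = (pos.floor() as usize).min(last);
            let next = (idx + 1).min(last); // clamp at the final sample
            let frac = (pos - idx as f64) as f32;
            input[idx] + (input[next] - input[idx]) * frac
        })
        .collect()
}
```

At 24 kHz an 80 ms frame is 24 000 × 0.080 = 1920 samples, so a 48 kHz capture needs 3840 input samples per model frame.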

Output

🟡 Audio playback queue with 300 ms pre-roll jitter buffer + diagnostic underflow warnings (apps/voice/src/audio-playback.ts). Currently in-app, not factored to SDK.
❌ Time-stretching playback (AudioWorklet WSOLA) for hardware-edge cases — BACKLOG.
✅ AudioContext sample-rate matched to source (24 kHz native, skips Web Audio’s per-source-node resample).
✅ Streaming text events alongside audio (assistant_partial, turn_end).
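
The pre-roll behaviour of the playback queue can be illustrated with the sketch below. The in-app implementation is TypeScript (apps/voice/src/audio-playback.ts); this Rust version only mirrors the idea, and `PlaybackQueue`, its methods, and the choice to re-buffer after an underflow are assumptions rather than the app's actual logic.

```rust
use std::collections::VecDeque;

/// Playback queue that waits for a pre-roll before starting (hypothetical type).
pub struct PlaybackQueue {
    samples: VecDeque<f32>,
    preroll_samples: usize,
    playing: bool,
    underflows: u32,
}

impl PlaybackQueue {
    pub fn new(sample_rate: u32, preroll_ms: u32) -> Self {
        // 300 ms at 24 kHz is 7200 samples.
        let preroll_samples = (sample_rate as u64 * preroll_ms as u64 / 1000) as usize;
        Self { samples: VecDeque::new(), preroll_samples, playing: false, underflows: 0 }
    }

    /// Enqueue decoded audio; playback starts once the pre-roll is filled.
    pub fn push(&mut self, chunk: &[f32]) {
        self.samples.extend(chunk.iter().copied());
        if !self.playing && self.samples.len() >= self.preroll_samples {
            self.playing = true;
        }
    }

    /// Fill `out` for the audio callback; silence during pre-roll or underflow.
    pub fn pull(&mut self, out: &mut [f32]) {
        if !self.playing {
            out.fill(0.0);
            return;
        }
        let mut ran_dry = false;
        for slot in out.iter_mut() {
            match self.samples.pop_front() {
                Some(s) => *slot = s,
                None => {
                    *slot = 0.0;
                    ran_dry = true;
                }
            }
        }
        if ran_dry {
            self.underflows += 1; // diagnostic counter
            self.playing = false; // assumption: re-buffer before resuming
        }
    }

    pub fn is_playing(&self) -> bool {
        self.playing
    }

    pub fn underflows(&self) -> u32 {
        self.underflows
    }
}
```

With 1920-sample frames, playback in this sketch starts on the fourth frame (7680 samples queued, above the 7200-sample pre-roll).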

Storage

✅ Weights via locara-models::Cache (Moshi weights at ~/Library/Caches/Locara/models/moshi/moshiko-mlx-q4).
🟡 Per-session state in MoshiBackend::sessions HashMap (residual buffer, source rate). Helper subprocess holds model state in Python heap.
❌ Helper persistence between sessions (a fresh helper is currently spawned per session, paying a 3 s warmup on every click; BACKLOG).
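
One plausible reading of the per-session residual buffer is sketched below: pushed audio is accumulated per session, whole 80 ms frames are handed on, and the remainder waits for the next push. `Sessions`, `SessionState`, and the `u64` session id are hypothetical; the sketch assumes samples have already been resampled to 24 kHz and does not model the helper subprocess.

```rust
use std::collections::HashMap;

/// 80 ms at 24 kHz.
pub const FRAME_SAMPLES: usize = 1920;

/// Per-session state (hypothetical layout): leftover samples and capture rate.
pub struct SessionState {
    residual: Vec<f32>,
    source_rate: u32,
}

#[derive(Default)]
pub struct Sessions {
    by_id: HashMap<u64, SessionState>,
}

impl Sessions {
    pub fn start(&mut self, id: u64, source_rate: u32) {
        self.by_id.insert(id, SessionState { residual: Vec::new(), source_rate });
    }

    /// Append 24 kHz samples and return every complete frame now available.
    /// Returns None for an unknown session id.
    pub fn push(&mut self, id: u64, samples_24k: &[f32]) -> Option<Vec<Vec<f32>>> {
        let state = self.by_id.get_mut(&id)?;
        state.residual.extend_from_slice(samples_24k);
        let full = state.residual.len() / FRAME_SAMPLES * FRAME_SAMPLES;
        let frames: Vec<Vec<f32>> = state.residual[..full]
            .chunks_exact(FRAME_SAMPLES)
            .map(|frame| frame.to_vec())
            .collect();
        state.residual.drain(..full); // keep only the partial tail
        Some(frames)
    }

    pub fn residual_len(&self, id: u64) -> Option<usize> {
        self.by_id.get(&id).map(|s| s.residual.len())
    }

    pub fn source_rate(&self, id: u64) -> Option<u32> {
        self.by_id.get(&id).map(|s| s.source_rate)
    }

    /// Drop the session; returns whether it existed.
    pub fn stop(&mut self, id: u64) -> bool {
        self.by_id.remove(&id).is_some()
    }
}
```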

Interaction (IPC + SDK)

✅ IPC: voice.session_start, voice.session_push, voice.session_events (streaming Channel<VoiceWireEvent>), voice.session_stop.
✅ SDK: voice.session({ omni: { model } }) returns a session object with start(), push(), stop(), events(). Pipeline form (voice.session({ stt, llm, tts })) shares the same surface.
✅ Wire format: f32 PCM samples as JSON Vec<f32> (small chunks; profiled — not the bottleneck).
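
To make the wire format concrete, the sketch below encodes one chunk of f32 samples as a JSON array using only the standard library. `encode_chunk` is a hypothetical helper, not the IPC layer's serializer, and writing non-finite samples as `null` is an assumption chosen to keep the output valid JSON.

```rust
/// Encode PCM samples as a JSON array of numbers (illustrative only).
pub fn encode_chunk(samples: &[f32]) -> String {
    let mut out = String::with_capacity(samples.len() * 8 + 2);
    out.push('[');
    for (i, sample) in samples.iter().enumerate() {
        if i > 0 {
            out.push(',');
        }
        if sample.is_finite() {
            // Rust's Display for f32 prints plain decimals, which JSON accepts.
            out.push_str(&sample.to_string());
        } else {
            // NaN and infinities have no JSON number form.
            out.push_str("null");
        }
    }
    out.push(']');
    out
}
```

One 80 ms frame at 24 kHz is 1920 array elements, which is the scale behind the "small chunks" remark above.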

Capabilities (manifest)

✅ capabilities.device.microphone: true.
❌ capabilities.device.speaker cool-down semantics not yet enforced (BACKLOG).
✅ capabilities.models[] includes the omni model id — accepts the new name@variant form (e.g. moshi-7b@kyutai-moshiko) alongside name@sha256:HASH.
✅ Modality declaration: "modalities": ["voice-to-voice"] in locara.json.
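
The two model-id forms accepted in capabilities.models[] could be told apart as in the sketch below. `ModelRef` and `parse_model_ref` are hypothetical names, and the requirement that HASH be exactly 64 hex characters is an assumption based on SHA-256 digest length, not something the manifest schema is known to enforce.

```rust
/// A parsed capabilities.models[] entry (hypothetical representation).
#[derive(Debug, PartialEq, Eq)]
pub enum ModelRef<'a> {
    /// name@variant, e.g. moshi-7b@kyutai-moshiko
    Variant { name: &'a str, variant: &'a str },
    /// name@sha256:HASH
    Sha256 { name: &'a str, hash: &'a str },
}

/// Parse either accepted form; returns None for anything else.
pub fn parse_model_ref(s: &str) -> Option<ModelRef<'_>> {
    let (name, rest) = s.split_once('@')?;
    if name.is_empty() || rest.is_empty() {
        return None;
    }
    match rest.strip_prefix("sha256:") {
        // Assumption: a SHA-256 digest is written as 64 hex characters.
        Some(hash) if hash.len() == 64 && hash.bytes().all(|b| b.is_ascii_hexdigit()) => {
            Some(ModelRef::Sha256 { name, hash })
        }
        Some(_) => None,
        None => Some(ModelRef::Variant { name, variant: rest }),
    }
}
```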

Gaps

Time-stretching playback worklet is the proper fix for hardware-edge cases when the model is slower than real-time (current measurement: p50 86 ms inter-frame on M5 Pro, target 80 ms — 6 ms over budget).
Helper persistence between sessions (removes the 3 s warmup currently paid on every session start).
Weights auto-download flow (currently manual hf download).
Silero VAD wiring for barge-in and clean end-of-utterance.
All in BACKLOG.
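
The budget numbers above imply a simple back-of-envelope model, sketched here under the assumption that every frame arrives at the p50 interval and that playback starts with a full 300 ms pre-roll. The function names are hypothetical and real inter-frame times vary, so treat the result as an estimate only.

```rust
/// Frames until the pre-roll is drained, or None if the model keeps up.
pub fn frames_until_underflow(preroll_ms: f64, frame_ms: f64, inter_frame_ms: f64) -> Option<f64> {
    // Each frame carries `frame_ms` of audio but takes `inter_frame_ms` to arrive.
    let deficit_ms = inter_frame_ms - frame_ms;
    if deficit_ms <= 0.0 {
        None
    } else {
        Some(preroll_ms / deficit_ms)
    }
}

/// Factor by which playback duration must be stretched to match arrival rate.
pub fn stretch_factor(frame_ms: f64, inter_frame_ms: f64) -> f64 {
    inter_frame_ms / frame_ms
}
```

With 86 ms arrivals against 80 ms frames, the 6 ms deficit drains 300 ms in 50 frames, about 50 × 86 ms = 4.3 s of wall-clock time. Matching the arrival rate would need audio stretched to 86 / 80 = 1.075 times its original duration.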

See also

speech-to-text — pipeline component
text-to-speech — pipeline component
audio-text-to-text — overlapping (Q&A about audio without speech-out)
voice-activity-detection — needed for clean barge-in
any-to-any — superset, Qwen3.5-Omni
Crates: locara-moshi, locara-voice-pipeline, locara-personaplex, locara-microphone
Notes: notes/voice-to-voice-slms.md
Index: ../modalities-and-models-survey.md