voice-to-voice (full duplex)
HF group: Audio · Status: 🟡 partial (Moshi shipping, quality items in BACKLOG)
What it is
Live mic in, audio + text out, full-duplex, sub-second latency. Implemented either as a true end-to-end audio language model (Moshi-class) or as an STT→LLM→TTS pipeline.
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Kyutai Moshi (moshiko / moshika) | 7 B | 2024-09 | Apache-2.0 | First open full-duplex audio LM | ~200 ms first-audio latency. Hardware-bound on M1/M2. |
| NVIDIA PersonaPlex 7B | 7 B | 2024-12 | NVIDIA Open Model | Persona-conditioned | Skeleton crate, integration TBD. |
| Qwen3.5-Omni | ~30 B / 3 B active | 2026-03 | Apache-2.0 | 36-language speech-out, native turn-taking | Subsumes voice-to-voice as part of any-to-any. |
Infrastructure required
Inference
- 🟡 End-to-end backend via `locara-moshi` (subprocess pattern, drives `moshi_mlx` Python helper at 80 ms frames @ 24 kHz).
- ✅ Pipeline backend via `locara-voice-pipeline` (composes Whisper STT + Llama + macOS `say`).
- ⏳ `locara-personaplex` skeleton (NVIDIA backend, not wired).
- ✅ Backend dispatcher (`MultiVoiceBackend`) lets apps pick the variant; routes by `moshi-` / `personaplex-` / `voice-pipeline-` model-id prefix.
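The prefix routing described for the dispatcher can be illustrated with a minimal sketch. This is not the `MultiVoiceBackend` source: the table name `PREFIX_TO_BACKEND` and the function `pick_backend` are hypothetical, and the mapping from prefix to crate name is an assumption based on the crates listed above.

```python
# Hypothetical sketch of prefix-based routing; names are illustrative only.
PREFIX_TO_BACKEND = {
    "moshi-": "locara-moshi",
    "personaplex-": "locara-personaplex",
    "voice-pipeline-": "locara-voice-pipeline",
}


def pick_backend(model_id: str) -> str:
    """Return the backend crate assumed to serve `model_id`, chosen by prefix."""
    for prefix, backend in PREFIX_TO_BACKEND.items():
        if model_id.startswith(prefix):
            return backend
    # Unknown prefixes are rejected rather than silently routed to a default.
    raise ValueError(f"no voice backend registered for model id {model_id!r}")
```

Under these assumptions, `pick_backend("moshi-7b@kyutai-moshiko")` resolves to the Moshi backend, while an id with an unregistered prefix raises instead of falling through.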
Input
- ✅ Mic capture via `locara-microphone` (cross-platform Float32 PCM frame stream).
- ✅ System-audio capture via `locara-screencapture-audio` (macOS ScreenCaptureKit).
- ✅ Linear resample to 24 kHz in `locara-moshi` (note: naive; BACKLOG item to switch to native 24 kHz capture or `rubato` polyphase).
- ❌ VAD for clean end-of-utterance / barge-in detection (Silero VAD pending; see `voice-activity-detection`).
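To make the "naive" caveat concrete, here is a rough sketch of linear-interpolation resampling to 24 kHz. It is an illustration, not the `locara-moshi` code: `resample_linear` and `FRAME_SAMPLES` are hypothetical names. Because it applies no low-pass filter before downsampling, content above the new Nyquist frequency aliases, which is the usual reason to prefer a polyphase resampler.

```python
TARGET_RATE = 24_000
# One 80 ms frame at 24 kHz, the frame size the helper is described as consuming.
FRAME_SAMPLES = TARGET_RATE * 80 // 1000  # 1920 samples


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Resample by linear interpolation between neighbouring input samples.

    Naive on purpose: there is no anti-aliasing filter.
    """
    if not samples:
        return []
    if src_rate == dst_rate:
        return [float(s) for s in samples]
    n_out = len(samples) * dst_rate // src_rate
    step = src_rate / dst_rate  # input samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        if lo >= last:
            # Past the final input sample: hold its value.
            out.append(float(samples[last]))
            continue
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[lo + 1] * frac)
    return out
```

With a 48 kHz source, two input samples map to one output sample, so 3840 captured samples yield exactly one 1920-sample frame.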
Output
- 🟡 Audio playback queue with 300 ms pre-roll jitter buffer + diagnostic underflow warnings (`apps/voice/src/audio-playback.ts`). Currently in-app, not factored to SDK.
- ❌ Time-stretching playback (AudioWorklet WSOLA) for hardware-edge cases: BACKLOG.
- ✅ AudioContext sample-rate matched to source (24 kHz native, skips Web Audio's per-source-node resample).
- ✅ Streaming text events alongside audio (`assistant_partial`, `turn_end`).
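The pre-roll jitter buffer idea can be modelled in a few lines. This is a simplified sketch, not a port of `apps/voice/src/audio-playback.ts`: the class `PrerollQueue` and its methods are hypothetical, and the choice to re-buffer a full pre-roll after an underflow is an assumption made for illustration.

```python
from collections import deque


class PrerollQueue:
    """Holds audio until `preroll_ms` is buffered, then serves the output device."""

    def __init__(self, sample_rate=24_000, preroll_ms=300):
        self.preroll_samples = sample_rate * preroll_ms // 1000
        self.chunks = deque()
        self.buffered = 0      # samples currently queued
        self.started = False   # True once the pre-roll threshold has been reached
        self.underflows = 0    # diagnostic counter

    def push(self, chunk):
        self.chunks.append(list(chunk))
        self.buffered += len(chunk)
        if not self.started and self.buffered >= self.preroll_samples:
            self.started = True

    def pull(self, n):
        """Return exactly n samples, padding with silence when data runs out."""
        if not self.started:
            return [0.0] * n
        out = []
        while len(out) < n and self.chunks:
            head = self.chunks[0]
            take = n - len(out)
            out.extend(head[:take])
            if take >= len(head):
                self.chunks.popleft()
            else:
                self.chunks[0] = head[take:]
        self.buffered -= len(out)
        if len(out) < n:
            # Underflow: count it, pad with silence, and wait for a new pre-roll.
            self.underflows += 1
            self.started = False
            out.extend([0.0] * (n - len(out)))
        return out
```

At 24 kHz the 300 ms threshold is 7200 samples, so playback in this model begins once the fourth 80 ms frame (1920 samples each) has arrived.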
Storage
- ✅ Weights via `locara-models::Cache` (Moshi weights at `~/Library/Caches/Locara/models/moshi/moshiko-mlx-q4`).
- 🟡 Per-session state in `MoshiBackend::sessions` HashMap (residual buffer, source rate). Helper subprocess holds model state in Python heap.
- ❌ Helper persistence between sessions (currently we spawn fresh, paying a 3 s warmup on every click; BACKLOG).
Interaction (IPC + SDK)
- ✅ IPC: `voice.session_start`, `voice.session_push`, `voice.session_events` (streaming `Channel<VoiceWireEvent>`), `voice.session_stop`.
- ✅ SDK: `voice.session({ omni: { model } })` returns a session object with `start()`, `push()`, `stop()`, `events()`. Pipeline form (`voice.session({ stt, llm, tts })`) shares the same surface.
- ✅ Wire format: f32 PCM samples as JSON `Vec<f32>` (small chunks; profiled, not the bottleneck).
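As a rough illustration of the JSON wire format, the sketch below serialises one chunk of samples as a JSON array of f32-representable numbers. The payload field names (`session_id`, `samples`) and the function `encode_push` are hypothetical; the actual `voice.session_push` message shape is not specified here.

```python
import json
import struct


def encode_push(session_id, samples):
    """Serialise a chunk as JSON, first narrowing each value to f32 precision."""
    n = len(samples)
    # Pack and unpack as little-endian f32 so the JSON carries only values
    # that a Vec<f32> on the receiving side can hold exactly.
    narrowed = struct.unpack(f"<{n}f", struct.pack(f"<{n}f", *samples))
    return json.dumps({"session_id": session_id, "samples": list(narrowed)})


def raw_f32_bytes(sample_count):
    """Size of the same chunk as packed binary f32, for comparison."""
    return 4 * sample_count
```

An 80 ms frame at 24 kHz is 1920 samples, or 7680 bytes as packed f32; the JSON text is larger than that, which is the overhead the profiling note above found acceptable for small chunks.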
Capabilities (manifest)
- ✅ `capabilities.device.microphone: true`.
- ❌ `capabilities.device.speaker` cool-down semantics not yet enforced (BACKLOG).
- ✅ `capabilities.models[]` includes the omni model id; accepts the new `name@variant` form (e.g. `moshi-7b@kyutai-moshiko`) alongside `name@sha256:HASH`.
- ✅ Modality declaration: `"modalities": ["voice-to-voice"]` in `locara.json`.
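The two model-reference forms accepted in `capabilities.models[]` can be told apart with a small parser. This is a sketch under assumptions: `parse_model_ref` and the returned dictionary keys are hypothetical, and the real manifest validator may apply stricter rules (for example on hash length).

```python
def parse_model_ref(ref):
    """Split `name@variant` or `name@sha256:HASH` into its parts."""
    name, sep, selector = ref.partition("@")
    if not sep or not name or not selector:
        raise ValueError(f"expected name@variant or name@sha256:HASH, got {ref!r}")
    if selector.startswith("sha256:"):
        digest = selector[len("sha256:"):]
        if not digest:
            raise ValueError(f"empty sha256 digest in {ref!r}")
        return {"name": name, "kind": "sha256", "value": digest}
    return {"name": name, "kind": "variant", "value": selector}
```

For the example id from the list above, `parse_model_ref("moshi-7b@kyutai-moshiko")` yields the name `moshi-7b` with a `variant` selector of `kyutai-moshiko`.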
Gaps
- Time-stretching playback worklet is the proper fix for hardware-edge cases where the model runs slower than real time (current measurement: p50 86 ms inter-frame on M5 Pro against an 80 ms target, i.e. 6 ms over budget).
- Helper persistence between sessions (removes the 3 s warmup paid on every session start).
- Weights auto-download flow (currently manual `hf download`).
- Silero VAD wiring for barge-in and clean end-of-utterance.
- All of the above are tracked in BACKLOG.
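The cost of the 6 ms overrun can be worked out from the figures already quoted. The sketch below assumes, as a simplification, that every frame takes exactly the p50 time; real inter-frame times vary, so treat the result as an order-of-magnitude estimate. The function names are hypothetical.

```python
import math


def time_to_underflow_s(frame_ms, measured_ms, preroll_ms):
    """Wall-clock seconds until a pre-roll buffer drains, given a steady deficit.

    Each frame delivers `frame_ms` of audio but takes `measured_ms` to arrive,
    so the buffer shrinks by the difference once per frame.
    """
    deficit_ms = measured_ms - frame_ms
    if deficit_ms <= 0:
        return math.inf  # generation keeps up; the buffer never drains
    frames = preroll_ms / deficit_ms
    return frames * measured_ms / 1000


def stretch_factor(frame_ms, measured_ms):
    """Factor by which playback duration must grow to match generation speed."""
    return measured_ms / frame_ms
```

With an 80 ms frame, 86 ms measured, and the 300 ms pre-roll, the buffer drains after 50 frames, or about 4.3 s of wall-clock time. Stretching playback by a factor of 1.075 (7.5 % longer) would keep the queue level under the same assumption.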
See also
- `speech-to-text`: pipeline component
- `text-to-speech`: pipeline component
- `audio-text-to-text`: overlapping (Q&A about audio without speech-out)
- `voice-activity-detection`: needed for clean barge-in
- `any-to-any`: superset, Qwen3.5-Omni
- Crates: `locara-moshi`, `locara-voice-pipeline`, `locara-personaplex`, `locara-microphone`
- Notes: `notes/voice-to-voice-slms.md`
- Index: `../modalities-and-models-survey.md`