Locara

voice-to-voice (full duplex)

HF group: Audio · Status: 🟑 partial (Moshi shipping, quality items in BACKLOG)

What it is

Live mic in, audio + text out, full-duplex, sub-second latency. Either as a true end-to-end audio language model (Moshi-class) or as a STT→LLM→TTS pipeline.

Open-weight models

ModelParamsReleasedLicenseQualityNotes
Kyutai Moshi (moshiko / moshika)7 B2024-09Apache-2.0First open full-duplex audio LM~200 ms first-audio latency. Hardware-bound on M1/M2.
NVIDIA PersonaPlex 7B7 B2024-12NVIDIA Open ModelPersona-conditionedSkeleton crate, integration TBD.
Qwen3.5-Omni~30 B / 3 B active2026-03Apache-2.036-language speech-out, native turn-takingSubsumes voice-to-voice as part of any-to-any.

Infrastructure required

Inference

  • 🟑 End-to-end backend via locara-moshi (subprocess pattern, drives moshi_mlx Python helper at 80 ms frames @ 24 kHz).
  • βœ… Pipeline backend via locara-voice-pipeline (composes Whisper STT + Llama + macOS say).
  • ⏳ locara-personaplex skeleton (NVIDIA backend, not wired).
  • βœ… Backend dispatcher (MultiVoiceBackend) lets apps pick the variant; routes by moshi- / personaplex- / voice-pipeline- model-id prefix.

Input

  • βœ… Mic capture via locara-microphone (cross-platform Float32 PCM frame stream).
  • βœ… System-audio capture via locara-screencapture-audio (macOS ScreenCaptureKit).
  • βœ… Linear resample to 24 kHz in locara-moshi (note: naive β€” BACKLOG item to switch to native 24 kHz capture or rubato polyphase).
  • ❌ VAD for clean end-of-utterance / barge-in detection (Silero VAD pending β€” see voice-activity-detection).

Output

  • 🟑 Audio playback queue with 300 ms pre-roll jitter buffer + diagnostic underflow warnings (apps/voice/src/audio-playback.ts). Currently in-app, not factored to SDK.
  • ❌ Time-stretching playback (AudioWorklet WSOLA) for hardware-edge cases β€” BACKLOG.
  • βœ… AudioContext sample-rate matched to source (24 kHz native, skips Web Audio’s per-source-node resample).
  • βœ… Streaming text events alongside audio (assistant_partial, turn_end).

Storage

  • βœ… Weights via locara-models::Cache (Moshi weights at ~/Library/Caches/Locara/models/moshi/moshiko-mlx-q4).
  • 🟑 Per-session state in MoshiBackend::sessions HashMap (residual buffer, source rate). Helper subprocess holds model state in Python heap.
  • ❌ Helper persistence between sessions (currently we spawn fresh β€” pays 3 s warmup on every click; BACKLOG).

Interaction (IPC + SDK)

  • βœ… IPC: voice.session_start, voice.session_push, voice.session_events (streaming Channel<VoiceWireEvent>), voice.session_stop.
  • βœ… SDK: voice.session({ omni: { model } }) returns a session object with start(), push(), stop(), events(). Pipeline form (voice.session({ stt, llm, tts })) shares the same surface.
  • βœ… Wire format: f32 PCM samples as JSON Vec<f32> (small chunks; profiled β€” not the bottleneck).

Capabilities (manifest)

  • βœ… capabilities.device.microphone: true.
  • ❌ capabilities.device.speaker cool-down semantics not yet enforced (BACKLOG).
  • βœ… capabilities.models[] includes the omni model id β€” accepts the new name@variant form (e.g. moshi-7b@kyutai-moshiko) alongside name@sha256:HASH.
  • βœ… Modality declaration: "modalities": ["voice-to-voice"] in locara.json.

Gaps

  • Time-stretching playback worklet is the proper fix for hardware-edge cases when the model is slower than real-time (current measurement: p50 86 ms inter-frame on M5 Pro, target 80 ms β€” 6 ms over budget).
  • Helper persistence between sessions (cuts 3 s warmup every-time hit).
  • Weights auto-download flow (currently manual hf download).
  • Silero VAD wiring for barge-in and clean end-of-utterance.
  • All in BACKLOG.

See also