Locara

voice-activity-detection

HF group: Audio · Status: ❌ not built · Tier 1 (high leverage)

What it is

Audio → time spans where speech occurs. VAD is the first stage of any robust speech pipeline: it segments mic input into speech vs. silence so that downstream stages (ASR, omni LMs) process only useful audio and end-of-utterance detection is clean.

Open-weight models

| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Silero VAD | ~1 M | 2021 | MIT | The default everywhere | Runs in milliseconds, MIT, no telemetry, trained on 6000+ languages. |
| Pyannote VAD 3.1 | ~6 M | 2024 | MIT | Stronger on noisy / overlapped speech | Heavier; benefits from GPU. |
| WebRTC VAD | n/a | 2011 | BSD-3 | Classic; fast; lower quality | Pre-deep-learning. |

Infrastructure required

Inference

  • ❌ Lightweight encoder inference (Silero is ~1 M params, so it could run via ONNX, TensorFlow Lite, or even a small hand-rolled kernel).
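
Whichever runtime is chosen, the inference loop has the same shape: cut the PCM stream into fixed-size frames and ask a scorer for a speech probability per frame. The sketch below assumes 16 kHz mono audio and 512-sample frames (the window size Silero's published models are commonly run with; worth verifying against the model actually shipped). The names splitIntoFrames, scoreFrames, and rmsEnergyScorer are hypothetical, and the RMS-energy scorer is only a stand-in for a real model call, not a neural VAD.

```typescript
// Sketch: frame 16 kHz mono PCM and score each frame.
// All names here are illustrative; none come from Locara or Silero.

const FRAME_SAMPLES = 512; // 32 ms at 16 kHz (assumed window size)

// A scorer returns a speech probability in [0, 1] for one frame.
type FrameScorer = (frame: Float32Array) => number;

function splitIntoFrames(pcm: Float32Array, frameSamples: number): Float32Array[] {
  const out: Float32Array[] = [];
  // The trailing partial frame is dropped; a streaming caller would buffer it.
  for (let start = 0; start + frameSamples <= pcm.length; start += frameSamples) {
    out.push(pcm.subarray(start, start + frameSamples));
  }
  return out;
}

// Stand-in scorer: maps RMS energy to a pseudo-probability.
// A real build would replace this with the model's forward pass.
function rmsEnergyScorer(frame: Float32Array): number {
  const sumSquares = frame.reduce((acc, s) => acc + s * s, 0);
  const rms = Math.sqrt(sumSquares / Math.max(1, frame.length));
  // 0.1 is a hypothetical tuning constant: RMS at or above it maps to 1.
  return Math.min(1, rms / 0.1);
}

function scoreFrames(pcm: Float32Array, scorer: FrameScorer): number[] {
  return splitIntoFrames(pcm, FRAME_SAMPLES).map((frame) => scorer(frame));
}
```

Keeping the scorer behind a function type means the ONNX, TensorFlow Lite, or hand-rolled option can be swapped in without touching the framing code.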

Input

  • ✅ Audio capture (already shipped via locara-microphone).

Output

  • ❌ Speech-segment timestamps (start/end ms).
  • Streaming variant emits events as speech starts/stops.
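
As a rough illustration of how per-frame probabilities could become start/end timestamps, the sketch below applies two thresholds (hysteresis) plus a minimum-silence and minimum-speech duration. The type names, option names, and threshold values are assumptions chosen for the example, not Locara's or Silero's actual defaults.

```typescript
// Sketch: turn per-frame speech probabilities into speech segments (ms).
// Names and defaults are hypothetical.

interface SpeechSegment {
  startMs: number;
  endMs: number;
}

interface SegmenterOptions {
  frameMs: number;      // duration of one scored frame
  onThreshold: number;  // probability at or above which a segment opens
  offThreshold: number; // probability below which a frame counts as silence
  minSilenceMs: number; // trailing silence required to close a segment
  minSpeechMs: number;  // shorter segments are discarded
}

function probsToSegments(probs: number[], opts: SegmenterOptions): SpeechSegment[] {
  const segments: SpeechSegment[] = [];
  let startMs = -1;        // -1 means no segment is open
  let silenceStartMs = -1; // -1 means the open segment has no trailing silence

  probs.forEach((p, i) => {
    const t = i * opts.frameMs;
    if (startMs < 0) {
      // Closed: only a confident frame opens a segment.
      if (p >= opts.onThreshold) {
        startMs = t;
        silenceStartMs = -1;
      }
      return;
    }
    if (p >= opts.offThreshold) {
      // Still speech (hysteresis keeps borderline frames inside the segment).
      silenceStartMs = -1;
      return;
    }
    if (silenceStartMs < 0) silenceStartMs = t;
    if (t + opts.frameMs - silenceStartMs >= opts.minSilenceMs) {
      // Enough silence: close the segment where the silence began.
      if (silenceStartMs - startMs >= opts.minSpeechMs) {
        segments.push({ startMs, endMs: silenceStartMs });
      }
      startMs = -1;
      silenceStartMs = -1;
    }
  });

  // Flush a segment that is still open when the audio ends.
  if (startMs >= 0) {
    const endMs = silenceStartMs >= 0 ? silenceStartMs : probs.length * opts.frameMs;
    if (endMs - startMs >= opts.minSpeechMs) segments.push({ startMs, endMs });
  }
  return segments;
}
```

The lower off-threshold keeps a brief dip in confidence from splitting one utterance into two, and the minimum-silence window serves the same purpose for short pauses.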

Storage

  • ❌ Weights cache (tiny: ~1 M params).
  • No per-session state needs to be persisted (purely streaming).

Interaction (IPC + SDK)

  • audio.vad({ audio }) IPC, OR wire into transcribe.stream_* for VAD-aware streaming.
  • For voice-to-voice: integrated as a pre-processor on the mic stream.
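
One way the mic-stream pre-processor could look is a small stateful object that is fed one frame probability at a time and returns speech-start / speech-stop events. This is a sketch only: StreamingVad, VadEvent, and the hangover parameter are hypothetical names, and nothing here describes the actual audio.vad or transcribe.stream_* wire format.

```typescript
// Sketch: streaming pre-processor that emits events as speech starts/stops.
// All names are illustrative, not part of any existing Locara API.

type VadEvent =
  | { kind: "speech_start"; atMs: number }
  | { kind: "speech_end"; atMs: number };

class StreamingVad {
  private readonly frameMs: number;
  private readonly threshold: number;
  private readonly hangoverMs: number;
  private speaking = false;
  private silentMs = 0; // silence accumulated since the last speech frame
  private clockMs = 0;  // end time of the most recently pushed frame

  constructor(frameMs: number, threshold: number, hangoverMs: number) {
    this.frameMs = frameMs;
    this.threshold = threshold;
    this.hangoverMs = hangoverMs;
  }

  // Feed one frame probability; returns zero or one event for that frame.
  push(prob: number): VadEvent[] {
    const events: VadEvent[] = [];
    const frameStartMs = this.clockMs;
    this.clockMs += this.frameMs;

    if (prob >= this.threshold) {
      this.silentMs = 0;
      if (!this.speaking) {
        this.speaking = true;
        events.push({ kind: "speech_start", atMs: frameStartMs });
      }
    } else if (this.speaking) {
      this.silentMs += this.frameMs;
      if (this.silentMs >= this.hangoverMs) {
        // Report the end at the point where the silence began.
        this.speaking = false;
        events.push({ kind: "speech_end", atMs: this.clockMs - this.silentMs });
        this.silentMs = 0;
      }
    }
    return events;
  }
}
```

The only state held is in memory for the lifetime of the stream, so the same object could sit behind either a one-shot IPC call or a streaming transcription path.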

Capabilities (manifest)

  • capabilities.device.microphone.
  • capabilities.models[] for the VAD model.

Gaps

Silero VAD specifically is the missing primitive for voice apps. Adding it would let Moshi (and future voice apps) detect end-of-utterance and barge-in cleanly. Tiny model (~1 M params), MIT, no friction. Tier 1 BACKLOG.
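
To make the end-of-utterance and barge-in cases concrete, here is a minimal sketch of how a voice app might interpret speech boundaries from a VAD. It assumes the app already knows whether the assistant is currently producing audio; reduceTurn, TurnState, and the signal names are hypothetical and do not describe Moshi's or Locara's actual integration.

```typescript
// Sketch: derive barge-in and end-of-utterance signals from VAD boundaries.
// All names are illustrative.

type SpeechBoundary = "speech_start" | "speech_end";
type TurnSignal = "barge_in" | "end_of_utterance" | "none";

interface TurnState {
  assistantSpeaking: boolean; // true while assistant audio is playing
  userSpeaking: boolean;
}

interface TurnStep {
  state: TurnState;
  signal: TurnSignal;
}

function reduceTurn(state: TurnState, boundary: SpeechBoundary): TurnStep {
  if (boundary === "speech_start") {
    // The user starting to talk over the assistant is a barge-in;
    // the caller is expected to stop playback when it sees this signal.
    const signal: TurnSignal = state.assistantSpeaking ? "barge_in" : "none";
    return { state: { assistantSpeaking: false, userSpeaking: true }, signal };
  }
  // A speech_end with no open user turn carries no information.
  if (!state.userSpeaking) return { state, signal: "none" };
  return { state: { ...state, userSpeaking: false }, signal: "end_of_utterance" };
}
```

Because the function is pure, it can be unit-tested without a microphone and driven by whichever VAD backend ends up shipping.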

See also