voice-activity-detection
HF group: Audio · Status: ❌ not built · Tier 1 (high leverage)
What it is
Audio → time spans where speech occurs. The first stage of any robust speech pipeline — segments mic input into speech vs silence so downstream stages (ASR, omni LMs) only process useful audio and end-of-utterance is clean.
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Silero VAD | ~1 M | 2021 | MIT | The default everywhere | Runs in milliseconds, MIT, no telemetry, trained on 6000+ languages. |
| Pyannote VAD 3.1 | ~6 M | 2024 | MIT | Stronger on noisy / overlapped speech | Heavier; benefits from GPU. |
| WebRTC VAD | n/a | 2011 | BSD-3 | Classic; fast; lower quality | Pre-deep-learning. |
Infrastructure required
Inference
- ❌ Lightweight encoder inference (Silero is 1 M params — could be ONNX, TensorFlow Lite, or even a hand-rolled small kernel).
Input
- ✅ Audio capture (already shipped via
locara-microphone).
Output
- ❌ Speech-segment timestamps (start/end ms).
- Streaming variant emits events as speech starts/stops.
Storage
- ❌ Weights cache (tiny — 1 M params).
- No per-session state needed (purely streaming).
Interaction (IPC + SDK)
- ❌
audio.vad({ audio })IPC, OR wire intotranscribe.stream_*for VAD-aware streaming. - For voice-to-voice: integrated as a pre-processor on the mic stream.
Capabilities (manifest)
capabilities.device.microphone.capabilities.models[]for the VAD model.
Gaps
Silero VAD specifically is the missing primitive for voice apps. Adding it would let Moshi (and future voice apps) detect end-of- utterance and barge-in cleanly. Tiny model (~1 M params), MIT, no friction. Tier 1 BACKLOG.
See also
speech-to-text— VAD feeds invoice-to-voice— barge-in detectionaudio-classification- Crates:
locara-microphone,locara-whisper,locara-moshi - Index:
../modalities-and-models-survey.md