`voice-activity-detection`

HF group: Audio · Status: ❌ not built · Tier 1 (high leverage)

What it is

Audio → time spans where speech occurs. This is the first stage of any robust speech pipeline: it segments mic input into speech and silence so that downstream stages (ASR, omni LMs) process only useful audio and end-of-utterance detection is clean.

Open-weight models

Model	Params	Released	License	Quality	Notes
Silero VAD	~1 M	2021	MIT	The default everywhere	Runs in milliseconds; no telemetry; training data spans 6000+ languages.
Pyannote VAD 3.1	~6 M	2024	MIT	Stronger on noisy / overlapped speech	Heavier; benefits from GPU.
WebRTC VAD	n/a	2011	BSD-3	Classic; fast; lower quality	Pre-deep-learning.

Infrastructure required

Inference

❌ Lightweight encoder inference (Silero is ~1 M params, so it could run via ONNX, TensorFlow Lite, or even a small hand-rolled kernel).
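
Whichever runtime is chosen, the model will consume fixed-size windows of samples. The sketch below shows one way to prepare them, assuming 16 kHz mono 16-bit PCM and a 512-sample window (32 ms). Both values and all function names are illustrative assumptions, not something this page specifies; check the chosen model's documentation for its actual window size.

```typescript
// Assumed values for illustration only; the real window depends on the model.
const SAMPLE_RATE_HZ = 16_000;
const FRAME_SAMPLES = 512; // 32 ms at 16 kHz

/** Convert signed 16-bit PCM to floats in [-1, 1). */
function pcm16ToFloat32(pcm: Int16Array): Float32Array {
  return Float32Array.from(pcm, (sample) => sample / 32768);
}

/** Split samples into fixed-size frames, zero-padding the final partial frame. */
function frameAudio(
  samples: Float32Array,
  frameSamples: number = FRAME_SAMPLES,
): Float32Array[] {
  const frames: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += frameSamples) {
    const frame = new Float32Array(frameSamples); // zero-initialised
    frame.set(samples.subarray(start, start + frameSamples));
    frames.push(frame);
  }
  return frames;
}

/** Start time of a frame in milliseconds, used later for segment timestamps. */
function frameStartMs(
  frameIndex: number,
  frameSamples: number = FRAME_SAMPLES,
  sampleRateHz: number = SAMPLE_RATE_HZ,
): number {
  return (frameIndex * frameSamples * 1000) / sampleRateHz;
}
```

Zero-padding the last frame keeps every inference call the same shape; dropping the partial frame instead would be an equally reasonable choice.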

Input

✅ Audio capture (already shipped via locara-microphone).

Output

❌ Speech-segment timestamps (start/end ms).
Streaming variant emits events as speech starts/stops.
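
A minimal sketch of how both output shapes could be derived from per-frame speech probabilities, assuming the model yields one probability per frame. It uses two thresholds (hysteresis) plus a minimum-silence duration so that brief dips do not split an utterance. The type names, option names, and threshold values are hypothetical and would need tuning; none of them come from an existing API.

```typescript
interface SpeechSegment { startMs: number; endMs: number; }

type VadEvent =
  | { type: "speech_start"; atMs: number }
  | { type: "speech_end"; atMs: number };

interface SegmenterOptions {
  frameMs: number;      // duration of one frame
  onThreshold: number;  // probability at or above which speech starts
  offThreshold: number; // probability below which a frame counts as silence
  minSilenceMs: number; // silence required before a segment is closed
}

class StreamingSegmenter {
  private readonly opts: SegmenterOptions;
  private inSpeech = false;
  private frameIndex = 0;
  private silenceStartMs: number | null = null;

  constructor(opts: SegmenterOptions) {
    this.opts = opts;
  }

  /** Feed one frame's speech probability; returns any events it triggers. */
  push(prob: number): VadEvent[] {
    const { frameMs, onThreshold, offThreshold, minSilenceMs } = this.opts;
    const frameStart = this.frameIndex * frameMs;
    const frameEnd = frameStart + frameMs;
    this.frameIndex++;

    if (!this.inSpeech) {
      if (prob < onThreshold) return [];
      this.inSpeech = true;
      this.silenceStartMs = null;
      return [{ type: "speech_start", atMs: frameStart }];
    }
    if (prob >= offThreshold) {
      this.silenceStartMs = null; // still speech, including the hysteresis band
      return [];
    }
    if (this.silenceStartMs === null) this.silenceStartMs = frameStart;
    if (frameEnd - this.silenceStartMs < minSilenceMs) return [];
    // Enough silence: the segment ended where the silence began.
    const atMs = this.silenceStartMs;
    this.inSpeech = false;
    this.silenceStartMs = null;
    return [{ type: "speech_end", atMs }];
  }

  /** Close any open segment at end of input. */
  flush(): VadEvent[] {
    if (!this.inSpeech) return [];
    const atMs = this.silenceStartMs ?? this.frameIndex * this.opts.frameMs;
    this.inSpeech = false;
    this.silenceStartMs = null;
    return [{ type: "speech_end", atMs }];
  }
}

/** Batch variant: run the streaming segmenter over a whole clip. */
function segmentsFromProbabilities(
  probs: number[],
  opts: SegmenterOptions,
): SpeechSegment[] {
  const segmenter = new StreamingSegmenter(opts);
  const events: VadEvent[] = [];
  for (const p of probs) events.push(...segmenter.push(p));
  events.push(...segmenter.flush());

  const segments: SpeechSegment[] = [];
  let startMs: number | null = null;
  for (const e of events) {
    if (e.type === "speech_start") {
      startMs = e.atMs;
    } else if (startMs !== null) {
      segments.push({ startMs, endMs: e.atMs });
      startMs = null;
    }
  }
  return segments;
}
```

Building the batch variant on top of the streaming one keeps a single source of truth for the thresholds, so offline and live results agree.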

Storage

❌ Weights cache (tiny: ~1 M params).
No per-session state needs to be persisted (purely streaming; any running state lives in memory only).

Interaction (IPC + SDK)

❌ audio.vad({ audio }) IPC, OR wire into transcribe.stream_* for VAD-aware streaming.
For voice-to-voice: integrated as a pre-processor on the mic stream.
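
As a sketch of the pre-processor role, the gate below forwards only frames marked as speech, plus a short pre-roll so that the first sounds of an utterance are not clipped. The class name, its parameters, and the idea of a per-frame `isSpeech` flag are assumptions made for illustration; the actual IPC and SDK surface is not defined yet.

```typescript
/**
 * Hypothetical mic-stream gate: drops silence, forwards speech frames, and
 * replays a few buffered frames from just before speech began.
 */
class SpeechGate<Frame> {
  private readonly preRollFrames: number;
  private preRoll: Frame[] = [];
  private open = false;

  constructor(preRollFrames: number) {
    this.preRollFrames = preRollFrames;
  }

  /** Returns the frames to forward downstream for this input frame. */
  push(frame: Frame, isSpeech: boolean): Frame[] {
    if (isSpeech) {
      // On the first speech frame, flush the buffered pre-roll ahead of it.
      const out = this.open ? [frame] : [...this.preRoll, frame];
      this.open = true;
      this.preRoll = [];
      return out;
    }
    // Silence: close the gate and keep only the most recent frames.
    this.open = false;
    this.preRoll.push(frame);
    if (this.preRoll.length > this.preRollFrames) this.preRoll.shift();
    return [];
  }
}
```

With a pre-roll of one frame, a sequence of two silent frames followed by a speech frame forwards the second silent frame and the speech frame, and drops the first.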

Capabilities (manifest)

capabilities.device.microphone.
capabilities.models[] for the VAD model.

Gaps

Silero VAD specifically is the missing primitive for voice apps. Adding it would let Moshi (and future voice apps) detect end-of-utterance and barge-in cleanly. The model is tiny (~1 M params) and MIT-licensed, so there is no friction. Tier 1 BACKLOG.
