`speech-to-text` (ASR)

HF group: Audio · Status: ✅ shipped

HF aliases: automatic-speech-recognition.

What it is

Audio → text + word/segment timestamps. Foundational for any voice-input app.

Open-weight models

Model	Params	Released	License	Quality	Notes
Whisper-tiny.en	39 M	2022	MIT	Surprisingly good for size	~75 MB Q4.
Whisper-base.en	74 M	2022	MIT	Default Locara pick	~150 MB Q4.
Whisper-large-v3	1.55 B	2023	MIT	Best Whisper variant	Multilingual, ~3 GB Q4.
Distil-Whisper-large-v3	~756 M	2024	MIT	6× faster than large-v3	Slight quality drop, English.
NVIDIA Parakeet-TDT-1.1B	1.1 B	2024	CC-BY-4.0	Tops the OpenASR leaderboard	English-only.
Apple SpeechAnalyzer	n/a	macOS 15+	Apple	High quality	Native API; zero RAM cost.

Infrastructure required

Inference

✅ locara-whisper wraps whisper.cpp with Metal acceleration. Implements locara_core::TranscribeBackend.
❌ Apple SpeechAnalyzer Swift sidecar (BACKLOG item; would zero out the RAM cost on macOS 15+).
❌ NVIDIA Parakeet runtime (Apache-2.0; would need separate path — not whisper.cpp).

Input

✅ Mic capture via locara-microphone (Float32 PCM frame stream).
✅ System-audio capture via locara-screencapture-audio.
✅ Resampling to 16 kHz (Whisper’s native rate).
❌ VAD-aware streaming segmentation — BACKLOG: pair with Silero VAD.

Output

✅ Streaming text events: partial transcript + final segment with timestamps.
✅ Optional word-level timestamps (used by apps/transcribe).

Storage

✅ Weights via locara-models::Cache.
✅ Session state held in TranscribeBackend between stream_start and stream_stop.
App-side: transcripts persisted via locara-storage (per-app SQLite).

Interaction (IPC + SDK)

✅ IPC: transcribe.from_pcm (one-shot), transcribe.stream_start, transcribe.stream_push, transcribe.stream_stop.
✅ SDK: transcribe.fromFile(path), transcribe.fromMic({ source }) in packages/sdk/src/transcribe.ts.
✅ Wire format: PCM frames base64-encoded for the streaming push (samples_b64).

Capabilities (manifest)

✅ capabilities.device.microphone: true for live ASR.
✅ capabilities.fs.user-selected: true for “transcribe a file the user picks”.
✅ capabilities.models[] lists the Whisper model.
✅ Modality declaration: "modalities": ["speech-to-text"].

Gaps

Apple SpeechAnalyzer Swift-sidecar integration (BACKLOG: “Wire apple-speech-analyzer as default STT on macOS 15+”).
NVIDIA Parakeet for users wanting a higher-quality non-Whisper pick.
Pairing with a real voice-activity-detection model (Silero VAD) for cleaner segmentation.

See also

voice-activity-detection
audio-text-to-text
voice-to-voice
Crates: locara-whisper, locara-microphone, locara-screencapture-audio
App: apps/transcribe
Index: ../modalities-and-models-survey.md