speech-to-text (ASR)
HF group: Audio · Status: ✅ shipped
HF aliases: automatic-speech-recognition.
What it is
Audio → text + word/segment timestamps. Foundational for any voice-input app.
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Whisper-tiny.en | 39 M | 2022 | MIT | Surprisingly good for size | ~75 MB at FP16 (a 4-bit quant would be smaller). |
| Whisper-base.en | 74 M | 2022 | MIT | Default Locara pick | ~150 MB at FP16. |
| Whisper-large-v3 | 1.55 B | 2023 | MIT | Best Whisper variant | Multilingual, ~3 GB at FP16. |
| Distil-Whisper-large-v3 | ~756 M | 2024 | MIT | 6× faster than large-v3 | Slight quality drop, English. |
| NVIDIA Parakeet-TDT-1.1B | 1.1 B | 2024 | CC-BY-4.0 | Tops the OpenASR leaderboard | English-only. |
| Apple SpeechAnalyzer | n/a | macOS 15+ | Apple | High quality | Native API; zero RAM cost. |
Infrastructure required
Inference
- ✅ `locara-whisper` wraps whisper.cpp with Metal acceleration. Implements `locara_core::TranscribeBackend`.
- ❌ Apple SpeechAnalyzer Swift sidecar (BACKLOG item; would zero out the RAM cost on macOS 15+).
- ❌ NVIDIA Parakeet runtime (Apache-2.0; would need separate path — not whisper.cpp).
Input
- ✅ Mic capture via `locara-microphone` (Float32 PCM frame stream).
- ✅ System-audio capture via `locara-screencapture-audio`.
- ✅ Resampling to 16 kHz (Whisper’s native rate).
- ❌ VAD-aware streaming segmentation — BACKLOG: pair with Silero VAD.
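
To make the resampling step concrete, here is a minimal sketch of bringing a Float32 PCM frame down to 16 kHz with linear interpolation. The function name `resampleLinear` is hypothetical, and linear interpolation is a stand-in chosen for brevity; this page does not say which resampler the Locara crates actually use.

```typescript
// Illustrative only: a linear-interpolation resampler for mono Float32 PCM.
// Assumption: the real pipeline may use a different (likely filtered) method.
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number,
): Float32Array {
  if (fromRate <= 0 || toRate <= 0) {
    throw new RangeError("sample rates must be positive");
  }
  // Nothing to do: return a copy so callers can mutate the result safely.
  if (input.length === 0 || fromRate === toRate) return input.slice();

  const outLength = Math.max(1, Math.round((input.length * toRate) / fromRate));
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample

  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.min(Math.floor(pos), input.length - 1);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    // Blend the two nearest input samples.
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}

// Usage: a 48 kHz mic frame becomes a 16 kHz frame one third as long.
const frame48k = Float32Array.from({ length: 480 }, (_, i) => Math.sin(i / 10));
const frame16k = resampleLinear(frame48k, 48_000, 16_000);
```

Plain interpolation does no low-pass filtering, so downsampling this way can alias high-frequency content. A production path would normally filter first.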
Output
- ✅ Streaming text events: partial transcript + final segment with timestamps.
- ✅ Optional word-level timestamps (used by `apps/transcribe`).
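
One way an app might consume the partial/final event stream is sketched below. Every type and field name here (`TranscribeEvent`, `kind`, `startMs`, `words`, and so on) is an assumption made for illustration; the actual event schema is not given on this page.

```typescript
// Hypothetical event shapes; the real wire schema may differ.
interface WordTiming { word: string; startMs: number; endMs: number }
interface FinalSegment {
  text: string;
  startMs: number;
  endMs: number;
  words?: WordTiming[]; // present only when word-level timestamps are requested
}
type TranscribeEvent =
  | { kind: "partial"; text: string }
  | { kind: "final"; segment: FinalSegment };

interface TranscriptState { committed: FinalSegment[]; pending: string }

const emptyTranscript: TranscriptState = { committed: [], pending: "" };

// A partial replaces the in-progress text; a final commits a segment and
// clears the in-progress text.
function applyEvent(state: TranscriptState, event: TranscribeEvent): TranscriptState {
  if (event.kind === "partial") {
    return { committed: state.committed, pending: event.text };
  }
  return { committed: [...state.committed, event.segment], pending: "" };
}

// Join committed segments and any pending partial into display text.
function renderTranscript(state: TranscriptState): string {
  return [...state.committed.map((s) => s.text), state.pending]
    .filter((t) => t.length > 0)
    .join(" ");
}
```

Keeping partials separate from committed segments means a UI can restyle or overwrite the in-progress text without touching segments that already carry final timestamps.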
Storage
- ✅ Weights via `locara-models::Cache`.
- ✅ Session state held in `TranscribeBackend` between `stream_start` and `stream_stop`.
- App-side: transcripts persisted via `locara-storage` (per-app SQLite).
Interaction (IPC + SDK)
- ✅ IPC: `transcribe.from_pcm` (one-shot), `transcribe.stream_start`, `transcribe.stream_push`, `transcribe.stream_stop`.
- ✅ SDK: `transcribe.fromFile(path)`, `transcribe.fromMic({ source })` in `packages/sdk/src/transcribe.ts`.
- ✅ Wire format: PCM frames base64-encoded for the streaming push (`samples_b64`).
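
The base64 wire format can be sketched as follows. The only name taken from this page is the `samples_b64` field; the helper names, the raw Float32 byte layout (platform byte order, little-endian on common hardware), and the message envelope in `buildStreamPush` are assumptions for illustration.

```typescript
// Encode Float32 PCM samples as base64 over their raw bytes.
// Assumption: the receiver expects 32-bit floats in little-endian order.
function encodeSamplesB64(samples: Float32Array): string {
  const bytes = new Uint8Array(samples.buffer, samples.byteOffset, samples.byteLength);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Inverse of encodeSamplesB64; throws if the payload is not a whole number
// of 4-byte floats.
function decodeSamplesB64(b64: string): Float32Array {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Float32Array(bytes.buffer);
}

// Hypothetical envelope for a streaming push; only `samples_b64` comes from
// the description above.
function buildStreamPush(samples: Float32Array) {
  return {
    method: "transcribe.stream_push",
    params: { samples_b64: encodeSamplesB64(samples) },
  };
}
```

Base64 inflates the payload by roughly a third, which is usually an acceptable trade for keeping the IPC messages plain text.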
Capabilities (manifest)
- ✅ `capabilities.device.microphone: true` for live ASR.
- ✅ `capabilities.fs.user-selected: true` for “transcribe a file the user picks”.
- ✅ `capabilities.models[]` lists the Whisper model.
- ✅ Modality declaration: `"modalities": ["speech-to-text"]`.
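
Put together, the declarations above might look like the object below. This is a sketch: the nesting is inferred from the dotted key names, and the `Manifest` type, the string form of `models[]` entries, the model identifier, and the `missingForLiveAsr` helper are all hypothetical.

```typescript
// Hypothetical manifest shape inferred from the key paths listed above.
interface Manifest {
  modalities: string[];
  capabilities: {
    device?: { microphone?: boolean };
    fs?: { "user-selected"?: boolean };
    models?: string[]; // assumption: entries may be richer objects in practice
  };
}

const manifest: Manifest = {
  modalities: ["speech-to-text"],
  capabilities: {
    device: { microphone: true },      // live ASR
    fs: { "user-selected": true },     // transcribe a file the user picks
    models: ["whisper-base.en"],       // hypothetical model identifier
  },
};

// Report which declarations a live-ASR app would still be missing.
function missingForLiveAsr(m: Manifest): string[] {
  const missing: string[] = [];
  if (!m.modalities.includes("speech-to-text")) {
    missing.push('modalities: "speech-to-text"');
  }
  if (m.capabilities.device?.microphone !== true) {
    missing.push("capabilities.device.microphone");
  }
  if ((m.capabilities.models ?? []).length === 0) {
    missing.push("capabilities.models[]");
  }
  return missing;
}
```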
Gaps
- Apple SpeechAnalyzer Swift-sidecar integration (BACKLOG: “Wire `apple-speech-analyzer` as default STT on macOS 15+”).
- NVIDIA Parakeet for users wanting a higher-quality non-Whisper pick.
- Pairing with a real `voice-activity-detection` model (Silero VAD) for cleaner segmentation.
See also
- `voice-activity-detection`, `audio-text-to-text`, `voice-to-voice`
- Crates: `locara-whisper`, `locara-microphone`, `locara-screencapture-audio`
- App: `apps/transcribe`
- Index: `../modalities-and-models-survey.md`