Locara

speech-to-text (ASR)

HF group: Audio · Status: ✅ shipped

HF aliases: automatic-speech-recognition.

What it is

Converts audio to text with segment-level and, optionally, word-level timestamps. Foundational for any voice-input app.

Open-weight models

| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Whisper-tiny.en | 39 M | 2022 | MIT | Surprisingly good for size | ~75 MB Q4. |
| Whisper-base.en | 74 M | 2022 | MIT | Default Locara pick | ~150 MB Q4. |
| Whisper-large-v3 | 1.55 B | 2023 | MIT | Best Whisper variant | Multilingual, ~3 GB Q4. |
| Distil-Whisper-large-v3 | ~756 M | 2024 | MIT | 6× faster than large-v3 | Slight quality drop, English. |
| NVIDIA Parakeet-TDT-1.1B | 1.1 B | 2024 | CC-BY-4.0 | Tops the OpenASR leaderboard | English-only. |
| Apple SpeechAnalyzer | n/a | macOS 15+ | Apple | High quality | Native API; zero RAM cost. |

Infrastructure required

Inference

  • ✅ locara-whisper wraps whisper.cpp with Metal acceleration and implements locara_core::TranscribeBackend.
  • ❌ Apple SpeechAnalyzer Swift sidecar (BACKLOG item; would zero out the RAM cost on macOS 15+).
  • ❌ NVIDIA Parakeet runtime (Apache-2.0; would need a separate inference path, since whisper.cpp does not run it).

Input

  • ✅ Mic capture via locara-microphone (Float32 PCM frame stream).
  • ✅ System-audio capture via locara-screencapture-audio.
  • ✅ Resampling to 16 kHz (Whisper’s native rate).
  • ❌ VAD-aware streaming segmentation (BACKLOG item: pair with Silero VAD).
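
To make the 16 kHz resampling step concrete, here is a minimal sketch of a linear-interpolation resampler. It is not Locara's implementation: the function name resampleLinear and the constant TARGET_RATE are hypothetical, mono Float32 input is assumed, and no anti-aliasing filter is applied, so a production path would likely low-pass the signal before downsampling.

```typescript
// Naive linear-interpolation resampler (illustrative; not Locara's actual code).
// Assumes mono Float32 PCM. No low-pass filter is applied, so content above
// half the target rate will alias when downsampling.
const TARGET_RATE = 16_000; // Whisper's native sample rate

function resampleLinear(
  input: Float32Array,
  sourceRate: number,
  targetRate: number = TARGET_RATE,
): Float32Array {
  if (sourceRate <= 0 || targetRate <= 0) {
    throw new RangeError("sample rates must be positive");
  }
  if (sourceRate === targetRate || input.length === 0) {
    return input.slice(); // nothing to do; return a copy
  }
  const outLength = Math.floor((input.length * targetRate) / sourceRate);
  const output = new Float32Array(outLength);
  const step = sourceRate / targetRate; // input samples per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1); // clamp at the last sample
    const frac = pos - left;
    output[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return output;
}
```

One second of 48 kHz audio (48,000 samples) comes out as 16,000 samples, which is the frame rate the Whisper backend expects.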

Output

  • ✅ Streaming text events: partial transcript + final segment with timestamps.
  • ✅ Optional word-level timestamps (used by apps/transcribe).
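
The partial/final split above could be consumed roughly as follows. This is a sketch under assumptions: the event shapes, field names (kind, startMs, endMs, words), and the helpers applyEvent and render are hypothetical and not taken from the Locara SDK. The point it illustrates is that a partial replaces the previous partial, while a final segment commits text and clears it.

```typescript
// Hypothetical event shapes; field names are illustrative, not the SDK's types.
interface WordTiming {
  word: string;
  startMs: number;
  endMs: number;
}

type TranscribeEvent =
  | { kind: "partial"; text: string }
  | { kind: "final"; text: string; startMs: number; endMs: number; words?: WordTiming[] };

interface TranscriptView {
  committed: string[]; // text of finalized segments, in order
  pending: string;     // latest partial for the segment still in progress
}

function applyEvent(view: TranscriptView, event: TranscribeEvent): TranscriptView {
  switch (event.kind) {
    case "partial":
      // A partial overwrites the previous partial; it is never appended.
      return { committed: view.committed, pending: event.text };
    case "final":
      // A final segment commits its text and clears the pending partial.
      return { committed: [...view.committed, event.text], pending: "" };
  }
}

function render(view: TranscriptView): string {
  return [...view.committed, view.pending].filter((s) => s.length > 0).join(" ");
}
```

Keeping committed and pending text separate lets a UI restyle or overwrite the in-progress segment without touching text that is already final.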

Storage

  • ✅ Weights via locara-models::Cache.
  • ✅ Session state held in TranscribeBackend between stream_start and stream_stop.
  • App-side: transcripts persisted via locara-storage (per-app SQLite).

Interaction (IPC + SDK)

  • ✅ IPC: transcribe.from_pcm (one-shot), transcribe.stream_start, transcribe.stream_push, transcribe.stream_stop.
  • ✅ SDK: transcribe.fromFile(path), transcribe.fromMic({ source }) in packages/sdk/src/transcribe.ts.
  • ✅ Wire format: PCM frames base64-encoded for the streaming push (samples_b64).
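
As an illustration of the streaming wire format, the sketch below packs Float32 PCM samples into a base64 string suitable for a samples_b64 field. Only the method name transcribe.stream_push and the field name samples_b64 come from this page; little-endian byte order, padded standard base64, the session_id field, and the helper names are assumptions. The base64 encoder is written by hand so the example does not depend on Node or browser globals.

```typescript
// Sketch: pack Float32 PCM samples into base64 for `samples_b64`.
// Assumptions (not confirmed by this page): little-endian 32-bit floats and
// standard base64 with "=" padding.
const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

function bytesToBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const has1 = i + 1 < bytes.length;
    const has2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = has1 ? bytes[i + 1] : 0;
    const b2 = has2 ? bytes[i + 2] : 0;
    out += B64[b0 >> 2];
    out += B64[((b0 & 0x03) << 4) | (b1 >> 4)];
    out += has1 ? B64[((b1 & 0x0f) << 2) | (b2 >> 6)] : "=";
    out += has2 ? B64[b2 & 0x3f] : "=";
  }
  return out;
}

function encodeSamples(samples: Float32Array): string {
  const bytes = new Uint8Array(samples.length * 4);
  const view = new DataView(bytes.buffer);
  // The `true` flag writes each float in little-endian order.
  samples.forEach((s, i) => view.setFloat32(i * 4, s, true));
  return bytesToBase64(bytes);
}

// Hypothetical request shape; `session_id` is an illustrative field name.
function buildStreamPush(sessionId: string, samples: Float32Array) {
  return {
    method: "transcribe.stream_push",
    params: { session_id: sessionId, samples_b64: encodeSamples(samples) },
  };
}
```

Base64 inflates the payload by about a third, so a frame of 1,600 samples (100 ms at 16 kHz, 6,400 bytes) becomes roughly 8.5 KB of text per push.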

Capabilities (manifest)

  • capabilities.device.microphone: true for live ASR.
  • capabilities.fs.user-selected: true for “transcribe a file the user picks”.
  • capabilities.models[] lists the Whisper model.
  • ✅ Modality declaration: "modalities": ["speech-to-text"].
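
Put together, the manifest entries above might look like the fragment below, written as a TypeScript object literal. The key names are the ones listed on this page; the nesting, the top-level placement of modalities, and the model identifier string are assumptions, so check the manifest schema before copying it.

```typescript
// Illustrative manifest fragment. Key names come from the list above; the
// nesting and the model identifier format are assumptions.
const manifest = {
  modalities: ["speech-to-text"],
  capabilities: {
    device: { microphone: true },  // required for live ASR
    fs: { "user-selected": true }, // required to transcribe a user-picked file
    models: ["whisper-base.en"],   // hypothetical identifier format
  },
};
```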

Gaps

  • Apple SpeechAnalyzer Swift-sidecar integration (BACKLOG: “Wire apple-speech-analyzer as default STT on macOS 15+”).
  • NVIDIA Parakeet for users wanting a higher-quality non-Whisper pick.
  • Pairing with a real voice-activity-detection model (Silero VAD) for cleaner segmentation.

See also