Locara

Whisper and the Local STT Landscape

What it is: A survey of speech-to-text (STT) options for local execution on Apple Silicon, anchored on OpenAI Whisper (the open-weights model family) and the runtimes around it: whisper.cpp (ggerganov’s C/C++ port), WhisperKit (Argmax’s CoreML-optimized implementation), FluidAudio CoreML, and Apple SpeechAnalyzer (the new system-level API in macOS 26).

Status: Multiple competing runtimes, all maturing fast in 2025–26. Whisper Large-v3 reaches ~2.7% WER on clean benchmark audio and 8–12% WER on real-world conversational speech. In Argmax’s benchmarks, Apple’s SpeechAnalyzer matches mid-tier Whisper models on long-form conversational speech. Voxtral (Mistral AI, 2025) is the newest open-weights STT entrant.

Most relevant to Locara: Locara’s phase-1 reference app, Transcribe, rides directly on this stack. The choice of runtime, and whether to depend on Apple’s system API or a portable open-weights runtime, is one of the highest-stakes engineering decisions in v1.

Background

OpenAI released Whisper in September 2022 as an open-weights multilingual STT model trained on 680,000 hours of labeled audio. Models came in sizes from tiny (39M params) to large (1.55B), with subsequent updates to large-v2 (Dec 2022), large-v3 (Nov 2023), and large-v3-turbo (Sept 2024; a pruned and fine-tuned large-v3 with far fewer decoder layers, faster at a slight quality cost).

Within weeks of Whisper’s release, ggerganov published whisper.cpp, a C/C++ port built on GGML (the same tensor library that became the substrate for llama.cpp). whisper.cpp added Apple Neural Engine acceleration via CoreML in 2023, claiming a >3× speedup over CPU-only execution. By 2026 it is the most portable Whisper runtime, with CPU (AVX/NEON), Metal, CUDA, OpenCL, Vulkan, and CoreML/ANE backends in one codebase.

WhisperKit (from Argmax, founded by ex-Apple engineers) is a Swift-native, CoreML-optimized Whisper implementation released in 2024. It targets Apple Silicon specifically, leveraging ANE more aggressively than whisper.cpp’s CoreML path. FluidAudio CoreML is a newer entrant in similar territory.

In 2025, Apple announced SpeechAnalyzer as part of macOS 26 — a system-level API for high-quality on-device speech recognition. Argmax benchmarked SpeechAnalyzer as matching mid-tier OpenAI Whisper models on long-form conversational speech, on M4 Mac mini hardware.

In parallel, Voxtral (Mistral AI, 2025) arrived as a newer open-weights STT model that competes with Whisper on quality and multilingual coverage.

The current options for a Locara Mac app (2026)

| Runtime | Substrate | Apple Silicon perf | Cross-platform | Locara fit |
|---|---|---|---|---|
| whisper.cpp + CoreML | C/C++ + ggml | Good (3× CPU) | Yes (every platform) | Default fallback; portable |
| WhisperKit | Swift + CoreML | Best (16% over whisper.cpp on some configs) | No (Apple-only) | Best for Mac-native apps |
| FluidAudio CoreML | Swift + CoreML | Comparable to WhisperKit | No | Alternative; smaller community |
| Apple SpeechAnalyzer | System API | Comparable to mid-Whisper | No | Zero binary cost; system-level |
| MLX-Whisper (mlx-examples) | MLX | Underexplored | No | Promising; smaller footprint |
| OpenAI Python whisper | PyTorch / MPS | Slow (~7–9 tok/s on M-series) | Yes | Reference quality; not for production |

Key tradeoffs

  • Latency vs. quality. tiny and base run in real time on a phone; large-v3 needs a desktop-class GPU or the ANE to keep up with real time. Most Locara use cases want small/medium-class quality at real-time speed on M-series.
  • Streaming vs. file-mode. Whisper natively transcribes 30-second windows; real-time streaming requires VAD + chunking. whisper.cpp has streaming wrappers; WhisperKit ships streaming first-class.
  • Diarization gap. Vanilla Whisper doesn’t speaker-label. Add-on systems (pyannote.audio, WhisperX) handle this but increase complexity. Apple SpeechAnalyzer reportedly includes diarization in its newer versions.
  • Model size on disk. large-v3 is ~3 GB FP16, ~1.5 GB Q5, ~700 MB Q4. Real-world Mac apps usually ship small (~250 MB) or medium (~770 MB).
  • Language coverage. Whisper supports 99 languages; quality drops sharply outside top 20. Apple SpeechAnalyzer is initially English-strong; Voxtral targets broader multilingual parity.
  • WER baseline. Whisper Large-v3: ~2.7% on librispeech-clean, ~8–12% on conversational real-world speech.
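
To make the WER figures above concrete, here is a minimal sketch of how word error rate is usually computed: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name `word_error_rate` and the lowercase/whitespace normalization are assumptions for illustration; published benchmarks apply their own text normalization, so results from this sketch will not match them exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For scale: at 10% WER, a 1,000-word transcript contains on the order of 100 word errors, which is why the gap between benchmark and conversational figures matters in practice.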

What worked

  • Whisper’s open release rebooted the STT space. Pre-Whisper, high-quality STT meant cloud APIs (Google Speech, AWS Transcribe). Whisper put state-of-the-art on a laptop overnight.
  • whisper.cpp’s portability made Whisper usable on every device class within months of its release.
  • CoreML acceleration via the ANE turned Apple Silicon into the best-per-watt Whisper platform.
  • WhisperKit’s Swift-native API filled the gap for Mac-native apps that wanted CoreML without C++ FFI gymnastics.
  • Apple SpeechAnalyzer provides a “good enough” STT with zero binary cost — apps can use it for casual transcription without shipping any model.
  • The streaming wrapper ecosystem (whisper-streaming, whisper-live, etc.) made real-time transcription viable.
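
The chunking that such streaming wrappers perform can be illustrated with a simplified sketch: cut the sample stream into fixed windows that overlap, so that a word falling on a boundary appears whole in at least one window. The 30-second window comes from Whisper’s design as described above; the 5-second overlap, the 16 kHz default, and the function name `window_bounds` are assumptions for illustration. Real wrappers typically add VAD and transcript merging on top of this.

```python
from typing import Iterator, Tuple


def window_bounds(
    total_samples: int,
    sample_rate: int = 16_000,
    window_s: float = 30.0,
    overlap_s: float = 5.0,   # assumed value, not taken from any wrapper
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices of overlapping fixed-size windows."""
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the window")

    start = 0
    while start < total_samples:
        end = min(start + window, total_samples)
        yield start, end
        if end == total_samples:
            break  # the final (possibly shorter) window has been emitted
        start += step
```

Each `(start, end)` pair would be sliced from the audio buffer and handed to the recognizer; text from the overlapping region then has to be deduplicated when the per-window transcripts are merged.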

What failed / criticisms

  • Hallucinations on silence and music. Whisper invents plausible-sounding text when the audio contains no speech, especially with large-v3. This is a recurring source of real-world bugs.
  • Diarization is still a gap. Most local Whisper deployments don’t speaker-label cleanly without add-on systems.
  • Languages outside top 20 are weak. Quality drops sharply.
  • Streaming is a bolt-on, not native. Whisper’s design assumes 30-second windows; real-time wrappers do clever chunking but introduce latency and edge cases.
  • Quality varies with the runtime. The same Whisper weights produce subtly different results across whisper.cpp, WhisperKit, and the reference Python. Numerical drift, FP16 accumulation, and ANE quirks all contribute.
  • WhisperKit and FluidAudio fragment the Apple-Silicon space. Two competing CoreML-optimized runtimes target the same niche.
  • Model file sizes are still substantial. large is too big for many consumer apps; small and medium are workable but have a quality ceiling.
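
One common mitigation for the silence hallucinations listed above is to gate audio before it reaches the model. The sketch below uses a plain RMS energy threshold over short frames; it is an illustration under stated assumptions, not a method this document attributes to any of the runtimes. The names `frame_rms` and `is_probably_silent`, the 30 ms frame size, and both threshold values are hypothetical and would need tuning on real recordings. An energy gate also does nothing for music, which needs a proper VAD or classifier.

```python
import math
from typing import List, Sequence


def frame_rms(samples: Sequence[float], frame_len: int) -> List[float]:
    """RMS energy of consecutive non-overlapping frames (trailing partial frame dropped)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def is_probably_silent(
    samples: Sequence[float],
    sample_rate: int = 16_000,
    frame_ms: int = 30,
    rms_threshold: float = 0.01,        # hypothetical; assumes samples in [-1, 1]
    min_active_fraction: float = 0.05,  # hypothetical
) -> bool:
    """True when too few frames exceed the energy threshold to be worth decoding."""
    frame_len = sample_rate * frame_ms // 1000
    energies = frame_rms(samples, frame_len)
    if not energies:
        return True  # shorter than one frame: nothing to transcribe
    active = sum(1 for e in energies if e > rms_threshold)
    return active / len(energies) < min_active_fraction
```

A caller would skip decoding (or discard the resulting text) for any chunk where `is_probably_silent` returns `True`, rather than trusting whatever the model emits for it.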

Specific learnings for Locara

  1. Default to WhisperKit on Apple Silicon, with whisper.cpp as the cross-platform fallback. This is the same pattern as defaulting to MLX with a llama.cpp fallback for LLMs, applied to STT. WhisperKit is the throughput leader on Mac; whisper.cpp covers Linux and Windows when v2 needs them.
  2. Don’t depend on Apple SpeechAnalyzer for the Transcribe app. It’s tempting (zero binary cost, system-level), but: (a) it ties you to macOS 26+, (b) it’s closed, (c) it doesn’t support the customization a Locara app might want (custom vocabularies, fine-tuning, model-version pinning). Use it as an option the app can declare in its manifest, not as the default.
  3. Pin a specific Whisper model version per device class. As with LLM model pinning, for example: whisper-large-v3-turbo-q5 for M2+ Pro/Max with adequate disk; whisper-medium-q5 for base M-series; whisper-small-q5 for older Intel Macs in v2. The manifest should declare device-class targets.
  4. Diarization is a separate capability, so declare it. Locara’s Transcribe app should declare requires.audioSpeakerDiarization: true in its manifest if it performs diarization, with the runtime selecting an appropriate diarization stack (pyannote, WhisperX, or whatever ships). Don’t bake it implicitly into the STT capability.
  5. Streaming is a UX feature, not a research project. Use a proven streaming wrapper (whisper.cpp’s stream binary, WhisperKit’s streaming API). Don’t build a custom chunker for v1.
  6. Hallucinations on silence are a real bug to design against. Locara’s Transcribe should detect and suppress hallucinated outputs from silence/music — this is a known failure mode that needs explicit handling, not “ignore and hope.”
  7. Use ANE acceleration, but treat it as opportunistic. ANE gives a ~3× speedup when it’s available and the model fits. Code defensively — fall back to GPU/CPU paths cleanly if ANE is busy or the model variant isn’t supported.
  8. Quantize aggressively for STT. Whisper Q5 is nearly indistinguishable from FP16 for typical use; Q4 is acceptable for most. Disk savings are meaningful (3 GB → 700 MB for large).
  9. Be honest about WER in the UX. Real-world conversational WER is 8–12%, not benchmark 2.7%. Surface uncertainty in the UI (e.g., low-confidence words highlighted) rather than presenting transcripts as ground truth.
  10. Watch Voxtral and Apple’s roadmap. STT is evolving fast; Locara’s runtime should be able to swap STT models without app-author changes. The manifest abstraction is the right level — apps declare “I need real-time STT with diarization in en+es,” runtime picks the model.
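
The manifest-driven selection described in items 1, 3, and 10 could look roughly like the sketch below. Everything in it is hypothetical: the `SttRequirements` and `SttChoice` types, the device-class labels, and the `select_stt` function are illustrative names, not Locara’s actual manifest schema or runtime API. Only the model identifiers are taken from item 3.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SttRequirements:
    """What an app declares; it never names a model or runtime directly."""
    languages: Tuple[str, ...]
    realtime: bool = False
    diarization: bool = False


@dataclass(frozen=True)
class SttChoice:
    runtime: str
    model: str
    needs_diarization_stack: bool


# Hypothetical device-class labels mapped to the pinned models from item 3.
PINNED_MODELS: Dict[str, str] = {
    "apple-silicon-pro-max": "whisper-large-v3-turbo-q5",
    "apple-silicon-base": "whisper-medium-q5",
    "intel-mac": "whisper-small-q5",
}


def select_stt(req: SttRequirements, device_class: str,
               whisperkit_available: bool) -> SttChoice:
    """Pick a runtime and pinned model for the declared requirements.

    This sketch only acts on `diarization`; `languages` and `realtime`
    are carried along but not used to filter models here.
    """
    if device_class not in PINNED_MODELS:
        raise ValueError(f"no pinned STT model for device class {device_class!r}")
    on_apple_silicon = device_class.startswith("apple-silicon")
    # WhisperKit by default on Apple Silicon, whisper.cpp as the fallback.
    runtime = "whisperkit" if (on_apple_silicon and whisperkit_available) else "whisper.cpp"
    return SttChoice(runtime, PINNED_MODELS[device_class], req.diarization)
```

Because the app only supplies an `SttRequirements` value, swapping in a different model or runtime later would mean changing the table and the selection function, not the app.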

References