Whisper and the Local STT Landscape
What it is: A survey of speech-to-text (STT) options for local execution on Apple Silicon, anchored on OpenAI Whisper (the open-weights model family) and the runtimes around it: whisper.cpp (ggerganov’s C++ port), WhisperKit (Argmax’s CoreML-optimized version), FluidAudio CoreML, and Apple SpeechAnalyzer (the new system-level API in macOS 26).
Status: Multiple competing runtimes, all maturing fast in 2025–26. Whisper Large-v3 reaches ~2.7% WER on clean benchmark audio (librispeech-clean) and 8–12% WER on real-world conversational speech. In Argmax’s benchmark, Apple’s SpeechAnalyzer matches mid-tier Whisper models. Voxtral (Mistral AI, 2025) is the newest open-weights STT entrant.
Most relevant to Locara: Locara’s phase-1 reference app Transcribe rides directly on this stack. The choice of runtime, and whether to depend on Apple’s system API or a portable open-weights runtime, is one of the highest-stakes engineering decisions in v1.
Background
OpenAI released Whisper in September 2022 as an open-weights multilingual STT model trained on 680,000 hours of labeled audio. Models came in sizes from tiny (39M params) to large (1.55B), with subsequent updates to large-v2 (Dec 2022), large-v3 (Nov 2023), and large-v3-turbo (Sept 2024 — distilled, faster, slight quality cost).
Within weeks of Whisper’s release, ggerganov released whisper.cpp as a C/C++ port using GGML (the same tensor library that became the substrate for llama.cpp). whisper.cpp added Apple Neural Engine acceleration via CoreML in 2023, claiming >3× speedup vs CPU-only. By 2026 it’s the most-portable Whisper runtime — CPU (AVX/NEON), Metal, CUDA, OpenCL, Vulkan, and CoreML/ANE backends in one codebase.
WhisperKit (from Argmax, founded by ex-Apple engineers) is a Swift-native, CoreML-optimized Whisper implementation released in 2024. It targets Apple Silicon specifically, leveraging ANE more aggressively than whisper.cpp’s CoreML path. FluidAudio CoreML is a newer entrant in similar territory.
In 2025, Apple announced SpeechAnalyzer as part of macOS 26 — a system-level API for high-quality on-device speech recognition. Argmax benchmarked SpeechAnalyzer as matching mid-tier OpenAI Whisper models on long-form conversational speech, on M4 Mac mini hardware.
In parallel, Voxtral (Mistral AI, 2025) arrived as a newer open-weights speech model that competes with Whisper on transcription quality in the major languages it targets.
The current options for a Locara Mac app (2026)
| Runtime | Substrate | Apple Silicon perf | Cross-platform | Locara fit |
|---|---|---|---|---|
| whisper.cpp + CoreML | C/C++ + ggml | Good (3× CPU) | Yes (every platform) | Default fallback; portable |
| WhisperKit | Swift + CoreML | Best (16% over whisper.cpp on some configs) | No (Apple-only) | Best for Mac-native apps |
| FluidAudio CoreML | Swift + CoreML | Comparable to WhisperKit | No | Alternative; smaller community |
| Apple SpeechAnalyzer | System API | Comparable to mid-Whisper | No | Zero binary cost; system-level |
| MLX-Whisper (mlx-examples) | MLX | Underexplored | No | Promising; smaller footprint |
| OpenAI Python whisper | PyTorch / MPS | Slow (~7–9 tok/s on M-series) | Yes | Reference quality; not for production |
Key tradeoffs
- Latency vs. quality. `tiny`/`base` runs in real-time on a phone; `large-v3` requires a desktop-class GPU/ANE for real-time. Most Locara use cases want `small`/`medium`-class quality at real-time on M-series.
- Streaming vs. file-mode. Whisper natively transcribes 30-second windows; real-time streaming requires VAD + chunking. whisper.cpp has streaming wrappers; WhisperKit ships streaming first-class.
- Diarization gap. Vanilla Whisper doesn’t speaker-label. Add-on systems (`pyannote.audio`, WhisperX) handle this but increase complexity. Apple SpeechAnalyzer reportedly includes diarization in its newer versions.
- Model size on disk. `large-v3` is ~3 GB FP16, ~1.5 GB Q5, ~700 MB Q4. Real-world Mac apps usually ship `small` (~250 MB) or `medium` (~770 MB).
- Language coverage. Whisper supports 99 languages; quality drops sharply outside top 20. Apple SpeechAnalyzer is initially English-strong; Voxtral targets broader multilingual parity.
- WER baseline. Whisper Large-v3: ~2.7% on librispeech-clean, ~8–12% on conversational real-world speech.
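
The WER figures quoted in this section are word-level edit distances divided by the number of reference words. A minimal sketch of that arithmetic follows. The function name `word_error_rate` is illustrative, and real benchmark scoring also normalizes punctuation and numerals, which this version only approximates by lowercasing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sat on a mat")` counts one substitution against six reference words, about 0.17. Because insertions count as errors, WER can exceed 1.0 when the hypothesis contains a lot of invented text, which is what hallucinated output on silence tends to look like in scoring.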
What worked
- Whisper’s open release rebooted the STT space. Pre-Whisper, high-quality STT meant cloud APIs (Google Speech, AWS Transcribe). Whisper put state-of-the-art on a laptop overnight.
- whisper.cpp’s portability made Whisper usable on every device class within months of its release.
- CoreML acceleration via the ANE turned Apple Silicon into the best-per-watt Whisper platform.
- WhisperKit’s Swift-native API filled the gap for Mac-native apps that wanted CoreML without C++ FFI gymnastics.
- Apple SpeechAnalyzer provides a “good enough” STT with zero binary cost — apps can use it for casual transcription without shipping any model.
- The streaming wrapper ecosystem (whisper-streaming, whisper-live, etc.) made real-time transcription viable.
What failed / criticisms
- Hallucinations on silence and music. Whisper invents plausible-sounding text when the audio has no speech, especially in `large-v3`. Real-world bug source.
- Diarization is still a gap. Most local Whisper deployments don’t speaker-label cleanly without add-on systems.
- Languages outside top 20 are weak. Quality drops sharply.
- Streaming is a bolt-on, not native. Whisper’s design assumes 30-second windows; real-time wrappers do clever chunking but introduce latency and edge cases.
- Quality varies with the runtime. The same Whisper weights produce subtly different results across whisper.cpp, WhisperKit, and the reference Python. Numerical drift, FP16 accumulation, ANE quirks all contribute.
- WhisperKit and FluidAudio fragment the Apple-Silicon space. Two competing CoreML-optimized runtimes both targeting the same niche.
- Model file sizes are still substantial. `large` is too big for many consumer apps; `small` and `medium` are workable but have a quality ceiling.
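
Two of the items above, hallucinated text on silence and chunking for 30-second windows, are commonly handled in the same pre-processing step: gate the audio before it reaches the model and cut what remains into windows. The sketch below is one way to do that under stated assumptions. It uses a crude RMS energy gate as a stand-in for a real VAD, and the function names, frame size, threshold, and gap length are all hypothetical defaults rather than values from any of the runtimes discussed here.

```python
import math
from typing import List, Sequence, Tuple

WINDOW_SECONDS = 30.0  # Whisper's native window length


def frame_rms(samples: Sequence[float], start: int, end: int) -> float:
    """Root-mean-square energy of samples[start:end]."""
    frame = samples[start:end]
    return math.sqrt(sum(x * x for x in frame) / len(frame)) if frame else 0.0


def speech_chunks(
    samples: Sequence[float],
    sample_rate: int,
    frame_seconds: float = 0.02,   # hypothetical frame size
    rms_threshold: float = 0.01,   # hypothetical gate; tune per input source
    max_gap_seconds: float = 0.5,  # shorter pauses stay inside one chunk
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges with speech-level energy.

    Each range is at most WINDOW_SECONDS long; silent stretches are dropped,
    so the model is never asked to transcribe them.
    """
    frame_len = max(1, round(frame_seconds * sample_rate))
    max_gap = round(max_gap_seconds * sample_rate)
    max_len = round(WINDOW_SECONDS * sample_rate)

    # 1. Collect runs of loud frames, bridging short pauses.
    segments: List[List[int]] = []
    for start in range(0, len(samples), frame_len):
        end = min(start + frame_len, len(samples))
        if frame_rms(samples, start, end) < rms_threshold:
            continue
        if segments and start - segments[-1][1] <= max_gap:
            segments[-1][1] = end          # extend the current segment
        else:
            segments.append([start, end])  # open a new segment

    # 2. Split anything longer than one model window.
    chunks: List[Tuple[int, int]] = []
    for seg_start, seg_end in segments:
        for start in range(seg_start, seg_end, max_len):
            chunks.append((start, min(start + max_len, seg_end)))
    return chunks
```

This sketch has two known limits. An energy gate lets music through, so it covers only the silence half of the hallucination problem. The hard split at 30 seconds can also land mid-word, where a production chunker would prefer to cut at the nearest pause.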
Specific learnings for Locara
- Default to WhisperKit on Apple Silicon, with whisper.cpp as the cross-platform fallback. Same MLX-default-with-llama.cpp-fallback pattern, applied to STT. WhisperKit is the throughput leader on Mac; whisper.cpp gets you Linux/Windows when v2 needs them.
- Don’t depend on Apple SpeechAnalyzer for the Transcribe app. It’s tempting (zero binary cost, system-level), but: (a) it ties you to macOS 26+, (b) it’s closed, (c) it doesn’t support the customization a Locara app might want (custom vocabularies, fine-tuning, model-version pinning). Use it as an option the app can declare in its manifest, not as the default.
- Pin a specific Whisper model version per device class. Like the LLM model pinning: `whisper-large-v3-turbo-q5` for M2+ Pro/Max with adequate disk; `whisper-medium-q5` for base M-series; `whisper-small-q5` for older Intel Macs in v2. The manifest should declare device-class targets.
- Diarization is a separate capability: declare it. Locara’s Transcribe app should declare `requires.audioSpeakerDiarization: true` in its manifest if it diarizes, with the runtime selecting an appropriate diarization stack (pyannote, WhisperX, or whatever shipped). Don’t bake it implicitly into the STT capability.
- Streaming is a UX feature, not a research project. Use a proven streaming wrapper (whisper.cpp’s stream binary, WhisperKit’s streaming API). Don’t build a custom chunker for v1.
- Hallucinations on silence are a real bug to design against. Locara’s Transcribe should detect and suppress hallucinated outputs from silence/music — this is a known failure mode that needs explicit handling, not “ignore and hope.”
- Use ANE acceleration, but treat it as opportunistic. ANE gives a ~3× speedup when it’s available and the model fits. Code defensively — fall back to GPU/CPU paths cleanly if ANE is busy or the model variant isn’t supported.
- Quantize aggressively for STT. Whisper Q5 is nearly indistinguishable from FP16 for typical use; Q4 is acceptable for most. Disk savings are meaningful (3 GB → 700 MB for `large`).
- Be honest about WER in the UX. Real-world conversational WER is 8–12%, not benchmark 2.7%. Surface uncertainty in the UI (e.g., low-confidence words highlighted) rather than presenting transcripts as ground truth.
- Watch Voxtral and Apple’s roadmap. STT is evolving fast; Locara’s runtime should be able to swap STT models without app-author changes. The manifest abstraction is the right level — apps declare “I need real-time STT with diarization in en+es,” runtime picks the model.
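
Several of the learnings above (runtime default with fallback, per-device model pinning, opportunistic ANE use, and diarization as a declared capability) meet in one resolution step: given an app manifest and a device, pick a plan. The sketch below shows one possible shape for that step. Only the model names and the `requires.audioSpeakerDiarization` key come from this document. The device-class labels, the `Device` and `SttPlan` types, the backend names, and the `resolve_stt` function are hypothetical, and nothing here reflects an actual Locara API.

```python
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Tuple

# Model pins from the list above. The device-class keys are hypothetical labels.
PINNED_MODELS: Dict[str, str] = {
    "apple-silicon-pro-max": "whisper-large-v3-turbo-q5",  # M2+ Pro/Max, adequate disk
    "apple-silicon-base": "whisper-medium-q5",             # base M-series
    "intel-mac": "whisper-small-q5",                       # older Intel Macs (v2)
}


@dataclass(frozen=True)
class Device:
    device_class: str    # one of the PINNED_MODELS keys
    apple_silicon: bool
    ane_available: bool  # False when the ANE is busy or the variant is unsupported


@dataclass(frozen=True)
class SttPlan:
    runtime: str
    model: str
    backends: Tuple[str, ...]  # tried in order; later entries are fallbacks
    diarization: bool


def resolve_stt(manifest: Mapping[str, Any], device: Device) -> SttPlan:
    """Pick runtime, pinned model, and backend order for one app on one device."""
    if device.device_class not in PINNED_MODELS:
        raise ValueError(f"no pinned STT model for device class {device.device_class!r}")

    if device.apple_silicon:
        runtime = "WhisperKit"
        # ANE is opportunistic: use it first when present, otherwise start at the GPU.
        backends = ("ane", "gpu", "cpu") if device.ane_available else ("gpu", "cpu")
    else:
        runtime = "whisper.cpp"
        backends = ("cpu",)  # simplifying assumption for non-Apple-Silicon hosts

    requires = manifest.get("requires", {})
    return SttPlan(
        runtime=runtime,
        model=PINNED_MODELS[device.device_class],
        backends=backends,
        diarization=bool(requires.get("audioSpeakerDiarization", False)),
    )
```

Keeping the pin table and the backend order inside the runtime, rather than in each app, is what would let a future model swap happen without app-author changes: the manifest states what the app needs, and only `resolve_stt` knows which weights satisfy it.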
References
- https://github.com/openai/whisper (original Whisper release, Sept 2022)
- https://github.com/ggml-org/whisper.cpp (ggerganov’s port; tracking the GGML org since the 2024 split)
- https://github.com/argmaxinc/WhisperKit (Argmax’s CoreML-optimized Swift implementation)
- https://www.argmaxinc.com/blog/apple-and-argmax (SpeechAnalyzer benchmarking blog post)
- https://github.com/anvanvan/mac-whisper-speedtest (community benchmark comparing Mac Whisper implementations)
- “Whisper Performance on Apple Silicon: M1, M2, M3, M4 Benchmarks” — Voicci blog
- “How Accurate Is Whisper in 2026?” — NovaScribe (WER data by language)
- Voxtral release materials (Mistral AI, 2025)
- See also: `llama-cpp.md` (the underlying ggml family), `mlx.md` (an alternative Whisper substrate via MLX-Whisper)