Whisper and the Local STT Landscape

What it is: A survey of speech-to-text (STT) options for local execution on Apple Silicon, anchored on OpenAI Whisper (the open-weights model family) and the runtimes around it: whisper.cpp (ggerganov’s C++ port), WhisperKit (Argmax’s CoreML-optimized version), FluidAudio CoreML, and Apple SpeechAnalyzer (the new system-level API in macOS 26).

Status: Multiple competing runtimes, all maturing fast in 2025–26. Whisper Large-v3 WER is ~2.7% on clean benchmark audio and 8–12% on real-world conversational speech. Apple’s SpeechAnalyzer matches mid-tier Whisper. Voxtral (Mistral AI, 2025) is the newest open-weights STT entrant.

Most relevant to Locara: Locara’s phase-1 reference app Transcribe rides directly on this stack. The choice of runtime, and whether to depend on Apple’s system API or a portable open-weights runtime, is one of the highest-stakes engineering decisions in v1.

Background

OpenAI released Whisper in September 2022 as an open-weights multilingual STT model trained on 680,000 hours of labeled audio. Models came in sizes from tiny (39M params) to large (1.55B), with subsequent updates to large-v2 (Dec 2022), large-v3 (Nov 2023), and large-v3-turbo (Sept 2024; a pruned and fine-tuned large-v3 with a much smaller decoder, faster at a slight quality cost).

Within weeks of Whisper’s release, ggerganov released whisper.cpp as a C/C++ port using GGML (the same tensor library that became the substrate for llama.cpp). whisper.cpp added Apple Neural Engine acceleration via CoreML in 2023, claiming >3× speedup vs CPU-only. By 2026 it’s the most-portable Whisper runtime — CPU (AVX/NEON), Metal, CUDA, OpenCL, Vulkan, and CoreML/ANE backends in one codebase.

WhisperKit (from Argmax, founded by ex-Apple engineers) is a Swift-native, CoreML-optimized Whisper implementation released in 2024. It targets Apple Silicon specifically, leveraging ANE more aggressively than whisper.cpp’s CoreML path. FluidAudio CoreML is a newer entrant in similar territory.

In 2025, Apple announced SpeechAnalyzer as part of macOS 26 — a system-level API for high-quality on-device speech recognition. Argmax benchmarked SpeechAnalyzer as matching mid-tier OpenAI Whisper models on long-form conversational speech, on M4 Mac mini hardware.

In parallel, Voxtral (Mistral AI, 2025) is a newer open-weights speech model that competes with Whisper on transcription quality and has a multilingual focus.

The current options for a Locara Mac app (2026)

Runtime	Substrate	Apple Silicon perf	Cross-platform	Locara fit
whisper.cpp + CoreML	C/C++ + ggml	Good (3× CPU)	Yes (every platform)	Default fallback; portable
WhisperKit	Swift + CoreML	Best (16% over whisper.cpp on some configs)	No (Apple-only)	Best for Mac-native apps
FluidAudio CoreML	Swift + CoreML	Comparable to WhisperKit	No	Alternative; smaller community
Apple SpeechAnalyzer	System API	Comparable to mid-Whisper	No	Zero binary cost; system-level
MLX-Whisper (mlx-examples)	MLX	Underexplored	No	Promising; smaller footprint
OpenAI Python whisper	PyTorch / MPS	Slow (~7–9 tok/s on M-series)	Yes	Reference quality; not for production

Key tradeoffs

Latency vs. quality. tiny and base run in real time on a phone; large-v3 needs a desktop-class GPU or the ANE to keep up with real time. Most Locara use cases want small/medium-class quality at real-time speed on M-series.
Streaming vs. file-mode. Whisper natively transcribes 30-second windows; real-time streaming requires VAD + chunking. whisper.cpp has streaming wrappers; WhisperKit ships streaming first-class.
Diarization gap. Vanilla Whisper doesn’t speaker-label. Add-on systems (pyannote.audio, WhisperX) handle this but increase complexity. Apple SpeechAnalyzer reportedly includes diarization in its newer versions.
Model size on disk. large-v3 is ~3 GB FP16, ~1.5 GB Q5, ~700 MB Q4. Real-world Mac apps usually ship small (244M params, ~470 MB at FP16) or medium (769M params, ~1.5 GB at FP16).
Language coverage. Whisper supports 99 languages; quality drops sharply outside top 20. Apple SpeechAnalyzer is initially English-strong; Voxtral targets broader multilingual parity.
WER baseline. Whisper Large-v3: ~2.7% on LibriSpeech test-clean, ~8–12% on conversational real-world speech.
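To make the WER figures above concrete: word error rate is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch, not a benchmark-grade scorer. It assumes lowercasing and whitespace splitting are enough normalization, which published benchmarks do not; they apply heavier text normalization (punctuation, numerals, contractions) before scoring. The function name and example sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words (not characters), keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a free match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("the" -> "a") and one deletion ("by") over 8 reference words.
wer = word_error_rate(
    "please send the report to alice by friday",
    "please send a report to alice friday",
)
print(f"{wer:.3f}")  # 0.250
```

At 8–12% WER, roughly one word in ten is wrong, which is why a transcript that looks fluent can still misstate a name or a number.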

What worked

Whisper’s open release rebooted the STT space. Pre-Whisper, high-quality STT meant cloud APIs (Google Speech, AWS Transcribe). Whisper put state-of-the-art on a laptop overnight.
whisper.cpp’s portability made Whisper usable on every device class within months of its release.
CoreML acceleration via the ANE turned Apple Silicon into the best-per-watt Whisper platform.
WhisperKit’s Swift-native API filled the gap for Mac-native apps that wanted CoreML without C++ FFI gymnastics.
Apple SpeechAnalyzer provides a “good enough” STT with zero binary cost — apps can use it for casual transcription without shipping any model.
The streaming wrapper ecosystem (whisper-streaming, whisper-live, etc.) made real-time transcription viable.
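The streaming wrappers mentioned above each have their own buffering and stitching logic. As a rough illustration of the underlying idea only (not the algorithm of whisper-streaming, whisper-live, or any other specific wrapper), the sketch below splits a recording into overlapping windows that fit Whisper's 30-second input. The 5-second overlap and the function name are assumptions chosen for the example; real wrappers typically place window boundaries using VAD rather than fixed offsets.

```python
from typing import Iterator, Tuple

SAMPLE_RATE = 16_000  # Whisper models consume 16 kHz mono audio


def window_bounds(
    total_samples: int,
    window_s: float = 30.0,
    overlap_s: float = 5.0,  # illustrative default, not a recommended value
    sample_rate: int = SAMPLE_RATE,
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) sample indices of overlapping windows covering the audio."""
    if not 0 <= overlap_s < window_s:
        raise ValueError("overlap must be non-negative and shorter than the window")
    window = int(window_s * sample_rate)
    hop = window - int(overlap_s * sample_rate)
    start = 0
    while start < total_samples:
        end = min(start + window, total_samples)
        yield start, end
        if end == total_samples:
            break  # the last window reached the end of the audio
        start += hop


# A 70-second recording becomes three windows: 0-30 s, 25-55 s, 50-70 s.
for start, end in window_bounds(70 * SAMPLE_RATE):
    print(start / SAMPLE_RATE, end / SAMPLE_RATE)
```

The overlap exists so that a word cut at one window's edge appears whole in the next; merging the duplicated text from overlapping windows is the part that wrappers handle and this sketch does not.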

What failed / criticisms

Hallucinations on silence and music. Whisper invents plausible-sounding text when the audio has no speech, especially in large-v3. Real-world bug source.
Diarization is still a gap. Most local Whisper deployments don’t speaker-label cleanly without add-on systems.
Languages outside top 20 are weak. Quality drops sharply.
Streaming is a bolt-on, not native. Whisper’s design assumes 30-second windows; real-time wrappers do clever chunking but introduce latency and edge cases.
Quality varies with the runtime. The same Whisper weights produce subtly different results across whisper.cpp, WhisperKit, and the reference Python. Numerical drift, FP16 accumulation, ANE quirks all contribute.
WhisperKit and FluidAudio fragment the Apple-Silicon space. Two competing CoreML-optimized runtimes both targeting the same niche.
Model file sizes are still substantial. large is too big for many consumer apps; small and medium are workable but have a quality ceiling.
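For the silence-hallucination failure listed above, one common mitigation is to gate audio before it reaches the model, so that windows with no speech-like energy are never transcribed. The sketch below is a deliberately simple energy gate using only the standard library. The frame length, the -45 dBFS threshold, and the 10% minimum ratio are hypothetical values that would need tuning on real recordings, and an energy gate does nothing for music, which is loud but not speech; that case needs a trained VAD or a classifier.

```python
import math
from typing import List, Sequence


def frame_levels_dbfs(samples: Sequence[float], frame_len: int) -> List[float]:
    """RMS level of each full frame in dBFS; samples are floats in [-1.0, 1.0]."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        levels.append(20.0 * math.log10(max(rms, 1e-10)))  # floor avoids log10(0)
    return levels


def should_transcribe(
    samples: Sequence[float],
    frame_len: int = 480,            # 30 ms at 16 kHz
    threshold_dbfs: float = -45.0,   # hypothetical; tune on real audio
    min_active_ratio: float = 0.10,  # hypothetical; tune on real audio
) -> bool:
    """Return True when enough frames are louder than the threshold."""
    levels = frame_levels_dbfs(samples, frame_len)
    if not levels:
        return False
    active = sum(1 for level in levels if level > threshold_dbfs)
    return active / len(levels) >= min_active_ratio


one_second_of_silence = [0.0] * 16_000
one_second_of_tone = [0.5 * math.sin(2 * math.pi * 220 * n / 16_000) for n in range(16_000)]
print(should_transcribe(one_second_of_silence))  # False
print(should_transcribe(one_second_of_tone))     # True
```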

Specific learnings for Locara

Default to WhisperKit on Apple Silicon, with whisper.cpp as the cross-platform fallback. This is the same pattern as defaulting to MLX with llama.cpp as the fallback, applied to STT. WhisperKit is the throughput leader on Mac; whisper.cpp gets you Linux/Windows when v2 needs them.
Don’t depend on Apple SpeechAnalyzer for the Transcribe app. It’s tempting (zero binary cost, system-level), but: (a) it ties you to macOS 26+, (b) it’s closed, (c) it doesn’t support the customization a Locara app might want (custom vocabularies, fine-tuning, model-version pinning). Use it as an option the app can declare in its manifest, not as the default.
Pin a specific Whisper model version per device class. As with LLM model pinning: whisper-large-v3-turbo-q5 for M2+ Pro/Max with adequate disk; whisper-medium-q5 for base M-series; whisper-small-q5 for older Intel Macs in v2. The manifest should declare device-class targets.
Diarization is a separate capability, so declare it. Locara’s Transcribe app should declare requires.audioSpeakerDiarization: true in its manifest if it performs diarization, with the runtime selecting an appropriate diarization stack (pyannote, WhisperX, or whatever shipped). Don’t bake it implicitly into the STT capability.
Streaming is a UX feature, not a research project. Use a proven streaming wrapper (whisper.cpp’s stream binary, WhisperKit’s streaming API). Don’t build a custom chunker for v1.
Hallucinations on silence are a real bug to design against. Locara’s Transcribe should detect and suppress hallucinated outputs from silence/music — this is a known failure mode that needs explicit handling, not “ignore and hope.”
Use ANE acceleration, but treat it as opportunistic. ANE gives a ~3× speedup when it’s available and the model fits. Code defensively — fall back to GPU/CPU paths cleanly if ANE is busy or the model variant isn’t supported.
Quantize aggressively for STT. Whisper Q5 is nearly indistinguishable from FP16 for typical use; Q4 is acceptable for most. Disk savings are meaningful (3 GB → 700 MB for large).
Be honest about WER in the UX. Real-world conversational WER is 8–12%, not benchmark 2.7%. Surface uncertainty in the UI (e.g., low-confidence words highlighted) rather than presenting transcripts as ground truth.
Watch Voxtral and Apple’s roadmap. STT is evolving fast; Locara’s runtime should be able to swap STT models without app-author changes. The manifest abstraction is the right level — apps declare “I need real-time STT with diarization in en+es,” runtime picks the model.
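Several of the learnings above (per-device model pins, opportunistic ANE use, diarization as a declared capability, runtime-side model choice) meet in one place: the step where the runtime turns an app's declared needs into a concrete plan. The sketch below shows one way that resolution could look. Everything about its shape is an assumption made for illustration: the device-class keys, the SttRequest and SttPlan types, and the resolve_stt function are hypothetical, and only the pinned model names and the ANE-then-GPU-then-CPU order come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

# Model pins from the learnings above; the device-class keys are hypothetical.
MODEL_PINS: Dict[str, str] = {
    "apple-silicon-pro-max": "whisper-large-v3-turbo-q5",
    "apple-silicon-base": "whisper-medium-q5",
    "intel-mac": "whisper-small-q5",
}

# ANE is preferred but opportunistic: fall through to GPU, then CPU.
BACKEND_PREFERENCE: List[str] = ["ane", "gpu", "cpu"]


@dataclass(frozen=True)
class SttRequest:
    """What an app declares in its manifest (hypothetical shape)."""
    languages: Tuple[str, ...]
    diarization: bool = False


@dataclass(frozen=True)
class SttPlan:
    """What the runtime decides to run (hypothetical shape)."""
    model: str
    backend: str
    languages: Tuple[str, ...]
    diarizer: Optional[str]


def resolve_stt(
    request: SttRequest,
    device_class: str,
    available_backends: Set[str],
    diarizer: Optional[str] = None,
) -> SttPlan:
    """Pick the pinned model and the best available backend for this device."""
    if device_class not in MODEL_PINS:
        raise LookupError(f"no pinned STT model for device class {device_class!r}")
    backend = next((b for b in BACKEND_PREFERENCE if b in available_backends), None)
    if backend is None:
        raise RuntimeError("no usable compute backend for STT")
    if request.diarization and diarizer is None:
        raise RuntimeError("app requires diarization but no diarization stack is installed")
    return SttPlan(
        model=MODEL_PINS[device_class],
        backend=backend,
        languages=request.languages,
        diarizer=diarizer if request.diarization else None,
    )


# ANE is busy or unsupported here, so the plan falls back to the GPU.
plan = resolve_stt(
    SttRequest(languages=("en", "es"), diarization=True),
    device_class="apple-silicon-base",
    available_backends={"gpu", "cpu"},
    diarizer="pyannote",
)
print(plan.model, plan.backend, plan.diarizer)  # whisper-medium-q5 gpu pyannote
```

Keeping the pins in runtime-owned data rather than in the app means a later swap (a newer Whisper build, or Voxtral) changes one table and no app manifest.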

References

https://github.com/openai/whisper (original Whisper release, Sept 2022)
https://github.com/ggml-org/whisper.cpp (ggerganov’s port; tracking the GGML org since the 2024 split)
https://github.com/argmaxinc/WhisperKit (Argmax’s CoreML-optimized Swift implementation)
https://www.argmaxinc.com/blog/apple-and-argmax (SpeechAnalyzer benchmarking blog post)
https://github.com/anvanvan/mac-whisper-speedtest (community benchmark comparing Mac Whisper implementations)
“Whisper Performance on Apple Silicon: M1, M2, M3, M4 Benchmarks” — Voicci blog
“How Accurate Is Whisper in 2026?” — NovaScribe (WER data by language)
Voxtral release materials (Mistral AI, 2025)
See also: llama-cpp.md (the underlying ggml family), mlx.md (an alternative Whisper substrate via MLX-Whisper)