text-to-speech
HF group: Audio · Status: 🟡 partial (system `say` only)
What it is
Text → speech audio.
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Kokoro-82M | 82 M | 2025-01 | Apache-2.0 | #1 on TTS Arena | Sub-300 ms per sentence; no native voice cloning. Best small-footprint pick. |
| F5-TTS | ~330 M | 2024 | CC-BY-NC | High fidelity, voice clones from seconds | NON-COMMERCIAL license. |
| XTTS-v2 | 467 M | 2023 | Coqui CPML | 17-language voice clones | Commercial use requires a separate agreement with Coqui. |
| Bark | ~1 B | 2023 | MIT | Expressive, slow | Useful for prototyping. |
| Piper | ~30 M | 2023 | MIT | Fast, robotic | Edge-device friendly. Per-voice models. |
| Apple AVSpeech / `say` | n/a | macOS | Apple | Excellent on Apple voices | System-provided; what Locara’s voice-pipeline uses today. |
| Apple Speech Synthesizer (Neural Voices) | n/a | macOS 14+ | Apple | Studio-quality | Bigger neural voices on macOS. |
Infrastructure required
Inference
- ✅ macOS `say` / AVSpeech via `crates/locara-voice-pipeline/src/say.rs` (zero-RAM, system voices).
- ❌ Kokoro / Piper integration: would need an MLX or ONNX path.
- ❌ Voice-cloning TTS (F5-TTS / XTTS): blocked on license complexity (non-commercial terms).
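As a rough illustration of the system-voice path, the sketch below shells out to macOS `say` from Rust. It is not the contents of `say.rs`; the function names `say_args` and `speak` are hypothetical. It relies only on documented `say` behavior: `-v` selects a voice, and when no message argument is given the text is read from standard input.

```rust
use std::io::Write;
use std::process::{Command, ExitStatus, Stdio};

/// Arguments for macOS `say`; `voice` is an optional system voice name.
/// (Hypothetical helper, not part of locara-voice-pipeline.)
pub fn say_args(voice: Option<&str>) -> Vec<String> {
    let mut args = Vec::new();
    if let Some(v) = voice {
        args.push("-v".to_string());
        args.push(v.to_string());
    }
    args
}

/// Speak `text` by piping it to `say` on stdin (macOS only).
/// With no message argument, `say` reads the text from standard input,
/// so text that begins with '-' is never mistaken for a flag.
pub fn speak(text: &str, voice: Option<&str>) -> std::io::Result<ExitStatus> {
    let mut child = Command::new("say")
        .args(say_args(voice))
        .stdin(Stdio::piped())
        .spawn()?;
    // Write the text, then drop the handle so `say` sees end-of-input.
    if let Some(mut stdin) = child.stdin.take() {
        stdin.write_all(text.as_bytes())?;
    }
    child.wait()
}
```

Calling `speak("Hello", None)` would use the default system voice. Because synthesis happens in a separate system process, the app itself holds no model weights.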
Input
- Plain UTF-8 text strings, optional voice + locale.
Output
- ✅ Streaming audio playback. `apps/voice` uses sentence chunking (`pumpSentenceTts`) so the first sentence plays while the rest stream.
- ✅ Cancellation + interrupt (barge-in) supported by the pipeline backend.
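The sentence-chunking idea can be sketched as follows. This is a deliberately naive splitter written for illustration, not the logic of `pumpSentenceTts`; the name `split_sentences` and the boundary rule (a `.`, `!` or `?` followed by whitespace or end of input) are assumptions.

```rust
/// Split text into sentence chunks at '.', '!' or '?' followed by whitespace
/// (or end of input). Naive on purpose: abbreviations such as "e.g." also
/// trigger a split.
pub fn split_sentences(text: &str) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        current.push(c);
        let is_terminator = matches!(c, '.' | '!' | '?');
        let at_boundary = chars.peek().map_or(true, |next| next.is_whitespace());
        if is_terminator && at_boundary {
            let sentence = current.trim();
            if !sentence.is_empty() {
                chunks.push(sentence.to_string());
            }
            current.clear();
        }
    }
    // Keep any trailing text that lacks a terminator.
    let rest = current.trim();
    if !rest.is_empty() {
        chunks.push(rest.to_string());
    }
    chunks
}
```

Handing each chunk to the synthesizer in order means playback can begin once the first chunk is ready, rather than after the whole reply has been synthesized.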
Storage
- For ML models: ❌ no weights cached for TTS today (system `say` doesn’t need them).
- ✅ User’s voice preference stored in app data via `locara-storage`.
Interaction (IPC + SDK)
- ✅ Used internally as part of the voice-to-voice pipeline backend; no standalone `tts.speak` command yet.
- ❌ Standalone `tts.speak` IPC for non-voice apps that just want speech output.
Capabilities (manifest)
- ❌ `capabilities.device.speaker` cool-down semantics: pending.
- For ML TTS: `capabilities.models[]` would list the TTS model.
Gaps
- Kokoro / Piper integration, for cross-platform support and for apps that want more control over the voice.
- Voice-cloning TTS (F5-TTS / XTTS) needs a separate non-commercial branch, since F5-TTS is CC-BY-NC and XTTS-v2 is under the Coqui CPML.
- Apple Neural Voices integration (Swift sidecar) is on the BACKLOG.
- Standalone `tts.speak` IPC for non-voice-pipeline apps.
See also
- voice-to-voice: composes TTS with STT + LLM
- speech-to-text
- Crates: `locara-voice-pipeline`
- Index: `../modalities-and-models-survey.md`