audio-classification
HF group: Audio · Status: ❌ not built
What it is
Sound → label. Covers genre detection, emotion classification, acoustic scene classification, bird species ID, alarm detection, etc.
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| BEATs | ~90 M | MIT | MIT | Self-supervised; solid baseline | Microsoft. |
| AST (Audio Spectrogram Transformer) | ~88 M | 2021 | BSD-3 | Fast on CPU | Foundational. |
| YAMNet (TF Hub port) | ~5 M | 2020 | Apache-2.0 | 521 AudioSet classes | Very lightweight. |
Apple Vision (SoundAnalysis) | n/a | macOS | Apple | Built-in scene/sound tags | Native API. |
Infrastructure required
Inference
- ❌ Audio encoder runtime (encoder-only-for-non-text rail).
- ✅ Apple SoundAnalysis would be cheapest first hook — same Swift-sidecar pattern as
locara-vision-ocr.
Input
- ✅ Audio capture / file load.
Output
- Label + confidence; small JSON.
Storage
- ❌ Weights cache (or none for native API).
- App-side: classification results in
locara-storage.
Interaction (IPC + SDK)
- ❌
audio.classify({ audio })IPC.
Capabilities (manifest)
capabilities.device.microphoneorfs.user-selected.capabilities.models[](or none for Apple).
Gaps
Audio encoder runtime. Apple SoundAnalysis would be cheapest
first hook — same pattern as locara-vision-ocr.
See also
voice-activity-detection— adjacent taskaudio-to-embedding- Index:
../modalities-and-models-survey.md