audio-to-embedding
HF group: Audio · Status: ❌ not built
What it is
Audio → fixed-size float vector. Used for audio search,
classification, similarity. The audio analog of
text-to-embedding.
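As an illustration of the similarity and search use cases, the sketch below compares fixed-size embeddings with cosine similarity and ranks a small corpus against a query. The function names `cosine_similarity` and `rank_by_similarity` are hypothetical and not part of any existing crate; cosine similarity is assumed as the metric, which is common for CLAP-style embeddings but is not specified by this document.

```rust
/// Cosine similarity between two embeddings of equal length.
/// Returns None when the lengths differ, the input is empty,
/// or either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Indices of `corpus`, sorted by descending similarity to `query`.
/// Entries that cannot be compared (wrong length, zero norm) are skipped.
pub fn rank_by_similarity(query: &[f32], corpus: &[Vec<f32>]) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = corpus
        .iter()
        .enumerate()
        .filter_map(|(i, v)| cosine_similarity(query, v).map(|s| (i, s)))
        .collect();
    scored.sort_by(|x, y| y.1.total_cmp(&x.1));
    scored.into_iter().map(|(i, _)| i).collect()
}
```

A brute-force scan like this is only meant to show the idea; for larger collections the vector store would normally perform the ranking.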
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| CLAP (LAION) | ~636 M | 2023 | CC0 | Standard text-audio joint embedding | LAION CLAP variants are most-used. |
| LAION-CLAP-music | ~636 M | 2023 | CC0 | Music-tuned | Better for music search. |
| MERT-v1-330M | 330 M | 2023 | CC-BY-NC-4.0 | Music-focused | Self-supervised on music. Non-commercial license. |
| Wav2Vec2-Large | 317 M | 2020 | Apache-2.0 | Speech features | Foundational, but its features are usually fine-tuned for downstream tasks rather than used directly as clip embeddings. |
Infrastructure required
Inference
- ❌ Audio encoder runtime (encoder-only-for-non-text rail).
Input
- ✅ Audio capture / file load (shared with speech-to-text).
Output
- ❌ Vec<f32> per audio clip.
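Audio encoders often emit one feature vector per frame, so producing a single Vec<f32> per clip needs a pooling step. The sketch below shows one common option, mean pooling followed by L2 normalization. This is an assumption for illustration: the pooling strategy is not specified here, some models (CLAP-class ones included) ship their own projection head, and `pool_and_normalize` is a hypothetical name.

```rust
/// Mean-pool frame-level features (frames x dim) into one vector,
/// then L2-normalize it so that dot products equal cosine similarity.
/// Returns None for empty input or frames of inconsistent length.
pub fn pool_and_normalize(frames: &[Vec<f32>]) -> Option<Vec<f32>> {
    let dim = frames.first()?.len();
    if dim == 0 || frames.iter().any(|f| f.len() != dim) {
        return None;
    }

    // Sum every frame into the accumulator, then divide by the frame count.
    let mut pooled = vec![0.0f32; dim];
    for frame in frames {
        for (acc, v) in pooled.iter_mut().zip(frame) {
            *acc += *v;
        }
    }
    let n = frames.len() as f32;
    for acc in pooled.iter_mut() {
        *acc /= n;
    }

    // L2-normalize; an all-zero vector is returned unchanged.
    let norm = pooled.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for acc in pooled.iter_mut() {
            *acc /= norm;
        }
    }
    Some(pooled)
}
```

Normalizing at write time keeps the stored vectors directly comparable regardless of clip length.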
Storage
- ❌ Weights cache.
- ✅ Vector store via sqlite-vec (shared with text-to-embedding).
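To show how an embedding could be handed to the vector store, the sketch below packs a slice of f32 values into a compact byte blob and unpacks it again. sqlite-vec accepts float vectors as raw 32-bit float BLOBs; little-endian byte order is assumed here, which matches the native order on x86-64 and AArch64. The helper names `embedding_to_blob` and `blob_to_embedding` are hypothetical and do not describe the locara-storage API.

```rust
/// Pack an embedding into a byte blob of consecutive little-endian f32 values.
pub fn embedding_to_blob(v: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(v.len() * 4);
    for x in v {
        out.extend_from_slice(&x.to_le_bytes());
    }
    out
}

/// Unpack a blob written by `embedding_to_blob`.
/// Returns None when the blob length is not a multiple of 4 bytes.
pub fn blob_to_embedding(blob: &[u8]) -> Option<Vec<f32>> {
    if blob.len() % 4 != 0 {
        return None;
    }
    Some(
        blob.chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}
```

The resulting bytes would be bound as a BLOB parameter when inserting into the vector table; the table schema and embedding dimension are left out because they depend on the chosen model.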
Interaction (IPC + SDK)
- ❌ embed.audio IPC stub reserved in spec, not implemented.
Capabilities (manifest)
capabilities.device.microphone or fs.user-selected. capabilities.models[].
Gaps
Audio encoder runtime: either a locara-audio-embed crate or an extension
to locara-llama for CLAP-class encoder support.
See also
- text-to-embedding
- audio-classification (same encoder class)
- Crates: locara-storage (vector store)
- Index: ../modalities-and-models-survey.md