text-to-audio (sound effects)
HF group: Audio · Status: ❌ not built
What it is
Text → ambient sound, sound effects, and foley. Distinct from text-to-music (rhythmic) and text-to-speech (linguistic).
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Stable Audio Open 1.0 | ~1 B | 2024-07 | Stability community | Strong on non-musical SFX | Better than AudioLDM / AudioGen for SFX. |
| AudioLDM 2 | ~700 M | 2023 | CC-BY-NC | Solid | Slightly older. |
| MAGNeT-medium-30s | ~1.5 B | 2024 | CC-BY-NC | 7× faster than autoregressive baselines | Non-autoregressive; suits real-time. |
| MOSS-Audio | ~8 B | 2026-04 | Apache-2.0 | Speech / sound / music in one model | Newer entrant; also good for audio-text-to-text. |
Infrastructure required
Inference
- ❌ Audio diffusion / autoregressive runtime.
- ❌ Audio codec runtime (typically EnCodec or DAC) for tokenized audio.
Input
- Plain text prompt.
Output
- ❌ Audio file save.
- ❌ Format conversion (WAV → MP3 / FLAC via symphonia or libavcodec).
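The "audio file save" step can be illustrated with a minimal sketch that wraps raw float samples in a 16-bit PCM WAV container. This is an illustration only, not the project's implementation: the function name `encodeWav` and its parameters are hypothetical, and it assumes the model runtime hands back interleaved float samples in the range [-1, 1].

```typescript
// Hypothetical helper: wrap float samples in a canonical 44-byte-header WAV file.
// Assumes interleaved samples in [-1, 1]; out-of-range values are clamped.
function encodeWav(
  samples: Float32Array,
  sampleRate: number,
  channels: number = 1
): Uint8Array {
  const bytesPerSample = 2; // 16-bit PCM
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  // RIFF header (all multi-byte fields are little-endian).
  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");

  // "fmt " chunk describing uncompressed PCM.
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * channels * bytesPerSample, true); // byte rate
  view.setUint16(32, channels * bytesPerSample, true); // block align
  view.setUint16(34, 8 * bytesPerSample, true); // bits per sample

  // "data" chunk with the samples themselves.
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);
  samples.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    view.setInt16(44 + i * bytesPerSample, Math.round(clamped * 32767), true);
  });

  return new Uint8Array(buffer);
}

// Example: 0.1 s of a 440 Hz tone at 44.1 kHz, mono.
const demoRate = 44100;
const demoTone = new Float32Array(4410);
for (let i = 0; i < demoTone.length; i++) {
  demoTone[i] = 0.5 * Math.sin((2 * Math.PI * 440 * i) / demoRate);
}
const demoWav = encodeWav(demoTone, demoRate);
```

Here `demoWav` holds 8,864 bytes (a 44-byte header plus 4,410 two-byte samples). Conversion to MP3 or FLAC would still need a separate encoder, which this sketch does not cover.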
Storage
- ❌ Weights cache.
- Output: fs.user-folder.
Interaction (IPC + SDK)
- ❌ audio.generate({ prompt, type: 'sfx' }) IPC.
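Since this IPC call is not built yet, its contract is open. As a rough sketch of what the host side might check before dispatching a request, the snippet below validates the two fields named above. Everything beyond `prompt` and `type: 'sfx'` is an assumption: the `AudioGenerateRequest` interface, the `parseAudioGenerateRequest` name, and the rule that prompts are trimmed and must be non-empty.

```typescript
// Hypothetical request shape; only `prompt` and `type: 'sfx'` come from the survey.
interface AudioGenerateRequest {
  prompt: string;
  type: "sfx";
}

// Validate an untrusted IPC payload and return a normalized request.
// Throws TypeError on malformed input (an assumed error convention).
function parseAudioGenerateRequest(input: unknown): AudioGenerateRequest {
  if (typeof input !== "object" || input === null) {
    throw new TypeError("audio.generate: request must be an object");
  }
  const candidate = input as { prompt?: unknown; type?: unknown };

  if (typeof candidate.prompt !== "string" || candidate.prompt.trim().length === 0) {
    throw new TypeError("audio.generate: prompt must be a non-empty string");
  }
  if (candidate.type !== "sfx") {
    throw new TypeError("audio.generate: type must be 'sfx'");
  }

  return { prompt: candidate.prompt.trim(), type: "sfx" };
}

// Example payload as it might arrive over IPC.
const demoRequest = parseAudioGenerateRequest({
  prompt: "  rain on a tin roof  ",
  type: "sfx",
});
```

Validating at the IPC boundary keeps malformed prompts out of the (future) audio runtime; if text-to-music ends up sharing the same call, the `type` union would be the natural place to extend.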
Capabilities (manifest)
- capabilities.fs.user-folder write.
- capabilities.models[].
Gaps
The whole audio-generation stack is missing: a diffusion / autoregressive audio runtime crate, audio output IPC, and format conversion.
See also
- text-to-music: same infra
- text-to-speech
- Index: ../modalities-and-models-survey.md