Locara

text-to-audio (sound effects)

HF group: Audio · Status: ❌ not built

What it is

Text → ambient sound, sound effects, foley. Distinct from text-to-music (rhythmic) and text-to-speech (linguistic).

Open-weight models

| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Stable Audio Open 1.0 | ~1 B | 2024-07 | Stability community | Strong on non-musical SFX | Better than AudioLDM / AudioGen for SFX. |
| AudioLDM 2 | ~700 M | 2023 | CC-BY-NC | Solid | Slightly older. |
| MAGNeT-medium-30s | ~1.5 B | 2024 | CC-BY-NC | 7× faster than autoregressive baselines | Non-autoregressive; suits real-time. |
| MOSS-Audio | ~8 B | 2026-04 | Apache-2.0 | Speech / sound / music in one model | Newer entrant; also good for audio-text-to-text. |

Infrastructure required

Inference

  • ❌ Audio diffusion / autoregressive runtime.
  • ❌ Audio codec runtime (typically EnCodec or DAC) for tokenized audio.

Input

  • Plain text prompt.

Output

  • ❌ Audio file save.
  • ❌ Format conversion (WAV → MP3 / FLAC). symphonia is decode-only, so encoding needs libavcodec or a dedicated encoder crate.
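As a rough illustration of the "audio file save" step, the sketch below packs mono float samples into a 16-bit PCM WAV byte array. The function name `encodeWavPcm16` is hypothetical and not part of any existing Locara API. It assumes the runtime hands back samples in the range [-1, 1] at a known sample rate.

```typescript
// Hypothetical helper (not an existing Locara API): encode mono float
// samples in [-1, 1] as a 16-bit PCM WAV file held in memory.
function encodeWavPcm16(samples: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize); // 44-byte canonical header
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // remaining file size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // audio format: 1 = PCM
  view.setUint16(22, 1, true); // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp out-of-range values before scaling to the int16 range.
    const clamped = Math.max(-1, Math.min(1, samples[i] ?? 0));
    view.setInt16(44 + i * bytesPerSample, Math.round(clamped * 32767), true);
  }
  return new Uint8Array(buffer);
}
```

The returned bytes could then be written to the user folder by whatever file API the host exposes. Stereo output and compressed formats would need a separate path.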

Storage

  • ❌ Weights cache.
  • Output: fs.user-folder.

Interaction (IPC + SDK)

  • audio.generate({ prompt, type: 'sfx' }) IPC.
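One possible shape for the request side of this call is sketched below. Only `prompt` and `type: 'sfx'` come from the line above. The type names, the validator, and its error strings are assumptions made for illustration, since the IPC is not built yet.

```typescript
// Hypothetical request type: only `prompt` and `type` appear in the spec.
type AudioGenerateRequest = { prompt: string; type: "sfx" };

type ValidationResult =
  | { ok: true; request: AudioGenerateRequest }
  | { ok: false; error: string };

// Hypothetical validator an IPC handler might run before touching a model.
function validateAudioGenerateRequest(input: unknown): ValidationResult {
  if (typeof input !== "object" || input === null) {
    return { ok: false, error: "request must be an object" };
  }
  const { prompt, type } = input as { prompt?: unknown; type?: unknown };
  if (typeof prompt !== "string" || prompt.trim().length === 0) {
    return { ok: false, error: "prompt must be a non-empty string" };
  }
  if (type !== "sfx") {
    return { ok: false, error: "type must be 'sfx'" };
  }
  // Normalize whitespace so equal prompts compare equal downstream.
  return { ok: true, request: { prompt: prompt.trim(), type: "sfx" } };
}
```

Rejecting bad input at the IPC boundary keeps the (expensive) model load from starting on a request that cannot succeed.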

Capabilities (manifest)

  • capabilities.fs.user-folder write.
  • capabilities.models[].
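A minimal sketch of how an app's manifest might be checked for these two capabilities follows. The manifest layout (`AppManifest`), the access strings, and the model identifier are all assumptions. The source names only `capabilities.fs.user-folder` write and `capabilities.models[]`.

```typescript
// Hypothetical manifest layout inferred from the two capability names above.
type AppManifest = {
  capabilities: {
    fs?: { "user-folder"?: string[] }; // e.g. ["read", "write"] (assumed)
    models?: string[]; // assumed to hold model identifiers
  };
};

// Returns a description of each capability the manifest fails to declare.
function missingCapabilities(manifest: AppManifest, modelId: string): string[] {
  const missing: string[] = [];
  const fs = manifest.capabilities.fs;
  const access = (fs ? fs["user-folder"] : undefined) ?? [];
  if (access.indexOf("write") === -1) {
    missing.push("capabilities.fs.user-folder write");
  }
  const models = manifest.capabilities.models ?? [];
  if (models.indexOf(modelId) === -1) {
    missing.push("capabilities.models[] entry for " + modelId);
  }
  return missing;
}
```

An empty result would mean the app has declared everything this capability needs. A non-empty result lists what to add.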

Gaps

The whole audio-generation stack is missing: a diffusion / autoregressive audio runtime crate, an audio output IPC, and format conversion.

See also