audio-to-audio
HF group: Audio · Status: ❌ not built
What it is
Audio → audio. Covers:
- Denoising — clean speech from noisy recording.
- Voice conversion — make speaker A sound like speaker B.
- Source separation — extract vocals from a mix; isolate instruments.
- Super-resolution — upsample low-rate audio.
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| DEMUCS v4 | ~80 M | 2023 | MIT | Best music source separation | Stems from mix. |
| Open-Unmix | 8 M | 2019 | MIT | Lightweight music sep | Older. |
| DeepFilterNet3 | 2 M | 2023 | MIT | Real-time speech denoising | Edge-friendly. |
| RVC (Retrieval-based VC) | 100-200 M | 2023 | MIT | Voice conversion | Many community variants. |
| kNN-VC | ~100 M | 2023 | MIT | Voice conversion | High quality. |
Infrastructure required
Inference
- ❌ Audio encoder-decoder runtime (varies per model: ONNX / Candle).
- Some models are tiny (DeepFilterNet3 is about 2 M parameters) and should fit on virtually any target device.
Input
- ✅ Audio capture / file load (shared with `speech-to-text`).
Output
- ❌ Audio file save OR streaming back through the audio playback queue (for real-time denoising).
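For the streaming path, capture callbacks rarely deliver chunks that match the frame size a model expects, so some rebuffering step is likely needed before the playback queue. The following is a minimal sketch under stated assumptions: `FrameRebuffer` and `FrameProcessor` are hypothetical names, the frame size is arbitrary, and the processor is a stand-in for a real denoiser.

```typescript
// Hypothetical per-frame processor; a real denoising model would go here.
type FrameProcessor = (frame: Float32Array) => Float32Array;

// Collects capture chunks of any length and emits processed fixed-size frames.
class FrameRebuffer {
  private pending: number[] = [];
  private frameSize: number;
  private process: FrameProcessor;

  constructor(frameSize: number, process: FrameProcessor) {
    if (frameSize <= 0) {
      throw new Error("frameSize must be positive");
    }
    this.frameSize = frameSize;
    this.process = process;
  }

  // Accepts one capture chunk and returns every full frame it completes,
  // already processed and in order. Leftover samples wait for the next chunk.
  push(chunk: Float32Array): Float32Array[] {
    chunk.forEach((sample) => {
      this.pending.push(sample);
    });
    const out: Float32Array[] = [];
    while (this.pending.length >= this.frameSize) {
      const frame = new Float32Array(this.pending.splice(0, this.frameSize));
      out.push(this.process(frame));
    }
    return out;
  }

  // Number of samples still waiting for a full frame.
  get buffered(): number {
    return this.pending.length;
  }
}
```

Each array returned by `push` would be handed to the playback queue in order. A production version would probably use a ring buffer instead of a plain array to avoid per-sample allocation on the audio thread.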
Storage
- ❌ Weights cache.
- Output: `fs.user-folder` (file ops) or in-memory (real-time pre-processor).
Interaction (IPC + SDK)
- ❌ `audio.transform({ input, op })` where `op` selects denoise / sep / VC.
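One way the `audio.transform` call could be typed is sketched below. Only `input` and `op` come from the text above; the `op` string values, `AudioClip`, `TransformResult`, `AudioTransformer`, and the noise-gate backend are all assumptions made for illustration, and the gate is a placeholder rather than a real denoiser.

```typescript
// Assumed op names; the source only says op selects denoise / sep / VC.
type TransformOp = "denoise" | "sep" | "vc";

// Assumed in-memory audio representation: mono PCM samples in [-1, 1].
interface AudioClip {
  sampleRate: number;
  samples: Float32Array;
}

interface TransformRequest {
  input: AudioClip;
  op: TransformOp;
}

// Named outputs: a single "main" track for denoise / vc, several stems for sep.
type TransformResult = Record<string, AudioClip>;

type Backend = (input: AudioClip) => TransformResult;

// Dispatches a request to whichever backend is registered for its op.
class AudioTransformer {
  private backends: Partial<Record<TransformOp, Backend>> = {};

  register(op: TransformOp, backend: Backend): void {
    this.backends[op] = backend;
  }

  transform(req: TransformRequest): TransformResult {
    const backend = this.backends[req.op];
    if (backend === undefined) {
      throw new Error(`no backend registered for op "${req.op}"`);
    }
    return backend(req.input);
  }
}

// Stand-in backend for illustration only: a hard noise gate that zeroes
// samples below the threshold. It is NOT a real denoising model.
function gateBackend(threshold: number): Backend {
  return (input) => ({
    main: {
      sampleRate: input.sampleRate,
      samples: input.samples.map((s) => (Math.abs(s) < threshold ? 0 : s)),
    },
  });
}
```

Returning named outputs keeps one result shape for all three ops, since source separation yields several stems while denoising and voice conversion yield one track.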
Capabilities (manifest)
- `capabilities.device.microphone` (live) or `fs.user-selected` (file).
- `capabilities.fs.user-folder` for save.
- `capabilities.models[]`.
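The capability names above could be checked before a transform runs. This is a sketch only: the nested `Manifest` shape, the `missingCapabilities` helper, and the model identifiers are hypothetical, and it assumes `fs.user-selected` lives under `capabilities` like the other entries.

```typescript
// Assumed manifest shape; the real manifest schema may differ.
interface Manifest {
  capabilities: {
    device?: { microphone?: boolean };
    fs?: { "user-selected"?: boolean; "user-folder"?: boolean };
    models?: string[];
  };
}

type AudioSource = "live" | "file";

// Returns the capability names a manifest still lacks for one transform call.
function missingCapabilities(
  manifest: Manifest,
  source: AudioSource,
  saveToFolder: boolean,
  model: string
): string[] {
  const caps = manifest.capabilities;
  const missing: string[] = [];
  // Live input needs the microphone; file input needs a user-selected file.
  if (source === "live" && !caps.device?.microphone) {
    missing.push("capabilities.device.microphone");
  }
  if (source === "file" && !caps.fs?.["user-selected"]) {
    missing.push("capabilities.fs.user-selected");
  }
  // Saving the result to disk needs folder access.
  if (saveToFolder && !caps.fs?.["user-folder"]) {
    missing.push("capabilities.fs.user-folder");
  }
  // The model must be declared in capabilities.models[].
  if ((caps.models ?? []).indexOf(model) === -1) {
    missing.push(`capabilities.models[]: ${model}`);
  }
  return missing;
}
```

An empty result means the call may proceed; otherwise the list can be surfaced to the app author as the set of manifest entries to add.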
Gaps
The whole stack is missing. The most useful first deliverable is DeepFilterNet3 (about 2 M parameters) for real-time speech denoising: it could plug into the existing voice pipeline as a pre-processor.
See also
- `speech-to-text`: denoising would feed in
- `voice-to-voice`
- `text-to-music`: DEMUCS for source separation
- Index: `../modalities-and-models-survey.md`