`audio-to-audio`

HF group: Audio · Status: ❌ not built

What it is

Audio → audio. Covers:

Denoising — clean speech from noisy recording.
Voice conversion — make speaker A sound like speaker B.
Source separation — extract vocals from a mix; isolate instruments.
Super-resolution — upsample low-rate audio.

Open-weight models

Model	Params	Released	License	Quality	Notes
Demucs v4	~80 M	2023	MIT	Best music source separation	Splits a mix into stems.
Open-Unmix	8 M	2019	MIT	Lightweight music separation	Older.
DeepFilterNet3	2 M	2023	MIT	Real-time speech denoising	Edge-friendly.
RVC (Retrieval-based Voice Conversion)	100-200 M	2023	MIT	Voice conversion	Many community variants.
kNN-VC	~100 M	2023	MIT	Voice conversion	High quality.

Infrastructure required

Inference

❌ Audio encoder-decoder runtime (varies per model: ONNX / Candle).
Some models are tiny (DeepFilterNet3 has about 2 M parameters) and should fit on essentially any target device.
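To put "tiny" in numbers: the memory taken by weights is roughly the parameter count times the bytes per parameter. The sketch below is illustrative arithmetic only. It ignores activations and runtime overhead, the function and type names are made up for this page, and the parameter counts are the approximate figures from the model table, not measured values.

```typescript
// Rough weight footprint: parameters × bytes per parameter.
// Ignores activations and runtime overhead. Names here are illustrative.
type Precision = "fp32" | "fp16" | "int8";

const BYTES_PER_PARAM: Record<Precision, number> = { fp32: 4, fp16: 2, int8: 1 };

function weightMegabytes(params: number, precision: Precision): number {
  return (params * BYTES_PER_PARAM[precision]) / 1e6;
}

console.log(weightMegabytes(2e6, "fp32"));  // 8   (a ~2 M parameter denoiser)
console.log(weightMegabytes(80e6, "fp32")); // 320 (a ~80 M parameter separator)
```

At fp32 a 2 M parameter model needs about 8 MB for weights, while an 80 M parameter model needs about 320 MB, which is why the small denoiser is the easier first target.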

Input

✅ Audio capture / file load (shared with speech-to-text).

Output

❌ Audio file save, or streaming back through the audio playback queue (for real-time denoising).

Storage

❌ Weights cache.
Output: fs.user-folder (file ops) or in-memory (real-time pre-processor).

Interaction (IPC + SDK)

❌ audio.transform({ input, op }) where op selects the operation: denoising, source separation, or voice conversion.
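One possible shape for that request, as a sketch. Only the call name and the `input` and `op` fields come from this page. The type names, the op strings, the `kind` discriminator, and the `parseOp` / `validateRequest` helpers are all assumptions made for illustration, since nothing here is built yet.

```typescript
// Hypothetical request shape for audio.transform({ input, op }).
// Everything below except `input` and `op` is an illustrative assumption.
type TransformOp = "denoise" | "separate" | "convert-voice";

type AudioInput =
  | { kind: "file"; path: string } // a user-selected file
  | { kind: "live" };              // microphone capture

interface TransformRequest {
  input: AudioInput;
  op: TransformOp;
}

const OPS: readonly TransformOp[] = ["denoise", "separate", "convert-voice"];

// Narrow an untrusted string (for example, one arriving over IPC) to a known op.
function parseOp(value: string): TransformOp {
  for (const op of OPS) {
    if (op === value) {
      return op;
    }
  }
  throw new Error(`unsupported op: ${value}`);
}

// Validate a raw request before handing it to a model runtime.
function validateRequest(raw: { input: AudioInput; op: string }): TransformRequest {
  const input = raw.input;
  if (input.kind === "file" && input.path.length === 0) {
    throw new Error("file input requires a non-empty path");
  }
  return { input, op: parseOp(raw.op) };
}

const request = validateRequest({ input: { kind: "live" }, op: "denoise" });
console.log(request.op); // denoise
```

Keeping `op` a closed union means an unknown operation is rejected at the IPC boundary instead of reaching a model runtime.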

Capabilities (manifest)

capabilities.device.microphone (live) or fs.user-selected (file).
capabilities.fs.user-folder for save.
capabilities.models[].
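The list above can be read as a small decision rule: the input source picks between microphone and user-selected file access, saving adds folder access, and a model entry is always needed. A minimal sketch of that rule follows. The capability names are taken from the list, but the `Usage` type, the function, the `modelId` field, and the `models[...]` string format are hypothetical, because the manifest schema is not shown here.

```typescript
// Which manifest capabilities a given use of audio-to-audio would need.
// Capability names follow the list above; everything else is an assumption.
interface Usage {
  source: "live" | "file";
  saveToFolder: boolean;
  modelId: string; // hypothetical identifier for an entry in capabilities.models[]
}

function requiredCapabilities(usage: Usage): string[] {
  const caps: string[] = [];
  // Live capture needs the microphone; file input needs a user-selected file.
  caps.push(usage.source === "live" ? "device.microphone" : "fs.user-selected");
  // Saving the result needs access to the user folder.
  if (usage.saveToFolder) {
    caps.push("fs.user-folder");
  }
  // Every use needs the model itself.
  caps.push(`models[${usage.modelId}]`);
  return caps;
}

const liveDenoise = requiredCapabilities({
  source: "live",
  saveToFolder: false,
  modelId: "denoiser",
});
console.log(liveDenoise.join(", "));
```

For a live denoiser that never writes files, this yields only the microphone and model entries, which matches the in-memory pre-processor case described under Storage.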

Gaps

The whole stack is missing. The most useful first deliverable is DeepFilterNet3 (about 2 M parameters) for real-time speech denoising, which could plug into the existing voice pipeline as a pre-processor.
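A pre-processor of this kind mostly has to solve a buffering problem: capture delivers chunks of arbitrary size, while a frame-based denoiser expects fixed-size frames. The sketch below shows only that re-chunking step. `DenoiseFrame` stands in for a call into a model runtime and is hypothetical, as is the class name. The 480-sample frame length (10 ms at 48 kHz) is an assumed value for illustration, not a figure taken from this page.

```typescript
// Sketch of a real-time pre-processor: re-chunk captured audio into
// fixed-size frames and pass each frame through a denoiser.
// `DenoiseFrame` is a hypothetical stand-in for a model runtime call.
type DenoiseFrame = (frame: Float32Array) => Float32Array;

class FramePreProcessor {
  private pending: Float32Array = new Float32Array(0);
  private readonly frameLength: number;
  private readonly denoise: DenoiseFrame;

  constructor(frameLength: number, denoise: DenoiseFrame) {
    this.frameLength = frameLength;
    this.denoise = denoise;
  }

  // Accept a capture chunk of any length and return the denoised samples for
  // every complete frame now available. Leftover samples wait for the next call.
  push(chunk: Float32Array): Float32Array {
    const joined = new Float32Array(this.pending.length + chunk.length);
    joined.set(this.pending, 0);
    joined.set(chunk, this.pending.length);

    const frameCount = Math.floor(joined.length / this.frameLength);
    const out = new Float32Array(frameCount * this.frameLength);
    for (let i = 0; i < frameCount; i++) {
      const start = i * this.frameLength;
      const frame = joined.subarray(start, start + this.frameLength);
      out.set(this.denoise(frame), start);
    }
    this.pending = joined.slice(frameCount * this.frameLength);
    return out;
  }
}

// Usage with a placeholder "denoiser" that only halves the amplitude.
const pre = new FramePreProcessor(480, (frame) => frame.map((s) => s * 0.5));
const first = pre.push(new Float32Array(700));  // 700 samples: one full frame, 220 left over
const second = pre.push(new Float32Array(300)); // 220 + 300 = 520: one more frame, 40 left over
console.log(first.length, second.length); // 480 480
```

The cost of this design is latency of up to one frame, since samples are held until a full frame is available. That is the usual trade-off for frame-based models placed in front of a speech pipeline.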

See also

speech-to-text (denoised audio would feed into it)
voice-to-voice
text-to-music (Demucs for source separation)
Index: ../modalities-and-models-survey.md