image-to-image (no text instruction)
HF group: Computer Vision · Status: ❌ not built
What it is
Image → image without a text prompt: super-resolution, restoration, denoising, and style transfer based on a reference image. Distinct from image-text-to-image, which takes a natural-language instruction.
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Real-ESRGAN | ~17 M | 2021 | BSD-3 | Lightweight super-res | The default upscaler for many tools. |
| SwinIR | ~12 M | 2021 | Apache-2.0 | Strong restoration | Older, but still good quality. |
| Stable Diffusion + img2img | 2.5-8 B | 2024 | Stability community | Style transfer with strength param | Reference-based. |
| SUPIR | ~13 B | 2024 | Apache-2.0 | Photorealistic restoration | Heavy but striking results. |
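
The "strength param" in the Stable Diffusion row controls how much of the denoising schedule is applied to the input image. A minimal sketch, assuming the convention used by common img2img pipelines (for example Hugging Face diffusers), where roughly `steps * strength` of the scheduled steps are run; the helper name `effectiveSteps` is hypothetical:

```typescript
// Hypothetical helper: number of denoising steps an img2img run executes
// for a given strength, assuming the "steps * strength" convention.
function effectiveSteps(numInferenceSteps: number, strength: number): number {
  if (strength < 0 || strength > 1) {
    throw new RangeError("strength must be in [0, 1]");
  }
  // Low strength keeps the output close to the input image;
  // strength 1 runs the full schedule and mostly ignores the input.
  return Math.floor(numInferenceSteps * strength);
}
```

Under this assumption, a low strength applies only light restyling, while values near 1 behave almost like unconditioned generation.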
Infrastructure required
Inference
- ❌ Lightweight ONNX path (Real-ESRGAN, SwinIR — small enough not to need diffusion runtime).
- ❌ Diffusion runtime for diffusion-based variants.
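
The split between the two inference paths could be expressed as a simple lookup from model to runtime. This is a sketch only: the model ids, the `Runtime` type, and `runtimeFor` are hypothetical names, and the mapping just mirrors the table above.

```typescript
// Which runtime a model needs. All names here are illustrative.
type Runtime = "onnx" | "diffusion";

// Hypothetical model ids mapped to the runtime each one requires.
const MODEL_RUNTIME: Record<string, Runtime> = {
  "real-esrgan": "onnx",     // small enough for the lightweight ONNX path
  "swinir": "onnx",
  "sd-img2img": "diffusion", // needs the diffusion runtime
  "supir": "diffusion",
};

function runtimeFor(modelId: string): Runtime {
  if (!Object.prototype.hasOwnProperty.call(MODEL_RUNTIME, modelId)) {
    throw new Error(`unknown model: ${modelId}`);
  }
  return MODEL_RUNTIME[modelId];
}
```

Keeping the mapping as data means the ONNX path can ship first, with diffusion entries added once that runtime exists.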
Input
- ❌ Image input pipeline.
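
One step an image input pipeline would likely need is converting decoded pixels into the tensor layout a model expects. The sketch below assumes interleaved RGBA bytes in and a planar RGB float tensor (CHW, values in 0..1) out, which is a common layout for super-resolution models; the actual layout depends on the exported model, and `rgbaToChwFloat` is a hypothetical name.

```typescript
// Convert interleaved RGBA bytes (HWC, 0..255) into a planar RGB float
// tensor (CHW, 0..1). Alpha is dropped. The target layout is an assumption.
function rgbaToChwFloat(rgba: Uint8Array, width: number, height: number): Float32Array {
  if (rgba.length !== width * height * 4) {
    throw new RangeError("rgba length does not match width * height * 4");
  }
  const plane = width * height;
  const out = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    out[i] = rgba[i * 4] / 255;                 // R plane
    out[plane + i] = rgba[i * 4 + 1] / 255;     // G plane
    out[2 * plane + i] = rgba[i * 4 + 2] / 255; // B plane
  }
  return out;
}
```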
Output
- ❌ Edited image bytes; save to user-folder.
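
Before the result can be encoded and saved, model output usually has to be turned back into pixel bytes. A minimal sketch, assuming the model emits a planar RGB float tensor (CHW, nominally 0..1) that may slightly overshoot the range; `chwFloatToRgba` is a hypothetical name, and encoding to PNG/JPEG is left out.

```typescript
// Convert a planar RGB float tensor (CHW, nominally 0..1) into interleaved
// RGBA bytes. Values are clamped first, and alpha is set to fully opaque.
function chwFloatToRgba(chw: Float32Array, width: number, height: number): Uint8Array {
  const plane = width * height;
  if (chw.length !== 3 * plane) {
    throw new RangeError("tensor length does not match 3 * width * height");
  }
  const toByte = (v: number): number => Math.round(Math.min(1, Math.max(0, v)) * 255);
  const out = new Uint8Array(plane * 4);
  for (let i = 0; i < plane; i++) {
    out[i * 4] = toByte(chw[i]);                 // R
    out[i * 4 + 1] = toByte(chw[plane + i]);     // G
    out[i * 4 + 2] = toByte(chw[2 * plane + i]); // B
    out[i * 4 + 3] = 255;                        // A (opaque)
  }
  return out;
}
```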
Storage
- ❌ Weights cache.
- Output: `fs.user-folder`.
Interaction (IPC + SDK)
- ❌ `image.transform({ image, op })` IPC, where `op` selects super-res / denoise / etc.
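
To make the request shape concrete, here is a hedged sketch of how the `{ image, op }` payload might be typed and validated on the receiving side. Only the `image` and `op` fields come from the call above; the op names, the use of `Uint8Array` for the image, and `parseTransformRequest` are assumptions for illustration.

```typescript
// Hypothetical op names; the source only lists "super-res / denoise / etc."
type TransformOp = "super-res" | "denoise";

interface TransformRequest {
  image: Uint8Array; // image bytes; the exact encoding is an assumption
  op: TransformOp;
}

const KNOWN_OPS: readonly TransformOp[] = ["super-res", "denoise"];

// Validate an untrusted IPC payload before dispatching it to a runtime.
function parseTransformRequest(payload: unknown): TransformRequest {
  if (typeof payload !== "object" || payload === null) {
    throw new TypeError("payload must be an object");
  }
  const { image, op } = payload as { image?: unknown; op?: unknown };
  if (!(image instanceof Uint8Array)) {
    throw new TypeError("image must be a Uint8Array");
  }
  if (typeof op !== "string" || KNOWN_OPS.indexOf(op as TransformOp) === -1) {
    throw new TypeError(`unsupported op: ${String(op)}`);
  }
  return { image, op: op as TransformOp };
}
```

Rejecting unknown `op` values at the boundary keeps later dispatch code free of defensive checks.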
Capabilities (manifest)
- `capabilities.fs.user-selected` for input.
- `capabilities.fs.user-folder` for output.
- `capabilities.models[]` for the model.
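
As an illustration of how these three capabilities might be checked together, the sketch below guesses a manifest shape from the capability names alone. The `Manifest` interface, the boolean flags, and `missingCapabilities` are all hypothetical; the real manifest schema may differ.

```typescript
// Hypothetical manifest shape inferred from the capability names.
interface Manifest {
  capabilities: {
    fs?: { "user-selected"?: boolean; "user-folder"?: boolean };
    models?: string[];
  };
}

// Return the capabilities this modality needs that the manifest lacks.
function missingCapabilities(m: Manifest): string[] {
  const missing: string[] = [];
  const fs = m.capabilities.fs;
  if (!fs || !fs["user-selected"]) missing.push("capabilities.fs.user-selected");
  if (!fs || !fs["user-folder"]) missing.push("capabilities.fs.user-folder");
  const models = m.capabilities.models;
  if (!models || models.length === 0) missing.push("capabilities.models[]");
  return missing;
}
```

An empty result would mean the app declared everything needed to read an input image, run a model, and write the output.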
Gaps
- Image input pipeline (shared with several other CV modalities).
- Cleanest first deliverable: Real-ESRGAN as ONNX for super-res. It is small and fast, and needs no diffusion runtime.
See also
- image-text-to-image: instruction-driven variant
- text-to-image
- Index: ../modalities-and-models-survey.md