image-text-to-image
HF group: Multimodal · Status: ❌ not built
What it is
Image + text instruction → edited image. This is the “make this person
smile” / “remove the watermark” / “add a hat” task. It is distinct
from image-to-image, which takes no text and covers pure
restoration, super-resolution, and style transfer.
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| FLUX.1 Kontext [dev] | 12 B | 2025 | FLUX-1-dev (non-comm.) | Best open-weight edit model | Text-instruction edits over an input image. |
| Stable Diffusion 3.5 + ControlNet | 2.5-8 B | 2024 | Stability community | Strong with right ControlNet | Pose / depth / sketch guidance. |
| InstructPix2Pix | 1 B | 2023 | CreativeML | Older, still useful | Light edit instructions. |
Infrastructure required
Inference
- ❌ Diffusion runtime (shared with text-to-image).
- ❌ Quantization path to fit 12 B FLUX on a 24 GB Mac.
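
The quantization requirement follows from simple arithmetic on the 12 B parameter count. The sketch below is a rough estimate only: it counts weights and nothing else (no activations, text encoders, or VAE), and the helper names are illustrative rather than part of any existing crate or SDK.

```typescript
// Rough weight-memory estimate for a 12 B model at several precisions.
// Only weights are counted; activations, text encoders, and the VAE add more.

const GIB = 1024 ** 3;

/** Bytes needed to hold `params` weights at `bitsPerWeight` bits each. */
function weightBytes(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8;
}

/** The same figure in GiB, rounded to one decimal place. */
function weightGiB(params: number, bitsPerWeight: number): number {
  return Math.round((weightBytes(params, bitsPerWeight) / GIB) * 10) / 10;
}

const FLUX_PARAMS = 12e9; // 12 B, taken from the model table

const precisions = [16, 8, 4];
for (let i = 0; i < precisions.length; i++) {
  const bits = precisions[i];
  console.log(`${bits}-bit: ~${weightGiB(FLUX_PARAMS, bits)} GiB`);
}
// 16-bit: ~22.4 GiB
// 8-bit: ~11.2 GiB
// 4-bit: ~5.6 GiB
```

A 24 GB Mac shares that memory with the OS and the rest of the pipeline, so 16-bit weights leave essentially no headroom. An 8-bit or 4-bit path is what makes the model plausible on that hardware.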
Input
- ❌ Image input pipeline.
- Text instruction.
- ❌ Mask UI for region-targeted edits (inpainting).
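
One possible shape for the mask that a region-targeted edit would carry is sketched below. Everything here is an assumption for illustration: the `EditMask` type, the one-byte-per-pixel layout, and the "non-zero means edit" rule are not defined anywhere in the project yet.

```typescript
/** Hypothetical single-channel mask: one byte per pixel, non-zero means "edit here". */
interface EditMask {
  width: number;
  height: number;
  data: Uint8Array; // length must equal width * height
}

/** Checks a mask against the image size and returns the fraction of pixels selected. */
function maskCoverage(imageWidth: number, imageHeight: number, mask: EditMask): number {
  if (mask.width !== imageWidth || mask.height !== imageHeight) {
    throw new Error("mask dimensions must match the input image");
  }
  if (mask.data.length === 0 || mask.data.length !== mask.width * mask.height) {
    throw new Error("mask data length must equal width * height and be non-zero");
  }
  let selected = 0;
  for (let i = 0; i < mask.data.length; i++) {
    if (mask.data[i] !== 0) selected++;
  }
  return selected / mask.data.length;
}

// Example: a 4x4 mask with the top-left 2x2 block selected.
const exampleMask: EditMask = { width: 4, height: 4, data: new Uint8Array(16) };
const selectedIndices = [0, 1, 4, 5];
for (let i = 0; i < selectedIndices.length; i++) {
  exampleMask.data[selectedIndices[i]] = 255;
}
console.log(maskCoverage(4, 4, exampleMask)); // 0.25
```

Validating dimensions before the request reaches the runtime keeps a mismatched mask from surfacing as a late sampling failure. A coverage of 0 could also let the UI warn that nothing is selected.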
Output
- ❌ Edited image bytes; save to user-folder.
- Progressive preview during sampling (streaming partial latents).
Storage
- ❌ Weights cache.
- Output: `fs.user-folder` for save.
Interaction (IPC + SDK)
- ❌ `image.edit({ image, prompt, mask? })` IPC with progress events.
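
A minimal sketch of how the `image.edit` call and its progress events might be typed is shown below. Only the `{ image, prompt, mask? }` argument shape comes from this page. The event union, the `mockImageEdit` stand-in, and the synchronous callback are assumptions made to keep the example runnable; a real IPC would be asynchronous.

```typescript
/** Request shape mirroring image.edit({ image, prompt, mask? }). Field types are assumed. */
interface ImageEditRequest {
  image: Uint8Array; // encoded input image bytes
  prompt: string;    // text instruction, e.g. "add a hat"
  mask?: Uint8Array; // optional region mask for inpainting
}

/** Hypothetical events: one per sampling step, then a final result. */
type ImageEditEvent =
  | { kind: "progress"; step: number; totalSteps: number }
  | { kind: "done"; image: Uint8Array };

/** Stand-in for the real backend: emits progress events, then echoes the input bytes. */
function mockImageEdit(
  request: ImageEditRequest,
  totalSteps: number,
  onEvent: (event: ImageEditEvent) => void,
): void {
  if (request.prompt.trim() === "") {
    throw new Error("prompt must not be empty");
  }
  for (let step = 1; step <= totalSteps; step++) {
    onEvent({ kind: "progress", step, totalSteps });
  }
  onEvent({ kind: "done", image: request.image });
}

// Usage: collect events the way a UI progress bar might.
const editEvents: ImageEditEvent[] = [];
mockImageEdit(
  { image: new Uint8Array([1, 2, 3]), prompt: "add a hat" },
  4,
  (e) => { editEvents.push(e); },
);
const lastEvent = editEvents[editEvents.length - 1];
console.log(editEvents.length, lastEvent.kind); // 5 "done"
```

Modeling the events as a discriminated union lets the caller switch on `kind` and get the right payload type for each case.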
Capabilities (manifest)
- `capabilities.fs.user-selected` for input image.
- `capabilities.fs.user-folder` for save.
- `capabilities.models[]` for the diffusion model.
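
As an illustration of how these three capabilities could sit together, here is a sketch of a manifest object with a small check. Only the three capability keys come from this page. The surrounding `Manifest` type, the boolean values, and the model identifier string are placeholders, not the project's actual manifest schema.

```typescript
/** Hypothetical manifest shape; only the three capability keys come from this page. */
interface Manifest {
  capabilities: {
    fs: { "user-selected": boolean; "user-folder": boolean };
    models: string[];
  };
}

const exampleManifest: Manifest = {
  capabilities: {
    fs: {
      "user-selected": true, // read the input image the user picks
      "user-folder": true,   // save the edited image
    },
    models: ["flux.1-kontext-dev"], // placeholder identifier, not a confirmed model ID
  },
};

/** Returns the capability keys an image-edit app needs but the manifest lacks. */
function missingCapabilities(m: Manifest): string[] {
  const missing: string[] = [];
  if (!m.capabilities.fs["user-selected"]) missing.push("fs.user-selected");
  if (!m.capabilities.fs["user-folder"]) missing.push("fs.user-folder");
  if (m.capabilities.models.length === 0) missing.push("models[]");
  return missing;
}

console.log(missingCapabilities(exampleManifest)); // []
```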
Gaps
- Same as text-to-image: the diffusion runtime crate doesn’t exist.
- Inpainting UI affordances in `@locara/components`.
- Image input pipeline in the SDK.
See also
- text-to-image: same runtime
- image-to-image: no-text variant
- Index: `../modalities-and-models-survey.md`