`image-text-to-image`

HF group: Multimodal · Status: ❌ not built

What it is

Image + text instruction → edited image. This is the “make this person smile” / “remove the watermark” / “add a hat” task. It is distinct from image-to-image, which takes no text instruction and covers pure restoration, super-resolution, and style transfer.

Open-weight models

Model	Params	Released	License	Quality	Notes
FLUX.1 Kontext [dev]	12 B	2025	FLUX.1 [dev] Non-Commercial	Best open-weight edit model	Text-instruction edits over an input image.
Stable Diffusion 3.5 + ControlNet	2.5–8 B	2024	Stability community	Strong with the right ControlNet	Pose / depth / sketch guidance.
InstructPix2Pix	1 B	2023	CreativeML	Older, still useful	Light edit instructions.

Infrastructure required

Inference

❌ Diffusion runtime (shared with text-to-image).
❌ Quantization path to fit 12 B FLUX on a 24 GB Mac.
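
A back-of-the-envelope estimate shows why the quantization path matters. The sketch below assumes a weights-only footprint of parameters × bits per weight / 8. It ignores the text encoders, the VAE, activations, and quantization metadata, so real usage will be higher. The helper names are illustrative, not part of any existing crate or SDK.

```typescript
// Rough weights-only footprint: parameters × bits per weight / 8.
// Assumption: ignores text encoders, VAE, activations, and quantization metadata.
function weightBytes(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8;
}

// Convert bytes to decimal gigabytes.
function toGB(bytes: number): number {
  return bytes / 1e9;
}

const FLUX_PARAMS = 12e9; // 12 B, taken from the model table

for (const bits of [16, 8, 4]) {
  const gb = toGB(weightBytes(FLUX_PARAMS, bits));
  console.log(`${bits}-bit weights: ~${gb.toFixed(0)} GB`);
}
```

At 16 bits the 12 B weights alone come to about 24 GB, the machine's entire memory, while 8-bit and 4-bit weights come to about 12 GB and 6 GB.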

Input

❌ Image input pipeline.
Text instruction.
❌ Mask UI for region-targeted edits (inpainting).
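
As a rough illustration of what a mask UI could hand to the pipeline, the sketch below rasterizes a rectangular selection into a single-channel mask. The `Rect` type, the `rectMask` helper, and the convention that 255 means "edit this pixel" and 0 means "keep" are all assumptions made for the example. A real UI would likely also support brush strokes and feathered edges.

```typescript
// Hypothetical selection rectangle in image pixel coordinates.
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Build a row-major, one-byte-per-pixel mask.
// Assumed convention: 255 = region to edit, 0 = region to keep.
function rectMask(imgW: number, imgH: number, r: Rect): Uint8Array {
  const mask = new Uint8Array(imgW * imgH); // zero-filled by default

  // Clamp the rectangle to the image bounds.
  const x0 = Math.max(0, Math.floor(r.x));
  const y0 = Math.max(0, Math.floor(r.y));
  const x1 = Math.min(imgW, Math.ceil(r.x + r.width));
  const y1 = Math.min(imgH, Math.ceil(r.y + r.height));

  // Fill each covered row; an empty or out-of-bounds rect fills nothing.
  for (let y = y0; y < y1; y++) {
    mask.fill(255, y * imgW + x0, y * imgW + x1);
  }
  return mask;
}

// A 2×2 selection in the middle of a 4×4 image.
const demoMask = rectMask(4, 4, { x: 1, y: 1, width: 2, height: 2 });
console.log(Array.from(demoMask));
```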

Output

❌ Edited image bytes; save to user-folder.
Progressive preview during sampling (streaming partial latents).

Storage

❌ Weights cache.
Output: fs.user-folder for save.

Interaction (IPC + SDK)

❌ image.edit({ image, prompt, mask? }) IPC with progress events.
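
Since this IPC is not built, the following is only a sketch of how its types might look. The request fields mirror `image.edit({ image, prompt, mask? })` from the line above. The event names, the `preview` field, and both helper functions are hypothetical.

```typescript
// Request shape mirroring image.edit({ image, prompt, mask? }).
// Assumption: images and masks travel as encoded bytes.
interface ImageEditRequest {
  image: Uint8Array; // input image bytes
  prompt: string; // text instruction
  mask?: Uint8Array; // optional mask for region-targeted edits
}

// Hypothetical progress events; names and fields are assumptions.
type ImageEditEvent =
  | { kind: "progress"; step: number; totalSteps: number; preview?: Uint8Array }
  | { kind: "done"; image: Uint8Array }
  | { kind: "error"; message: string };

// Check a request before it crosses the IPC boundary.
function validateRequest(req: ImageEditRequest): string[] {
  const problems: string[] = [];
  if (req.image.length === 0) problems.push("image is empty");
  if (req.prompt.trim() === "") problems.push("prompt is empty");
  if (req.mask !== undefined && req.mask.length === 0) {
    problems.push("mask is empty");
  }
  return problems;
}

// Map an event to a fraction in [0, 1] for a progress bar.
// Returns undefined for errors, where no progress value makes sense.
function progressFraction(ev: ImageEditEvent): number | undefined {
  switch (ev.kind) {
    case "progress":
      return ev.totalSteps > 0 ? ev.step / ev.totalSteps : 0;
    case "done":
      return 1;
    case "error":
      return undefined;
  }
}
```

Modelling events as a discriminated union keeps the optional partial-latent preview and the final image in one stream, so a caller can drive both a progress bar and a progressive preview from a single listener.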

Capabilities (manifest)

capabilities.fs.user-selected for input image.
capabilities.fs.user-folder for save.
capabilities.models[] for the diffusion model.
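
The three entries above could be checked with something like the sketch below. Only the capability paths (`capabilities.fs.user-selected`, `capabilities.fs.user-folder`, `capabilities.models[]`) come from this page. The value shapes, the `Manifest` type, the `missingCapabilities` helper, and the model id are placeholders, not the actual manifest schema.

```typescript
// Hypothetical manifest shape; the real schema may differ.
interface Manifest {
  capabilities: {
    fs?: { "user-selected"?: boolean; "user-folder"?: boolean };
    models?: string[];
  };
}

const REQUIRED_FS = ["user-selected", "user-folder"] as const;

// List the capability paths this task needs but the manifest lacks.
function missingCapabilities(m: Manifest): string[] {
  const missing: string[] = [];
  for (const key of REQUIRED_FS) {
    if (!m.capabilities.fs?.[key]) missing.push(`capabilities.fs.${key}`);
  }
  if (!m.capabilities.models || m.capabilities.models.length === 0) {
    missing.push("capabilities.models[]");
  }
  return missing;
}

const exampleManifest: Manifest = {
  capabilities: {
    fs: { "user-selected": true, "user-folder": true },
    models: ["flux.1-kontext-dev"], // placeholder model id
  },
};

console.log(missingCapabilities(exampleManifest));
```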

Gaps

Same as text-to-image: the diffusion runtime crate doesn’t exist yet.
Inpainting UI affordances in @locara/components.
Image input pipeline in the SDK.
