Locara

image-text-to-image

HF group: Multimodal · Status: ❌ not built

What it is

Image + text instruction → edited image. This is the “make this person smile” / “remove the watermark” / “add a hat” task. It is distinct from image-to-image, which takes no text and covers pure restoration, super-resolution, and style transfer.

Open-weight models

| Model | Params | Released | License | Quality | Notes |
| --- | --- | --- | --- | --- | --- |
| FLUX.1 Kontext [dev] | 12 B | 2025 | FLUX-1-dev (non-comm.) | Best open-weight edit model | Text-instruction edits over an input image. |
| Stable Diffusion 3.5 + ControlNet | 2.5-8 B | 2024 | Stability community | Strong with right ControlNet | Pose / depth / sketch guidance. |
| InstructPix2Pix | 1 B | 2023 | CreativeML | Older, still useful | Light edit instructions. |

Infrastructure required

Inference

  • ❌ Diffusion runtime (shared with text-to-image).
  • ❌ Quantization path to fit 12 B FLUX on a 24 GB Mac.
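
Why quantization is on this list can be seen from a back-of-the-envelope estimate. The sketch below counts only the 12 B diffusion weights; text encoders, the VAE, activations, and quantization metadata are deliberately ignored, so real usage would be higher. The function name and the 8-bit / 4-bit formats are illustrative assumptions, not a chosen design.

```typescript
// Rough weight-memory estimate: parameters × bits per weight / 8.
// Assumption: only the 12 B diffusion model is counted; text encoders,
// VAE, activations, and quantization scales are extra.
const BYTES_PER_GB = 1e9; // decimal gigabytes

function weightFootprintGB(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8 / BYTES_PER_GB;
}

const FLUX_PARAMS = 12e9; // from the model table
const MAC_MEMORY_GB = 24; // target machine from the bullet above

// Candidate weight formats (illustrative, not a decision).
const formats = [
  { label: "bf16", bits: 16 },
  { label: "8-bit", bits: 8 },
  { label: "4-bit", bits: 4 },
];

for (const f of formats) {
  const gb = weightFootprintGB(FLUX_PARAMS, f.bits);
  const share = Math.round((gb / MAC_MEMORY_GB) * 100);
  console.log(`${f.label}: ${gb} GB of weights (${share}% of ${MAC_MEMORY_GB} GB)`);
}
```

At 16 bits the weights alone would occupy the full 24 GB, leaving nothing for the OS or activations, which is why an 8-bit or lower path is needed.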

Input

  • ❌ Image input pipeline.
  • Text instruction.
  • ❌ Mask UI for region-targeted edits (inpainting).
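
One way a mask can restrict an edit to a region is a per-pixel blend of the edited result back into the original. The sketch below is a generic compositing step, not necessarily how the runtime will apply masks (diffusion inpainting often applies the mask in latent space as well). `compositeMasked` and the buffer layouts are hypothetical.

```typescript
// Per-pixel blend: out = m * edited + (1 - m) * original, with m in [0, 1].
// Assumed layouts: RGBA images (4 bytes per pixel) and a one-byte-per-pixel
// mask where 255 means "fully editable" and 0 means "keep original".
function compositeMasked(
  original: Uint8ClampedArray,
  edited: Uint8ClampedArray,
  mask: Uint8ClampedArray,
): Uint8ClampedArray {
  if (original.length !== edited.length || original.length !== mask.length * 4) {
    throw new Error("buffer sizes do not match");
  }
  const out = new Uint8ClampedArray(original.length);
  for (let p = 0; p < mask.length; p++) {
    const m = mask[p] / 255;
    for (let c = 0; c < 4; c++) {
      const i = p * 4 + c;
      // Uint8ClampedArray rounds and clamps the blended value to 0..255.
      out[i] = m * edited[i] + (1 - m) * original[i];
    }
  }
  return out;
}
```

A soft-edged mask (values between 0 and 255) would feather the boundary between edited and untouched pixels.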

Output

  • ❌ Edited image bytes; save to user-folder.
  • Progressive preview during sampling (streaming partial latents).

Storage

  • ❌ Weights cache.
  • Output: fs.user-folder for save.

Interaction (IPC + SDK)

  • image.edit({ image, prompt, mask? }) IPC with progress events.
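
A possible shape for this call and its progress events, as a sketch only. The request fields `image`, `prompt`, and `mask?` come from the signature above; everything else (the event variants, `preview`, `EditViewState`, `applyEvent`) is a hypothetical illustration of how a UI might fold progress events into state, including the progressive preview listed under Output.

```typescript
// Request fields mirror image.edit({ image, prompt, mask? }); the byte
// encodings are assumptions.
interface ImageEditRequest {
  image: Uint8Array; // encoded input image
  prompt: string;    // text instruction
  mask?: Uint8Array; // optional region mask for inpainting
}

// Hypothetical event stream: one "progress" per sampling step, then "done".
type ImageEditEvent =
  | { kind: "progress"; step: number; totalSteps: number; preview?: Uint8Array }
  | { kind: "done"; image: Uint8Array };

interface EditViewState {
  fraction: number;         // 0..1, share of sampling steps completed
  shown: Uint8Array | null; // latest preview, or the final image
  finished: boolean;
}

const initialState: EditViewState = { fraction: 0, shown: null, finished: false };

// Pure reducer: folds one event into the view state.
function applyEvent(state: EditViewState, ev: ImageEditEvent): EditViewState {
  if (ev.kind === "done") {
    return { fraction: 1, shown: ev.image, finished: true };
  }
  return {
    fraction: ev.totalSteps > 0 ? ev.step / ev.totalSteps : 0,
    shown: ev.preview ?? state.shown, // keep the last preview if none was sent
    finished: false,
  };
}

// Usage: replay a recorded event sequence.
const events: ImageEditEvent[] = [
  { kind: "progress", step: 1, totalSteps: 4 },
  { kind: "progress", step: 2, totalSteps: 4, preview: new Uint8Array([1]) },
  { kind: "done", image: new Uint8Array([2]) },
];
const finalState = events.reduce(applyEvent, initialState);
console.log(finalState.finished, finalState.fraction);
```

Keeping the reducer pure means the same logic can be driven by whatever transport the IPC layer ends up using.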

Capabilities (manifest)

  • capabilities.fs.user-selected for input image.
  • capabilities.fs.user-folder for save.
  • capabilities.models[] for the diffusion model.
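
For illustration, the three entries above could be declared and checked roughly as follows. Only the capability keys are taken from the list; the manifest shape, the boolean values, the model identifier, and `missingCapabilities` are all assumptions, since the actual manifest schema is not shown here.

```typescript
// Hypothetical manifest shape: only the three capability keys come from
// the list above; the value types are assumed.
interface Manifest {
  capabilities: {
    fs: { "user-selected": boolean; "user-folder": boolean };
    models: string[];
  };
}

const manifest: Manifest = {
  capabilities: {
    fs: { "user-selected": true, "user-folder": true },
    models: ["flux.1-kontext-dev"], // placeholder identifier, not a real registry key
  },
};

// Lists which capabilities an image-editing app would still need to declare.
function missingCapabilities(m: Manifest): string[] {
  const missing: string[] = [];
  if (!m.capabilities.fs["user-selected"]) missing.push("capabilities.fs.user-selected");
  if (!m.capabilities.fs["user-folder"]) missing.push("capabilities.fs.user-folder");
  if (m.capabilities.models.length === 0) missing.push("capabilities.models[]");
  return missing;
}

console.log(missingCapabilities(manifest));
```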

Gaps

  • Same as text-to-image: the diffusion runtime crate doesn’t exist.
  • Inpainting UI affordances in @locara/components.
  • Image input pipeline in the SDK.

See also