`image-segmentation`

HF group: Computer Vision · Status: ❌ not built · Tier 2 (high leverage)

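Covers: HF’s image-segmentation AND mask-generation. SAM 2 handles mask generation for both images and video in a single model; labeled (semantic / panoptic) output still needs a Mask2Former-class model.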

What it is

Image → pixel masks. Two flavors:

Semantic / panoptic segmentation: fixed label set (Mask2Former-class).
Mask generation (SAM-style): interactive prompts (point / box; text via a grounding model such as Grounded-SAM 2) → arbitrary masks. The model has no notion of the “what”: it segments whatever you point at.

Open-weight models

Model	Params	Released	License	Quality	Notes
SAM 2	38-220 M	2024-08	Apache-2.0	First unified image+video segmenter	Real-time (~44 FPS); 3× fewer interactions than prior video segmentation approaches, and about 6× faster than SAM 1 on images. Includes video tracking via masklets.
Mask2Former	47-216 M	2022	MIT	Standard semantic / panoptic segmenter	Closed label set.
Grounded-SAM 2	~400 M	2024	Apache-2.0	Text query → segmented masks across video	Composes Grounding DINO + SAM 2.
SegFormer	4-85 M	2021	NVIDIA Source Code License (non-commercial)	Lightweight	Edge-friendly; check license terms before shipping.

Infrastructure required

Inference

❌ Vision encoder + decoder runtime. SAM 2 is small (38-220 M parameters), so it should run under ONNX Runtime or Candle.

Input

❌ Image / video input pipeline.
Interactive prompt: point / box / text (for SAM-style).

Output

❌ Pixel masks (binary or label index).
❌ Mask overlay UI in @locara/components.
For video: SAM 2 keeps inference state across frames — needs session API.
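
To make the two output encodings concrete, the sketch below converts a label-index map into per-label binary masks and alpha-blends a binary mask over RGBA pixels, which is roughly what an overlay component would need to do. The type names (`BinaryMask`, `LabelMap`) and function names are illustrative assumptions, not part of any existing Locara API.

```typescript
// Hypothetical mask types; names and memory layout are assumptions.
interface BinaryMask {
  width: number;
  height: number;
  data: Uint8Array; // row-major, 1 = inside the mask, 0 = outside
}

interface LabelMap {
  width: number;
  height: number;
  data: Uint8Array; // row-major, each entry is an index into `labels`
  labels: string[];
}

// Split a label-index map into one binary mask per label.
function splitLabelMap(map: LabelMap): Map<string, BinaryMask> {
  const masks = new Map<string, BinaryMask>();
  map.labels.forEach((label, index) => {
    const data = new Uint8Array(map.data.length);
    for (let i = 0; i < map.data.length; i++) {
      data[i] = map.data[i] === index ? 1 : 0;
    }
    masks.set(label, { width: map.width, height: map.height, data });
  });
  return masks;
}

// Alpha-blend a solid color over the masked pixels of an RGBA buffer.
// Returns a new buffer; the source pixels are left untouched.
function overlayMask(
  rgba: Uint8ClampedArray,
  mask: BinaryMask,
  color: [number, number, number],
  alpha: number,
): Uint8ClampedArray {
  if (rgba.length !== mask.width * mask.height * 4) {
    throw new Error("RGBA buffer size does not match mask dimensions");
  }
  const out = new Uint8ClampedArray(rgba);
  for (let p = 0; p < mask.data.length; p++) {
    if (mask.data[p] === 0) continue;
    for (let c = 0; c < 3; c++) {
      const i = p * 4 + c;
      out[i] = Math.round((1 - alpha) * rgba[i] + alpha * color[c]);
    }
  }
  return out;
}
```

Keeping masks as flat `Uint8Array` buffers avoids per-pixel objects and maps directly onto canvas `ImageData`, though a packed bitset or RLE encoding may be a better fit over IPC for large frames.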

Storage

❌ Weights cache (SAM 2 is small enough that this is trivial).
Per-session state for video tracking.
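
One way to hold per-session tracking state is a small registry keyed by session ID, with an explicit close step so memory is released when a video is done. The sketch below is an assumption about shape only: `VideoSessionStore`, its methods, and the `TrackState` fields are hypothetical, and the real SAM 2 inference state would replace the placeholder.

```typescript
// Hypothetical per-session state; a real runtime would hold SAM 2's
// inference state here instead of this placeholder.
interface TrackState {
  framesSeen: number;
  objectIds: Set<number>;
}

class VideoSessionStore {
  private sessions = new Map<string, TrackState>();
  private nextId = 1;

  // Create a session and return its ID.
  open(): string {
    const id = `seg-${this.nextId++}`;
    this.sessions.set(id, { framesSeen: 0, objectIds: new Set<number>() });
    return id;
  }

  // Record that a frame was processed and which objects were tracked in it.
  advance(id: string, objectIds: number[]): TrackState {
    const state = this.sessions.get(id);
    if (!state) throw new Error(`unknown session: ${id}`);
    state.framesSeen += 1;
    for (const obj of objectIds) state.objectIds.add(obj);
    return state;
  }

  // Drop the session's state; returns false if it was already closed.
  close(id: string): boolean {
    return this.sessions.delete(id);
  }

  // Number of sessions currently holding state.
  get size(): number {
    return this.sessions.size;
  }
}
```

An explicit `close` keeps lifetime under the caller's control; a production version would likely also want an idle timeout so abandoned sessions do not pin memory.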

Interaction (IPC + SDK)

❌ vision.segment({ image }) and vision.mask_from_prompt({ image, point | box | text }) IPC.
For video: session-based (vision.segment_video_*) similar to voice sessions.
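
The `point | box | text` argument reads as "exactly one of these". A minimal TypeScript sketch of that request shape, with a runtime check for callers that bypass the types, might look like the following. The field layouts (pixel `[x, y]` points, `[x0, y0, x1, y1]` boxes), the open-ended `image` type, and the `promptKind` helper are assumptions, not a settled IPC contract.

```typescript
// Hypothetical prompt shapes for vision.mask_from_prompt.
type PointPrompt = { point: [number, number] };
type BoxPrompt = { box: [number, number, number, number] };
type TextPrompt = { text: string };
type MaskPrompt = PointPrompt | BoxPrompt | TextPrompt;

// The image encoding is deliberately left open in this sketch.
type MaskFromPromptRequest = { image: unknown } & MaskPrompt;

type PromptKind = "point" | "box" | "text";

// Return which prompt field is set, rejecting zero or multiple prompts.
function promptKind(request: Record<string, unknown>): PromptKind {
  const kinds: PromptKind[] = ["point", "box", "text"];
  const present = kinds.filter((k) => request[k] !== undefined);
  if (present.length !== 1) {
    throw new Error(`expected exactly one prompt, got ${present.length}`);
  }
  const kind = present[0];
  if (kind === "box") {
    // Assumed convention: top-left corner first, then bottom-right.
    const [x0, y0, x1, y1] = request["box"] as number[];
    if (!(x0 < x1 && y0 < y1)) {
      throw new Error("box must satisfy x0 < x1 and y0 < y1");
    }
  }
  return kind;
}
```

The union type catches most mistakes at compile time, while the runtime check covers requests arriving over IPC from untyped callers.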

Capabilities (manifest)

capabilities.fs.user-selected or device.camera.
capabilities.models[] for SAM 2.

Gaps

SAM 2 in particular is cheap to run and unlocks many vision app workflows (background removal, object extraction, video object tracking). It currently sits in Tier 2 of the BACKLOG punch-list and is worth elevating.
