image-segmentation
HF group: Computer Vision · Status: ❌ not built · Tier 2 (high leverage)
Covers: HF’s image-segmentation AND mask-generation. SAM 2 does both in a single model.
What it is
Image → pixel masks. Two flavors:
- Semantic / panoptic segmentation: fixed label set (Mask2Former-class).
- Mask generation (SAM-style): interactive prompts (point / box; text via a grounding model such as Grounding DINO) → arbitrary masks. The model has no notion of the “what”; it segments whatever you point at.
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| SAM 2 | 38-220 M | 2024-08 | Apache-2.0 | First unified image+video segmenter | Real-time (~44 FPS); about 3× fewer user interactions than prior video segmentation approaches. Includes video tracking via masklets. |
| Mask2Former | 47-216 M | 2022 | MIT | Standard semantic / panoptic segmenter | Closed label set. |
| Grounded-SAM 2 | ~400 M | 2024 | Apache-2.0 | Text query → segmented masks across video | Composes Grounding DINO + SAM 2. |
| SegFormer | 4-85 M | 2021 | NVIDIA-Source | Lightweight | Edge-friendly. |
Infrastructure required
Inference
- ❌ Vision encoder + decoder runtime. SAM 2 is small (38-220 M) — fits ONNX or Candle.
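
To put "small" in numbers, here is a back-of-the-envelope sketch of weight size from the parameter counts quoted above. It counts weights only (no activations, no runtime overhead), and `estimateWeightMB` is an illustrative helper, not part of any runtime.

```typescript
// Rough weight-only footprint: parameter count × bytes per parameter.
// Ignores activations, runtime overhead, and file-format metadata.
type Precision = "fp32" | "fp16" | "int8";

const BYTES_PER_PARAM: Record<Precision, number> = {
  fp32: 4,
  fp16: 2,
  int8: 1,
};

/** Estimated weight size in megabytes (1 MB = 1e6 bytes). */
function estimateWeightMB(paramsMillions: number, precision: Precision): number {
  // (paramsMillions * 1e6 params) * bytes / (1e6 bytes per MB) is a plain product.
  return paramsMillions * BYTES_PER_PARAM[precision];
}

// SAM 2 size range quoted in this document: 38 M to 220 M parameters.
for (const paramsM of [38, 220]) {
  console.log(
    `${paramsM} M params: fp32 ~ ${estimateWeightMB(paramsM, "fp32")} MB, ` +
      `fp16 ~ ${estimateWeightMB(paramsM, "fp16")} MB`,
  );
}
```

Under these assumptions the largest variant stays below 1 GB of weights even in fp32 (220 × 4 = 880 MB), and the smallest is about 152 MB.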
Input
- ❌ Image / video input pipeline.
- Interactive prompt: point / box / text (for SAM-style).
Output
- ❌ Pixel masks (binary or label index).
- ❌ Mask overlay UI in `@locara/components`.
- For video: SAM 2 keeps inference state across frames, so it needs a session API.
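
A minimal sketch of what the overlay step could look like, covering both mask flavors listed above. Assumptions: the mask is one byte per pixel with 0 as background, the image is an RGBA buffer laid out like canvas `ImageData.data`, and `overlayMask` plus the palette are hypothetical names, not an existing `@locara/components` API.

```typescript
type RGB = [number, number, number];

/**
 * Alpha-blend a per-pixel mask over an RGBA image, returning a new buffer.
 * mask[i] === 0 means background (pixel left untouched); any other value is
 * treated as a label and mapped to a palette color. A binary mask is simply
 * the case where the only non-zero label is 1.
 */
function overlayMask(
  rgba: Uint8ClampedArray,
  mask: Uint8Array,
  palette: RGB[],
  opacity = 0.5,
): Uint8ClampedArray {
  if (rgba.length !== mask.length * 4) {
    throw new Error("mask must have exactly one entry per RGBA pixel");
  }
  if (palette.length === 0) {
    throw new Error("palette must contain at least one color");
  }
  const out = new Uint8ClampedArray(rgba); // copy; the input stays unmodified
  for (let i = 0; i < mask.length; i++) {
    const label = mask[i];
    if (label === 0) continue;
    // Labels beyond the palette wrap around rather than failing.
    const color = palette[(label - 1) % palette.length];
    for (let c = 0; c < 3; c++) {
      const p = i * 4 + c;
      out[p] = Math.round(rgba[p] * (1 - opacity) + color[c] * opacity);
    }
    // The alpha channel (offset 3) is kept as-is.
  }
  return out;
}

// 2×1 image: one red pixel, one blue pixel; only the second pixel is masked.
const demoPixels = new Uint8ClampedArray([255, 0, 0, 255, 0, 0, 255, 255]);
const blendedPixels = overlayMask(demoPixels, new Uint8Array([0, 1]), [[0, 255, 0]]);
```

Returning a copy keeps the source frame reusable, which matters if the same frame is re-rendered after each new prompt.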
Storage
- ❌ Weights cache (SAM 2 is small enough that this is trivial).
- Per-session state for video tracking.
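
One possible shape for the per-session state, sketched as an in-memory store with idle eviction. Everything here is an assumption: the class and field names are hypothetical, the model's cross-frame state is treated as an opaque value `S`, and timestamps are passed in explicitly so the logic stays testable.

```typescript
// Hypothetical session record: `state` stands in for whatever the runtime
// keeps between frames for SAM 2; this sketch never inspects it.
interface VideoSession<S> {
  state: S;
  lastFrameIndex: number;
  lastUsedMs: number;
}

class VideoSessionStore<S> {
  private sessions = new Map<string, VideoSession<S>>();
  private nextId = 1;
  private idleTimeoutMs: number;

  constructor(idleTimeoutMs: number) {
    this.idleTimeoutMs = idleTimeoutMs;
  }

  /** Create a session and return its id. */
  open(initialState: S, nowMs: number): string {
    const id = `seg-${this.nextId++}`;
    this.sessions.set(id, { state: initialState, lastFrameIndex: -1, lastUsedMs: nowMs });
    return id;
  }

  /** Store the state produced by processing one frame. */
  update(id: string, frameIndex: number, state: S, nowMs: number): void {
    const session = this.sessions.get(id);
    if (!session) throw new Error(`unknown session: ${id}`);
    session.state = state;
    session.lastFrameIndex = frameIndex;
    session.lastUsedMs = nowMs;
  }

  get(id: string): VideoSession<S> | undefined {
    return this.sessions.get(id);
  }

  close(id: string): boolean {
    return this.sessions.delete(id);
  }

  /** Drop sessions idle longer than the timeout; returns how many were removed. */
  evictIdle(nowMs: number): number {
    let removed = 0;
    this.sessions.forEach((session, id) => {
      if (nowMs - session.lastUsedMs > this.idleTimeoutMs) {
        this.sessions.delete(id);
        removed++;
      }
    });
    return removed;
  }
}
```

Eviction is explicit rather than timer-driven in this sketch, so the host decides when to reclaim memory (for example, when a window closes or on a periodic sweep).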
Interaction (IPC + SDK)
- ❌ `vision.segment({ image })` and `vision.mask_from_prompt({ image, point | box | text })` IPC.
- For video: session-based (`vision.segment_video_*`), similar to voice sessions.
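
The `point | box | text` part of the signature reads as "exactly one prompt kind". Since IPC payloads arrive untyped, one way to enforce that is a small runtime check that yields a tagged union. This is a sketch: `MaskPrompt`, `RawPromptPayload`, and `parsePrompt` are hypothetical names, and the coordinate convention (pixels, box given as two corners) is assumed rather than specified here.

```typescript
// Hypothetical payload types; only the three prompt kinds come from the
// signature above, the field names are illustrative.
type MaskPrompt =
  | { kind: "point"; x: number; y: number }
  | { kind: "box"; x0: number; y0: number; x1: number; y1: number }
  | { kind: "text"; query: string };

/** Raw IPC payload: any of the three keys may be present before validation. */
interface RawPromptPayload {
  point?: { x: number; y: number };
  box?: { x0: number; y0: number; x1: number; y1: number };
  text?: string;
}

/** Enforce "exactly one of point | box | text" and return a tagged prompt. */
function parsePrompt(raw: RawPromptPayload): MaskPrompt {
  const present = [raw.point, raw.box, raw.text].filter((v) => v !== undefined).length;
  if (present !== 1) {
    throw new Error(`expected exactly one prompt, got ${present}`);
  }
  if (raw.point) {
    return { kind: "point", x: raw.point.x, y: raw.point.y };
  }
  if (raw.box) {
    const { x0, y0, x1, y1 } = raw.box;
    if (x1 <= x0 || y1 <= y0) {
      throw new Error("box must have positive width and height");
    }
    return { kind: "box", x0, y0, x1, y1 };
  }
  if (raw.text !== undefined && raw.text.trim() !== "") {
    return { kind: "text", query: raw.text.trim() };
  }
  throw new Error("text prompt must be non-empty");
}

// A single click at pixel (120, 80) becomes a tagged point prompt.
const parsedPoint = parsePrompt({ point: { x: 120, y: 80 } });
```

Downstream code can then `switch` on `kind`, which lets the compiler flag any prompt kind that a handler forgets to cover.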
Capabilities (manifest)
- `capabilities.fs.user-selected` or `device.camera`.
- `capabilities.models[]` for SAM 2.
Gaps
SAM 2 in particular is cheap to run and unlocks many vision app workflows (background removal, object extraction, video object tracking). It is worth elevating in the priority list; it currently sits in Tier 2 of the BACKLOG punch-list.
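
To illustrate why these workflows are cheap once a mask exists, here is a sketch of background removal and object extraction as plain buffer operations. Assumptions: a binary mask with one byte per pixel (non-zero = foreground), an RGBA buffer laid out like canvas `ImageData.data`, and hypothetical helper names.

```typescript
/**
 * Background removal: keep pixels where mask[i] is non-zero and make
 * everything else fully transparent. Returns a new buffer.
 */
function removeBackground(rgba: Uint8ClampedArray, mask: Uint8Array): Uint8ClampedArray {
  if (rgba.length !== mask.length * 4) {
    throw new Error("mask must have exactly one entry per RGBA pixel");
  }
  const out = new Uint8ClampedArray(rgba);
  for (let i = 0; i < mask.length; i++) {
    if (mask[i] === 0) out[i * 4 + 3] = 0; // zero the alpha channel only
  }
  return out;
}

interface MaskBounds {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

/**
 * Object extraction helper: tight bounding box of the foreground, in
 * inclusive pixel coordinates, or null when the mask is empty.
 */
function maskBounds(mask: Uint8Array, width: number): MaskBounds | null {
  let x0 = Infinity;
  let y0 = Infinity;
  let x1 = -1;
  let y1 = -1;
  for (let i = 0; i < mask.length; i++) {
    if (mask[i] === 0) continue;
    const x = i % width; // row-major layout: column index
    const y = Math.floor(i / width); // row index
    if (x < x0) x0 = x;
    if (x > x1) x1 = x;
    if (y < y0) y0 = y;
    if (y > y1) y1 = y;
  }
  return x1 < 0 ? null : { x0, y0, x1, y1 };
}

// 3×2 mask with foreground at (1,0), (1,1), and (2,1).
const demoMask = new Uint8Array([0, 1, 0, 0, 1, 1]);
const demoBounds = maskBounds(demoMask, 3);
```

Both helpers are linear in the pixel count and need nothing beyond the mask, so the model call is the only expensive step in these workflows.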
See also
- `object-detection`: Grounded-SAM 2 chains them
- `image-text-to-text`
- Index: `../modalities-and-models-survey.md`