Locara

image-segmentation

HF group: Computer Vision · Status: ❌ not built · Tier 2 (high leverage)

Covers: HF’s image-segmentation AND mask-generation. SAM 2 can serve both from a single model, although its masks are class-agnostic (no labels).

What it is

Image → pixel masks. Two flavors:

  • Semantic / panoptic segmentation: fixed label set (Mask2Former-class).
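  • Mask generation (SAM-style): interactive prompts (point / box / text) → arbitrary masks. The model is class-agnostic: it has no notion of the “what” and segments whatever the prompt indicates. Text prompts are not native to SAM 2; they need a grounding model in front, as in Grounded-SAM 2.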

Open-weight models

| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| SAM 2 | 38-220 M | 2024-08 | Apache-2.0 | First unified image+video segmenter | Real-time (~44 FPS); 3× fewer prompts than SAM 1. Includes video tracking via masklets. |
| Mask2Former | 47-216 M | 2022 | MIT | Standard semantic / panoptic segmenter | Closed label set. |
| Grounded-SAM 2 | ~400 M | 2026 | Apache-2.0 | Text query → segmented masks across video | Composes Grounding DINO + SAM 2. |
| SegFormer | 4-85 M | 2021 | NVIDIA-Source | Lightweight | Edge-friendly. |

Infrastructure required

Inference

  • ❌ Vision encoder + decoder runtime. SAM 2 is small (38-220 M parameters), so it should fit an ONNX or Candle runtime.

Input

  • ❌ Image / video input pipeline.
  • Interactive prompt: point / box / text (for SAM-style).

Output

  • ❌ Pixel masks (binary or label index).
  • Mask overlay UI in @locara/components.
  • For video: SAM 2 keeps inference state across frames — needs session API.
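
The two mask encodings named above (binary and label index) can be related with a small helper. The following is a minimal sketch, not part of any existing Locara API: the `LabelMask` shape and the `binaryMaskFor` and `overlayMask` names are hypothetical, and it assumes row-major masks and RGBA pixel buffers such as those a canvas produces.

```typescript
// Hypothetical mask helpers; none of these names come from the Locara SDK.

/** A label-index mask: one class id per pixel, row-major, width * height entries. */
interface LabelMask {
  width: number;
  height: number;
  labels: Uint8Array;
}

/** Extract a binary mask (1 = pixel carries `label`, 0 = anything else). */
function binaryMaskFor(mask: LabelMask, label: number): Uint8Array {
  const out = new Uint8Array(mask.labels.length);
  for (let i = 0; i < mask.labels.length; i++) {
    out[i] = mask.labels[i] === label ? 1 : 0;
  }
  return out;
}

/** Blend a solid color into RGBA pixels wherever the binary mask is set. */
function overlayMask(
  rgba: Uint8ClampedArray,
  binary: Uint8Array,
  color: [number, number, number],
  alpha: number = 0.5,
): Uint8ClampedArray {
  if (rgba.length !== binary.length * 4) {
    throw new Error("RGBA buffer and mask sizes do not match");
  }
  const out = new Uint8ClampedArray(rgba); // copy; the input buffer is left untouched
  for (let p = 0; p < binary.length; p++) {
    if (binary[p] === 0) continue;
    // Blend only the R, G, B channels; the alpha channel (offset 3) is preserved.
    for (let c = 0; c < 3; c++) {
      const i = p * 4 + c;
      out[i] = Math.round((1 - alpha) * rgba[i] + alpha * color[c]);
    }
  }
  return out;
}
```

A semantic segmenter would typically hand back one `LabelMask`, while a SAM-style prompt would hand back a binary mask directly, so an overlay component only needs the binary form.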

Storage

  • ❌ Weights cache (SAM 2 is small enough that this is trivial).
  • Per-session state for video tracking.
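
Per-session state for video tracking could be held in a small in-memory store with idle eviction. This is a sketch under stated assumptions: the `VideoSessionState` fields, the `VideoSessionStore` class, and the idle-timeout policy are all hypothetical, and the `memory` map merely stands in for whatever tensors the actual runtime keeps between frames.

```typescript
// Hypothetical per-session store; the real session shape is not specified in this document.

interface VideoSessionState {
  modelId: string;
  lastFrameIndex: number;
  lastTouchedMs: number;
  // Placeholder for per-object tracking memory kept between frames.
  memory: Map<number, Float32Array>;
}

class VideoSessionStore {
  private sessions = new Map<string, VideoSessionState>();
  private nextId = 1;
  private idleTimeoutMs: number;

  constructor(idleTimeoutMs: number) {
    this.idleTimeoutMs = idleTimeoutMs;
  }

  /** Create a session and return its id. */
  open(modelId: string, nowMs: number): string {
    const id = `seg-${this.nextId++}`;
    this.sessions.set(id, {
      modelId,
      lastFrameIndex: -1,
      lastTouchedMs: nowMs,
      memory: new Map(),
    });
    return id;
  }

  /** Record activity on a session so it is not evicted as idle. */
  touch(id: string, frameIndex: number, nowMs: number): VideoSessionState {
    const s = this.sessions.get(id);
    if (s === undefined) throw new Error(`unknown session: ${id}`);
    s.lastFrameIndex = frameIndex;
    s.lastTouchedMs = nowMs;
    return s;
  }

  /** Remove a session; returns false if it did not exist. */
  close(id: string): boolean {
    return this.sessions.delete(id);
  }

  /** Drop sessions idle for longer than the timeout; returns the evicted ids. */
  evictIdle(nowMs: number): string[] {
    const evicted: string[] = [];
    this.sessions.forEach((s, id) => {
      if (nowMs - s.lastTouchedMs > this.idleTimeoutMs) {
        this.sessions.delete(id);
        evicted.push(id);
      }
    });
    return evicted;
  }
}
```

Passing the clock in as `nowMs` keeps the store deterministic and easy to test; a host process would call `evictIdle(Date.now())` on a timer.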

Interaction (IPC + SDK)

  • vision.segment({ image }) and vision.mask_from_prompt({ image, point | box | text }) IPC.
  • For video: session-based (vision.segment_video_*) similar to voice sessions.
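
The `point | box | text` notation above suggests that a request carries exactly one prompt. One way to enforce that at the IPC boundary is to normalize the raw payload into a discriminated union. The sketch below is illustrative only: `RawMaskRequest`, `MaskPrompt`, and `parsePrompt` are hypothetical names, the coordinate fields are assumptions, and the exactly-one rule is an inference from the notation rather than a documented contract.

```typescript
// Hypothetical request shapes; everything beyond image/point/box/text is an assumption.

type Point = { x: number; y: number };
type Box = { x: number; y: number; width: number; height: number };

type MaskPrompt =
  | { kind: "point"; point: Point }
  | { kind: "box"; box: Box }
  | { kind: "text"; text: string };

interface RawMaskRequest {
  image: Uint8Array;
  point?: Point;
  box?: Box;
  text?: string;
}

/** Normalize a raw request into exactly one prompt, rejecting zero or several. */
function parsePrompt(req: RawMaskRequest): MaskPrompt {
  const prompts: MaskPrompt[] = [];
  if (req.point !== undefined) prompts.push({ kind: "point", point: req.point });
  if (req.box !== undefined) prompts.push({ kind: "box", box: req.box });
  if (req.text !== undefined) prompts.push({ kind: "text", text: req.text });

  const first = prompts[0];
  if (first === undefined || prompts.length !== 1) {
    throw new Error(`expected exactly one prompt, got ${prompts.length}`);
  }
  if (first.kind === "box" && (first.box.width <= 0 || first.box.height <= 0)) {
    throw new Error("box must have positive width and height");
  }
  if (first.kind === "text" && first.text.trim() === "") {
    throw new Error("text prompt must not be empty");
  }
  return first;
}
```

Downstream code can then switch on `kind`, for example routing `"text"` prompts through a grounding model before the segmenter while sending `"point"` and `"box"` prompts to it directly.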

Capabilities (manifest)

  • capabilities.fs.user-selected or device.camera.
  • capabilities.models[] for SAM 2.

Gaps

SAM 2 in particular is cheap to run and unlocks many vision app workflows (background removal, object extraction, video object tracking). It is worth elevating in the priority list; it currently sits in Tier 2 of the BACKLOG punch-list.

See also