object-detection
HF group: Computer Vision · Status: ❌ not built
Covers: HF’s object-detection AND zero-shot-object-detection.
The two share infrastructure; the difference is whether labels
come from a fixed taxonomy or a runtime text query.
What it is
Image → bounding boxes + labels. Zero-shot variant takes the labels at runtime via text prompts (e.g. “find every red car”).
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| DETR / RT-DETR | ~40-80 M | 2020-24 | Apache-2.0 | Strong, fast | Transformer-based; closed label set. |
| YOLOv10 / YOLOv11 | 3-150 M | 2024-25 | AGPL / commercial | Real-time | Multiple sizes. |
| Grounding DINO 1.5 | ~370 M | 2024 | Apache-2.0 | Best zero-shot detector | Open-vocabulary; takes text prompts. |
| OWLv2 | 100-300 M | 2023 | Apache-2.0 | Strong zero-shot | Google. |
| Apple Vision | n/a | macOS | Apple | Object + face + barcode | Native. |
Infrastructure required
Inference
- ❌ Object-detection runtime (encoder + detection head).
- ✅ Apple Vision integration would be cheapest first move (no model download).
Input
- ❌ Image input pipeline.
- Optional text query (zero-shot variant).
Output
- ❌ Bounding boxes + labels + confidences.
- ❌ Box rendering / overlay UI in
@locara/components.
Storage
- ❌ Weights cache.
Interaction (IPC + SDK)
- ❌
vision.detect({ image })andvision.detect_zero_shot({ image, queries })IPC.
Capabilities (manifest)
capabilities.fs.user-selectedordevice.camera.capabilities.models[]for the detector.
Gaps
- Image input pipeline.
- Box overlay component.
- Apple Vision integration would be cheapest first move (no model download).
See also
image-segmentation— Grounded-SAM composes Grounding DINO + SAM 2image-classification- Index:
../modalities-and-models-survey.md