keypoint-detection
HF group: Computer Vision · Status: ❌ not built
What it is
Image / video → joint locations (pose). Hands, body, face landmarks. Useful for AR, fitness apps, sign-language recognition.
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| RTMPose | 2-90 M | 2023 | Apache-2.0 | Real-time SOTA | OpenMMLab. |
| ViTPose | 90-660 M | 2022 | Apache-2.0 | High quality, slower | Transformer. |
| MediaPipe Pose / Hands / Face | n/a | n/a | Apache-2.0 | Real-time on CPU/mobile | Battle-tested at Google. |
| Apple Vision (`VNDetectHumanBodyPoseRequest`) | n/a | macOS | Apple | Native, fast | Free. |
Infrastructure required
Inference
- ❌ Pose-detection runtime.
- ✅ Apple Vision integration would be the cheapest first deliverable.
Input
- ❌ Image / video input pipeline.
Output
- ❌ Joint coordinates + confidences.
- ❌ Skeleton overlay UI in `@locara/components`.
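
To make the output requirement concrete, here is a minimal sketch of what "joint coordinates + confidences" could look like and how an overlay might turn them into drawable line segments. The `Joint` shape, the joint names, the `BONES` list, the normalized top-left coordinate convention, and the `minConfidence` default are all assumptions for illustration; none of them are defined by this document or by any specific model.

```typescript
// Hypothetical output shape: the source only specifies "joint coordinates + confidences".
interface Joint {
  name: string;
  x: number; // assumed normalized to [0, 1], origin at top-left
  y: number; // assumed normalized to [0, 1], origin at top-left
  confidence: number; // assumed to be in [0, 1]
}

// A bone connects two joints by name.
type Bone = [string, string];

// Illustrative subset of a body skeleton; a real model defines its own joint set.
const BONES: Bone[] = [
  ["leftShoulder", "leftElbow"],
  ["leftElbow", "leftWrist"],
  ["leftShoulder", "rightShoulder"],
];

interface Segment {
  x1: number;
  y1: number;
  x2: number;
  y2: number;
}

// Convert joints into pixel-space segments, skipping any bone whose
// endpoints are missing or fall below the confidence threshold.
function skeletonSegments(
  joints: Joint[],
  width: number,
  height: number,
  minConfidence = 0.3,
): Segment[] {
  const byName = new Map<string, Joint>();
  for (const joint of joints) byName.set(joint.name, joint);

  const segments: Segment[] = [];
  for (const [a, b] of BONES) {
    const ja = byName.get(a);
    const jb = byName.get(b);
    if (!ja || !jb) continue;
    if (ja.confidence < minConfidence || jb.confidence < minConfidence) continue;
    segments.push({
      x1: ja.x * width,
      y1: ja.y * height,
      x2: jb.x * width,
      y2: jb.y * height,
    });
  }
  return segments;
}
```

An overlay component could then draw each returned segment over the source image. Filtering by confidence before drawing avoids rendering limbs the detector is unsure about.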
Storage
- ❌ Weights cache (or none, for native API).
Interaction (IPC + SDK)
- ❌ `vision.pose({ image })` IPC.
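
As a rough sketch of how the `vision.pose({ image })` call might be typed on the SDK side, the example below wraps an injected transport and validates the reply before returning it. Everything except the `vision.pose` name and the `image` field is hypothetical: the `Invoke` transport signature, the `PoseRequest` and `PoseResult` shapes, and the choice of `Uint8Array` for image bytes are assumptions, not the actual SDK.

```typescript
// Hypothetical request/response shapes for the unbuilt vision.pose IPC.
interface PoseRequest {
  image: Uint8Array; // assumed: encoded image bytes
}

interface PoseJoint {
  name: string;
  x: number;
  y: number;
  confidence: number;
}

interface PoseResult {
  joints: PoseJoint[];
}

// Hypothetical transport: sends a payload on a named channel and resolves with the reply.
type Invoke = (channel: string, payload: unknown) => Promise<unknown>;

// Runtime check that an untyped IPC reply has the expected structure.
function isPoseResult(value: unknown): value is PoseResult {
  if (typeof value !== "object" || value === null) return false;
  const joints = (value as { joints?: unknown }).joints;
  if (!Array.isArray(joints)) return false;
  return joints.every((j) => {
    if (typeof j !== "object" || j === null) return false;
    const c = j as Record<string, unknown>;
    return (
      typeof c["name"] === "string" &&
      typeof c["x"] === "number" &&
      typeof c["y"] === "number" &&
      typeof c["confidence"] === "number"
    );
  });
}

// Thin client wrapper: forwards the request and rejects malformed replies.
async function pose(invoke: Invoke, request: PoseRequest): Promise<PoseResult> {
  const reply = await invoke("vision.pose", request);
  if (!isPoseResult(reply)) {
    throw new Error("vision.pose returned a malformed result");
  }
  return reply;
}
```

Injecting the transport keeps the wrapper testable without a running sidecar, and validating at the boundary means a native or model-backed implementation can be swapped in without changing callers.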
Capabilities (manifest)
- `capabilities.fs.user-selected` or `device.camera`.
- `capabilities.models[]`.
Gaps
Same as other CV modalities: image input, overlay UI, encoder-only inference path.
See also
- object-detection
- Crates: `locara-vision-ocr` (template for Swift-sidecar pattern)
- Index: `../modalities-and-models-survey.md`