image-feature-extraction
HF group: Computer Vision · Status: ❌ not built · Tier 2
What it is
Image → fixed-size float vector. Used for visual search, deduplication, classification-head training, and other downstream tasks.
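As a minimal sketch of how such vectors are typically compared for visual search or deduplication, the snippet below computes cosine similarity between two embeddings. The function names and the idea of a caller-chosen threshold are illustrative assumptions, not part of any existing SDK.

```rust
/// Cosine similarity between two embeddings of equal length.
/// Returns `None` if the lengths differ, the inputs are empty,
/// or either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let mut dot = 0.0f32;
    let mut norm_a = 0.0f32;
    let mut norm_b = 0.0f32;
    for (x, y) in a.iter().zip(b) {
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// Hypothetical dedup check: two images count as near-duplicates when
/// their cosine similarity meets a caller-chosen threshold.
pub fn is_near_duplicate(a: &[f32], b: &[f32], threshold: f32) -> bool {
    cosine_similarity(a, b).map_or(false, |s| s >= threshold)
}
```

A suitable threshold depends on the encoder and the dataset, so it would need to be tuned empirically rather than fixed in code.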
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| DINOv2 (Small / Base / Large / Giant) | 22 M – 1.1 B | 2023 | Apache-2.0 | Self-supervised SOTA visual features | Trained on 142 M curated images, no labels. Strong linear-probe quality across tasks with a frozen backbone. |
| CLIP / OpenCLIP | 86 M – 2 B | 2021-23 | MIT | Joint text-image | Use when text query matters. |
| SigLIP | 400 M – 1.1 B | 2024 | Apache-2.0 | Better calibrated than CLIP | Default text-image embedding model today. |
Infrastructure required
Inference
- ❌ Vision encoder runtime (encoder-only-for-non-text rail).
Input
- ❌ Image input pipeline.
Output
- ❌ `Vec<f32>` per image. Output dimensionality is model-dependent.
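Because the output dimensionality varies by model, a consumer would likely want to validate the vector length before storing it. The sketch below checks the length and L2-normalizes the vector in place. The function name, the `expected_dim` parameter, and the `String` error type are assumptions made for illustration.

```rust
/// L2-normalize an embedding in place after checking its length against
/// the dimensionality expected for the loaded model.
/// `expected_dim` is a hypothetical parameter supplied by the caller.
pub fn normalize_embedding(v: &mut [f32], expected_dim: usize) -> Result<(), String> {
    if v.len() != expected_dim {
        return Err(format!("expected {} dims, got {}", expected_dim, v.len()));
    }
    // Euclidean norm of the vector.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return Err("cannot normalize a zero vector".to_string());
    }
    for x in v.iter_mut() {
        *x /= norm;
    }
    Ok(())
}
```

With unit-length vectors, a plain dot product equals cosine similarity, which keeps later comparisons cheap.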
Storage
- ❌ Weights cache.
- ✅ Vector store via `sqlite-vec` (shared with `text-to-embedding`).
- ❌ Multi-vector storage (for ColPali-style late interaction): `sqlite-vec` supports it via SQL, but the SDK doesn't expose it cleanly.
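To illustrate why late interaction needs multi-vector storage, here is a minimal sketch of MaxSim scoring, the scoring rule generally used by ColBERT- and ColPali-style models. Each document is a set of vectors rather than a single one. The function names are hypothetical, and the code assumes all vectors share one dimensionality.

```rust
/// Dot product of two vectors; assumes equal length.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Late-interaction (MaxSim) score: for each query vector, take the
/// maximum dot product over all document vectors, then sum those maxima.
/// Returns 0.0 when the document has no vectors.
pub fn maxsim_score(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    if doc.is_empty() {
        return 0.0;
    }
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| dot(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}
```

Scoring requires every vector of a candidate document, which is why a single-vector-per-row store is not sufficient on its own.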
Interaction (IPC + SDK)
- ❌ `embed.image` IPC stub reserved in spec, not wired.
Capabilities (manifest)
- `capabilities.fs.user-selected` or `device.camera`.
- `capabilities.models[]` for the encoder.
Gaps
- Image encoder runtime.
- Multi-modal (text+image) joint embedding for visual search.
- Multi-vector storage if shipping ColPali (see `visual-document-retrieval`).
See also
- `text-to-embedding`
- `zero-shot-image-classification`: same model class
- `visual-document-retrieval`
- Crates: `locara-storage` (sqlite-vec)
- Index: `../modalities-and-models-survey.md`