`image-feature-extraction`

HF group: Computer Vision · Status: ❌ not built · Tier 2

What it is

Image → fixed-size float vector (an embedding). Used for visual search, deduplication, training classification heads, and other downstream tasks.
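As a rough illustration of how such vectors get used for visual search and deduplication, the sketch below compares embeddings by cosine similarity and flags near-duplicate pairs by brute force. The function names and the threshold parameter are hypothetical and not part of any existing SDK. A real deployment would query the vector store rather than loop over every pair.

```rust
/// Cosine similarity between two embeddings of equal length.
/// Returns None if the lengths differ, the inputs are empty,
/// or either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Index pairs (i, j), i < j, whose similarity is at or above `threshold`.
/// Brute force: O(n^2) comparisons, fine for small collections only.
pub fn near_duplicates(embeddings: &[Vec<f32>], threshold: f32) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for i in 0..embeddings.len() {
        for j in (i + 1)..embeddings.len() {
            if let Some(sim) = cosine_similarity(&embeddings[i], &embeddings[j]) {
                if sim >= threshold {
                    pairs.push((i, j));
                }
            }
        }
    }
    pairs
}
```

A suitable threshold depends on the encoder and the data, so it is best tuned on a labelled sample rather than fixed up front.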

Open-weight models

Model	Params	Released	License	Quality	Notes
DINOv2 (Small / Base / Large / Giant)	22 M – 1.1 B	2023	Apache-2.0	Self-supervised SOTA visual features	Trained on 142 M curated images, no labels. Strong linear-probe results across tasks.
CLIP / OpenCLIP	86 M – 2 B	2021-23	MIT	Joint text-image	Use when queries are text.
SigLIP	400 M – 1.1 B	2024	Apache-2.0	Better calibrated than CLIP	A common default for text-image embeddings today.

Infrastructure required

Inference

❌ Vision encoder runtime (encoder-only-for-non-text rail).

Input

❌ Image input pipeline.
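A minimal sketch of the last step such a pipeline might perform, assuming the image has already been decoded and resized to the encoder's input resolution. It converts an interleaved RGB8 buffer into a planar, normalized f32 tensor. The function name is hypothetical. The constants are the standard ImageNet channel statistics commonly used with DINOv2. Other encoders use different values, so the model's preprocessor config should be treated as authoritative.

```rust
/// ImageNet channel statistics (assumed here for DINOv2-style preprocessing).
/// Other encoders may expect different values.
pub const IMAGENET_MEAN: [f32; 3] = [0.485, 0.456, 0.406];
pub const IMAGENET_STD: [f32; 3] = [0.229, 0.224, 0.225];

/// Converts an interleaved RGB8 buffer (height x width x 3) into a planar
/// CHW f32 tensor: scales to [0, 1], then normalizes each channel.
/// Returns None if the buffer length does not match the dimensions.
pub fn rgb8_to_chw(
    pixels: &[u8],
    width: usize,
    height: usize,
    mean: [f32; 3],
    std: [f32; 3],
) -> Option<Vec<f32>> {
    let plane = width.checked_mul(height)?;
    if pixels.len() != plane.checked_mul(3)? {
        return None;
    }
    let mut out = vec![0.0f32; plane * 3];
    for (i, px) in pixels.chunks_exact(3).enumerate() {
        for c in 0..3 {
            let v = px[c] as f32 / 255.0;
            // Channel c occupies the c-th contiguous plane of the output.
            out[c * plane + i] = (v - mean[c]) / std[c];
        }
    }
    Some(out)
}
```

Decoding, EXIF orientation handling, and resizing are deliberately left out. They would sit in front of this step.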

Output

❌ Vec<f32> per image. Output dimensionality is model-dependent (for example, 384 for DINOv2 Small and 768 for DINOv2 Base).
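One way a runtime could reduce per-token encoder output to a single Vec<f32> is sketched below, assuming the encoder returns a row-major [num_tokens x dim] matrix. It uses mean pooling followed by L2 normalization. Both are assumptions: some models expose a dedicated class-token or pooled output instead, and the function name is hypothetical.

```rust
/// Mean-pools a row-major [num_tokens x dim] token matrix into one vector,
/// then L2-normalizes it so that a dot product equals cosine similarity.
/// Returns None if `dim` is zero or the buffer is not a whole number of rows.
pub fn pool_and_normalize(tokens: &[f32], dim: usize) -> Option<Vec<f32>> {
    if dim == 0 || tokens.is_empty() || tokens.len() % dim != 0 {
        return None;
    }
    let num_tokens = (tokens.len() / dim) as f32;

    // Sum each column across all token rows, then divide to get the mean.
    let mut pooled = vec![0.0f32; dim];
    for row in tokens.chunks_exact(dim) {
        for (acc, v) in pooled.iter_mut().zip(row) {
            *acc += *v;
        }
    }
    for v in pooled.iter_mut() {
        *v /= num_tokens;
    }

    // Scale to unit length; an all-zero vector is left unchanged.
    let norm = pooled.iter().map(|v| v * v).sum::<f32>().sqrt();
    if norm > 0.0 {
        for v in pooled.iter_mut() {
            *v /= norm;
        }
    }
    Some(pooled)
}
```

The returned vector has length `dim`, so callers can validate it against the dimensionality declared for the model before writing it to the vector store.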

Storage

❌ Weights cache.
✅ Vector store via sqlite-vec (shared with text-to-embedding).
❌ Multi-vector storage (for ColPali-style late interaction): sqlite-vec supports it via SQL, but the SDK doesn’t expose it cleanly.
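To show why multi-vector storage differs from the single-vector case, here is a minimal sketch of late-interaction (MaxSim) scoring, where both the query and the document are represented by many vectors. The function name is hypothetical, and the vectors are assumed to be L2-normalized and of equal dimension. This illustrates the scoring idea only, not how the SDK or sqlite-vec would implement it.

```rust
/// Late-interaction (MaxSim) score: for each query vector, take its best
/// dot product against any document vector, then sum over query vectors.
/// Returns 0.0 if either side has no vectors.
pub fn maxsim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    if query.is_empty() || doc.is_empty() {
        return 0.0;
    }
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| q.iter().zip(d).map(|(x, y)| x * y).sum::<f32>())
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}
```

Because the score needs every vector of a candidate document, the store has to return vectors grouped by document, which is the part a single-vector API does not cover.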

Interaction (IPC + SDK)

❌ embed.image IPC stub reserved in spec, not wired.

Capabilities (manifest)

capabilities.fs.user-selected or device.camera.
capabilities.models[] for the encoder.

Gaps

Image encoder runtime.
Multi-modal (text+image) joint embedding for visual search.
Multi-vector storage if shipping ColPali (see visual-document-retrieval).

See also

text-to-embedding
zero-shot-image-classification — same model class
visual-document-retrieval
Crates: locara-storage (sqlite-vec)
Index: ../modalities-and-models-survey.md