Locara

image-feature-extraction

HF group: Computer Vision · Status: ❌ not built · Tier 2

What it is

Maps an image to a fixed-size float vector (an embedding). Used for visual search, deduplication, training classification heads, and other downstream tasks.
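Visual search and deduplication both reduce to comparing two of these vectors. A minimal sketch of that comparison, assuming cosine similarity as the metric; the function names and the `threshold` parameter are hypothetical and not part of the spec:

```rust
/// Cosine similarity between two embeddings of equal length.
/// Returns None if the lengths differ, the input is empty,
/// or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Treats two images as near-duplicates when their similarity meets the
/// threshold. The threshold is a hypothetical tuning knob (assumption),
/// not a value taken from the spec.
fn is_near_duplicate(a: &[f32], b: &[f32], threshold: f32) -> bool {
    cosine_similarity(a, b).map_or(false, |s| s >= threshold)
}
```

A suitable threshold depends on the encoder and the dataset, so it would need to be tuned rather than hard-coded.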

Open-weight models

Model | Params | Released | License | Quality | Notes
----- | ------ | -------- | ------- | ------- | -----
DINOv2 (Small / Base / Large / Giant) | 22 M – 1.1 B | 2023 | Apache-2.0 | Self-supervised SOTA visual features | Trained on 142 M curated images, no labels. Linear-probe quality across tasks.
CLIP / OpenCLIP | 86 M – 2 B | 2021-23 | MIT | Joint text-image | Use when text query matters.
SigLIP | 400 M – 1.1 B | 2024 | Apache-2.0 | Better calibrated than CLIP | Default text-image embed today.

Infrastructure required

Inference

  • ❌ Vision encoder runtime (encoder-only-for-non-text rail).

Input

  • ❌ Image input pipeline.
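As a rough illustration of what the last stage of such a pipeline might produce, the sketch below converts an already decoded and resized RGB buffer into a normalized planar tensor. It assumes ImageNet-style mean/std constants, which DINOv2-style encoders commonly use; CLIP-family models use different constants. Decoding and resizing are out of scope here, and `rgb_to_chw` is a hypothetical name.

```rust
/// Per-channel normalization constants (ImageNet-style).
/// Assumption: the target encoder expects these values.
const MEAN: [f32; 3] = [0.485, 0.456, 0.406];
const STD: [f32; 3] = [0.229, 0.224, 0.225];

/// Converts interleaved 8-bit RGB pixels (HWC layout) into a normalized
/// planar f32 tensor (CHW layout).
/// Returns None if the buffer length is not width * height * 3.
fn rgb_to_chw(pixels: &[u8], width: usize, height: usize) -> Option<Vec<f32>> {
    let plane = width.checked_mul(height)?;
    if pixels.len() != plane.checked_mul(3)? {
        return None;
    }
    let mut out = vec![0.0_f32; plane * 3];
    for (i, px) in pixels.chunks_exact(3).enumerate() {
        for c in 0..3 {
            // Scale to [0, 1], then standardize per channel.
            let v = f32::from(px[c]) / 255.0;
            out[c * plane + i] = (v - MEAN[c]) / STD[c];
        }
    }
    Some(out)
}
```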

Output

  • Vec<f32> per image. Output dimensionality model-dependent.
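To reach the vector store, the `Vec<f32>` has to be serialized. sqlite-vec accepts float vectors as a compact BLOB of raw 32-bit floats; the sketch below assumes little-endian byte order, and both helper names are hypothetical:

```rust
/// Serializes an embedding into raw little-endian f32 bytes.
/// Assumption: the store reads the BLOB as packed little-endian floats.
fn embedding_to_blob(v: &[f32]) -> Vec<u8> {
    v.iter().flat_map(|x| x.to_le_bytes()).collect()
}

/// Parses a BLOB back into an embedding.
/// Returns None if the length is not a multiple of 4 bytes.
fn blob_to_embedding(b: &[u8]) -> Option<Vec<f32>> {
    if b.len() % 4 != 0 {
        return None;
    }
    Some(
        b.chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}
```

Because dimensionality is model-dependent, a caller would likely also want to check the decoded length against the dimension expected for the encoder in use.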

Storage

  • ❌ Weights cache.
  • ✅ Vector store via sqlite-vec (shared with text-to-embedding).
  • ❌ Multi-vector storage (for ColPali-style late interaction) — sqlite-vec supports it via SQL but the SDK doesn’t expose it cleanly.
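For context on why multi-vector storage matters: late-interaction retrieval keeps several vectors per item and scores them with a MaxSim-style sum. A minimal sketch of that scoring, assuming dot-product similarity and equal dimensionality across all vectors; `max_sim` is a hypothetical helper, not an SDK function:

```rust
/// Dot product of two vectors (assumed to have the same length).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Late-interaction (MaxSim) score: for each query vector, take the best
/// dot product against any document vector, then sum those maxima.
/// An empty document contributes nothing, so the score is 0.0.
fn max_sim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| dot(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        // Drop the NEG_INFINITY sentinel produced when `doc` is empty.
        .filter(|s| s.is_finite())
        .sum()
}
```

Scoring this way requires every vector of a document to be retrievable together, which is what a single-vector-per-row SDK surface does not make convenient.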

Interaction (IPC + SDK)

  • embed.image IPC stub reserved in spec, not wired.

Capabilities (manifest)

  • capabilities.fs.user-selected or device.camera.
  • capabilities.models[] for the encoder.

Gaps

  • Image encoder runtime.
  • Multi-modal (text+image) joint embedding for visual search.
  • Multi-vector storage if shipping ColPali (see visual-document-retrieval).

See also