Locara

zero-shot-image-classification

HF group: Computer Vision · Status: ❌ not built

What it is

Image → label from a runtime-supplied list (no fine-tuning). CLIP-style: encode the image and each label into a shared embedding space, then pick the label closest to the image. Critical for app authors who don’t want to retrain a model for every new label set.
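
The "encode, then pick the closest" step can be sketched as below. This is a minimal illustration that assumes the embeddings have already been produced by some CLIP-class encoder; the function names `cosineSimilarity` and `closestLabel` are hypothetical and not part of any existing API.

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the label whose embedding is closest to the image embedding.
// labelEmbs[i] is assumed to be the text embedding of labels[i].
function closestLabel(
  imageEmb: number[],
  labels: string[],
  labelEmbs: number[][]
): string {
  if (labels.length === 0 || labels.length !== labelEmbs.length) {
    throw new Error("labels and labelEmbs must be non-empty and the same length");
  }
  let best = 0;
  let bestScore = -Infinity;
  for (let i = 0; i < labels.length; i++) {
    const score = cosineSimilarity(imageEmb, labelEmbs[i]);
    if (score > bestScore) {
      bestScore = score;
      best = i;
    }
  }
  return labels[best];
}
```

With toy 2-D vectors, `closestLabel([1, 0.1], ["cat", "dog"], [[0.9, 0.2], [0.1, 1]])` returns `"cat"`, because the first label embedding points in nearly the same direction as the image embedding.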

Open-weight models

| Model | Params | Released | License | Quality | Notes |
| --- | --- | --- | --- | --- | --- |
| CLIP ViT-L/14 | 428 M | 2021 | MIT | Foundational | Many fine-tunes. |
| OpenCLIP-bigG | ~2 B | 2023 | MIT | Top of class | LAION-pretrained. |
| SigLIP | 400 M – 1.1 B | 2024 | Apache-2.0 | Beats CLIP on most benches | Sigmoid loss; better calibrated. |
| MetaCLIP | 200 M – 1 B | 2024 | CC-BY-NC | Strong | Curated training data. |
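
The softmax-versus-sigmoid distinction in the SigLIP row affects how raw similarities become per-label scores. The sketch below contrasts the two, assuming the inputs are logits (similarities after whatever scale and bias the model applies); the function names and the example values are illustrative only.

```typescript
// CLIP-style scoring: softmax across the label list.
// Scores always sum to 1, so some label "wins" even if none fits.
function softmaxScores(logits: number[]): number[] {
  const max = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((acc, x) => acc + x, 0);
  return exps.map((x) => x / sum);
}

// SigLIP-style scoring: an independent sigmoid per label.
// Scores need not sum to 1, so every label can score low at once.
function sigmoidScores(logits: number[]): number[] {
  return logits.map((x) => 1 / (1 + Math.exp(-x)));
}
```

For an image that matches neither label, say logits `[-5, -6]`, softmax still assigns roughly 0.73 to the first label, while the sigmoid scores both stay below 0.01. An app can therefore threshold sigmoid scores to express "none of the above".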

Infrastructure required

Inference

Input

  • ❌ Image input pipeline.
  • Runtime-supplied label list (text array).

Output

  • Per-label similarity score; pick top-1 or top-K.
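
Selecting top-1 or top-K from the per-label scores could look like the following sketch. The `LabelScore` type and `topK` function are hypothetical names chosen for illustration, not part of a defined SDK.

```typescript
interface LabelScore {
  label: string;
  score: number;
}

// Pairs each label with its score and returns the k highest-scoring
// entries, best first. k = 1 gives the top-1 result.
function topK(labels: string[], scores: number[], k: number): LabelScore[] {
  if (labels.length !== scores.length) {
    throw new Error("labels and scores must have the same length");
  }
  return labels
    .map((label, i) => ({ label, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, Math.max(0, k));
}
```

For example, `topK(["a", "b", "c"], [0.1, 0.7, 0.2], 2)` yields the entries for `"b"` and then `"c"`.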

Storage

  • ❌ Weights cache.

Interaction (IPC + SDK)

  • vision.classify_zero_shot({ image, labels }) IPC.
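
Since this IPC is not built yet, the following is only a sketch of how the request might be typed and sanity-checked before it crosses the IPC boundary. Only the field names `image` and `labels` come from the call above; the `ZeroShotRequest` type, the choice of `Uint8Array` for the image, and `normalizeRequest` are assumptions made for illustration.

```typescript
// Hypothetical request shape for vision.classify_zero_shot.
interface ZeroShotRequest {
  image: Uint8Array; // assumption: encoded image bytes (e.g. PNG or JPEG)
  labels: string[];  // runtime-supplied candidate labels
}

// Trims labels, drops empty strings and duplicates (keeping first
// occurrences), and rejects requests that cannot be classified.
function normalizeRequest(req: ZeroShotRequest): ZeroShotRequest {
  if (req.image.length === 0) {
    throw new Error("image is empty");
  }
  const labels = req.labels
    .map((l) => l.trim())
    .filter((l) => l.length > 0)
    .filter((l, i, all) => all.indexOf(l) === i);
  if (labels.length === 0) {
    throw new Error("labels must contain at least one non-empty string");
  }
  return { image: req.image, labels };
}
```

Deduplicating up front keeps the text encoder from embedding the same label twice and avoids duplicate entries in the returned scores.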

Capabilities (manifest)

  • capabilities.fs.user-selected or device.camera for the image source.
  • capabilities.models[] for the CLIP-class model.

Gaps

  • Image encoder inference path.
  • Joint text-image embedding storage / runtime.
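
One possible shape for the embedding storage gap is a per-model cache of label embeddings, so a label list that repeats across calls is only encoded once. The sketch below is an assumption about how that could work, not a description of an existing component; `EncodeLabel` stands in for the not-yet-built text encoder, and `LabelEmbeddingCache` is a hypothetical name.

```typescript
// Stand-in for the text encoder of a CLIP-class model (hypothetical).
type EncodeLabel = (label: string) => number[];

// In-memory cache of label embeddings. One instance per model, because
// embeddings from different models are not comparable.
class LabelEmbeddingCache {
  private entries: Record<string, number[]> = {};
  private encode: EncodeLabel;
  encodeCalls = 0; // number of cache misses, exposed for inspection

  constructor(encode: EncodeLabel) {
    this.encode = encode;
  }

  // Returns the cached embedding, encoding the label on first use.
  get(label: string): number[] {
    const key = "label:" + label; // prefix avoids clashes with object built-ins
    if (!Object.prototype.hasOwnProperty.call(this.entries, key)) {
      this.entries[key] = this.encode(label);
      this.encodeCalls++;
    }
    return this.entries[key];
  }
}
```

A persistent version would also need the model identifier in the key, plus an eviction policy, both of which are left out here.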

See also