`zero-shot-image-classification`

HF group: Computer Vision · Status: ❌ not built

What it is

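Image → label from a runtime-supplied list (no fine-tuning). CLIP-style: encode the image and each label into a shared embedding space, then pick the label whose embedding is most similar to the image's (typically by cosine similarity). Critical for app authors who need custom label sets without retraining a model.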
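The "encode, then pick the closest" step can be sketched as follows. This is a minimal illustration, not the planned implementation: the embeddings are placeholder vectors standing in for the output of a CLIP-class image encoder and text encoder, and the names `Embedding`, `LabelEmbedding`, and `closestLabel` are hypothetical.

```typescript
// Placeholder type: a real system would obtain these vectors from a
// CLIP-class image encoder and text encoder (assumption for this sketch).
type Embedding = number[];

interface LabelEmbedding {
  label: string;
  embedding: Embedding;
}

function dot(a: Embedding, b: Embedding): number {
  if (a.length !== b.length) {
    throw new Error("embedding dimensions differ");
  }
  return a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
}

// Cosine similarity: dot product of the two vectors divided by their norms.
function cosineSimilarity(a: Embedding, b: Embedding): number {
  const denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
  if (denom === 0) {
    throw new Error("zero-length embedding");
  }
  return dot(a, b) / denom;
}

// Return the label whose embedding is most similar to the image embedding.
function closestLabel(image: Embedding, candidates: LabelEmbedding[]): string {
  if (candidates.length === 0) {
    throw new Error("label list is empty");
  }
  let bestLabel = "";
  let bestScore = -Infinity;
  for (const candidate of candidates) {
    const score = cosineSimilarity(image, candidate.embedding);
    if (score > bestScore) {
      bestScore = score;
      bestLabel = candidate.label;
    }
  }
  return bestLabel;
}

// Toy 3-dimensional vectors, chosen by hand for illustration only.
const demoPick = closestLabel([0.9, 0.1, 0.0], [
  { label: "cat", embedding: [1, 0, 0] },
  { label: "dog", embedding: [0, 1, 0] },
  { label: "car", embedding: [0, 0, 1] },
]);
console.log(demoPick); // prints "cat"
```

Because labels are only text, the candidate list can change per call; only the label embeddings need recomputing, not the model.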

Open-weight models

Model	Params	Released	License	Quality	Notes
CLIP ViT-L/14	428 M	2021	MIT	Foundational	Many fine-tunes.
OpenCLIP-bigG	~2.5 B	2023	MIT	Top of class	LAION-2B-pretrained.
SigLIP	400 M – 1.1 B	2024	Apache-2.0	Beats CLIP on most benchmarks	Sigmoid loss; better calibrated.
MetaCLIP	200 M – 1 B	2024	CC-BY-NC	Strong	Curated training data.

Infrastructure required

Inference

❌ Joint image + text encoder (CLIP-style). Shares the encoder-only-for-non-text rail with image-feature-extraction.

Input

❌ Image input pipeline.
Runtime-supplied label list (text array).

Output

Per-label similarity score; pick top-1 or top-K.
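One way to turn per-label similarities into a ranked top-K list is sketched below. It assumes a CLIP-style softmax over scaled similarities; the `scale` default of 100 is an assumed stand-in for a learned logit scale, and SigLIP-style models would more likely apply a sigmoid to each label independently. `LabelScore`, `softmax`, `topK`, and `rankLabels` are hypothetical names.

```typescript
interface LabelScore {
  label: string;
  score: number;
}

// Convert raw similarities into probabilities that sum to 1.
// `scale` is an assumed stand-in for a learned logit scale (temperature).
function softmax(similarities: number[], scale = 100): number[] {
  if (similarities.length === 0) {
    return [];
  }
  const logits = similarities.map((s) => s * scale);
  // Subtract the maximum logit before exponentiating for numerical stability.
  const maxLogit = Math.max(...logits);
  const exps = logits.map((l) => Math.exp(l - maxLogit));
  const total = exps.reduce((sum, e) => sum + e, 0);
  return exps.map((e) => e / total);
}

// Return the k highest-scoring entries, best first, without mutating the input.
function topK(scores: LabelScore[], k: number): LabelScore[] {
  return [...scores].sort((a, b) => b.score - a.score).slice(0, Math.max(0, k));
}

// Pair labels with softmax scores and keep the best k.
function rankLabels(labels: string[], similarities: number[], k: number): LabelScore[] {
  if (labels.length !== similarities.length) {
    throw new Error("labels and similarities must have the same length");
  }
  const probs = softmax(similarities);
  const scored = labels.map((label, i) => ({ label, score: probs[i] ?? 0 }));
  return topK(scored, k);
}
```

Calling `rankLabels(labels, similarities, 1)` gives the top-1 case; a larger `k` lets the caller show alternatives or apply its own confidence threshold.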

Storage

❌ Weights cache.

Interaction (IPC + SDK)

❌ vision.classify_zero_shot({ image, labels }) IPC.
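Since the IPC is not built yet, the payload shapes below are only a sketch of what the call might carry. Only `image` and `labels` come from the line above; the `Uint8Array` image type, the optional `topK` field, the result shape, and `validateRequest` are all assumptions made for illustration.

```typescript
// Hypothetical request shape: only `image` and `labels` appear in the spec.
interface ClassifyZeroShotRequest {
  image: Uint8Array; // assumed: encoded image bytes (for example PNG or JPEG)
  labels: string[];  // runtime-supplied candidate labels
  topK?: number;     // assumed optional field: how many results to return
}

// Hypothetical result shape: one entry per returned label.
interface ClassifyZeroShotResult {
  label: string;
  score: number;
}

// Collect human-readable problems with a request before sending it over IPC.
function validateRequest(req: ClassifyZeroShotRequest): string[] {
  const errors: string[] = [];
  if (req.image.length === 0) {
    errors.push("image is empty");
  }
  if (req.labels.length === 0) {
    errors.push("labels must contain at least one entry");
  }
  if (req.labels.some((l) => l.trim() === "")) {
    errors.push("labels must not be blank");
  }
  if (req.labels.some((l, i) => req.labels.indexOf(l) !== i)) {
    errors.push("labels must be unique");
  }
  if (req.topK !== undefined && (Math.floor(req.topK) !== req.topK || req.topK < 1)) {
    errors.push("topK must be a positive integer");
  }
  return errors;
}
```

Validating on the SDK side would let bad label lists fail fast with a clear message rather than surfacing as an opaque inference error.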

Capabilities (manifest)

capabilities.fs.user-selected or device.camera, depending on the image source.
capabilities.models[] for the CLIP-class model.
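A manifest declaring these capabilities might look roughly like the fragment below. The key nesting, the boolean values, and the model identifier `"clip-vit-l-14"` are all assumptions; the source names only the capability paths, not their shape.

```typescript
// Hypothetical manifest fragment: nesting and value types are assumed.
const manifest = {
  capabilities: {
    // Read images the user explicitly picks.
    fs: { "user-selected": true },
    // Alternative image source (assumed placement under `capabilities`):
    // device: { camera: true },
    // Placeholder identifier for a CLIP-class model.
    models: ["clip-vit-l-14"],
  },
};
```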

Gaps

Image encoder inference path.
Joint text-image embedding storage / runtime.

See also