zero-shot-image-classification
HF group: Computer Vision · Status: ❌ not built
What it is
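Maps an image to a label chosen from a list supplied at runtime, with no fine-tuning. CLIP-style models encode the image and each candidate label into a shared embedding space, then pick the label whose embedding is closest to the image's. Critical for app authors who want custom categories without retraining.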
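A minimal sketch of the "encode, then pick the closest" step, assuming the image and label embeddings have already been produced by some CLIP-style model. The function names and the plain `number[]` embedding type are illustrative assumptions, not part of any existing API.

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("embedding dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper: returns the label whose embedding is closest
// (highest cosine similarity) to the image embedding.
function closestLabel(
  imageEmbedding: number[],
  labels: string[],
  labelEmbeddings: number[][]
): string {
  if (labels.length === 0 || labels.length !== labelEmbeddings.length) {
    throw new Error("labels and labelEmbeddings must be non-empty and aligned");
  }
  let bestIndex = 0;
  let bestScore = -Infinity;
  for (let i = 0; i < labels.length; i++) {
    const score = cosineSimilarity(imageEmbedding, labelEmbeddings[i]);
    if (score > bestScore) {
      bestScore = score;
      bestIndex = i;
    }
  }
  return labels[bestIndex];
}
```

Because the labels are only data passed in at call time, changing the category set means changing the `labels` array rather than the model.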
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| CLIP ViT-L/14 | 428 M | 2021 | MIT | Foundational | Many fine-tunes. |
| OpenCLIP-bigG | ~2 B | 2023 | MIT | Top of class | LAION-pretrained. |
| SigLIP | 400 M – 1.1 B | 2024 | Apache-2.0 | Beats CLIP on most benchmarks | Sigmoid loss; better calibrated. |
| MetaCLIP | 200 M – 1 B | 2024 | CC-BY-NC | Strong | Curated training data. |
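One way to read the "sigmoid loss" note in the table: a CLIP-style model normally turns similarities into scores with a softmax across the candidate labels, while a SigLIP-style model scores each label independently with a sigmoid. The sketch below contrasts the two. The function names are hypothetical, and the `logitScale` and `logitBias` values used in the usage note are placeholders; real checkpoints ship their own learned values.

```typescript
// CLIP-style scoring: softmax across all candidate labels.
// Scores always sum to 1, even if no label fits the image.
function softmaxScores(similarities: number[], logitScale: number): number[] {
  const logits = similarities.map((s) => s * logitScale);
  const maxLogit = Math.max(...logits); // subtract max for numerical stability
  const exps = logits.map((l) => Math.exp(l - maxLogit));
  const sum = exps.reduce((acc, e) => acc + e, 0);
  return exps.map((e) => e / sum);
}

// SigLIP-style scoring: an independent sigmoid per label.
// Each score is its own probability, so every label can score low.
function sigmoidScores(
  similarities: number[],
  logitScale: number,
  logitBias: number
): number[] {
  return similarities.map(
    (s) => 1 / (1 + Math.exp(-(s * logitScale + logitBias)))
  );
}
```

With two weak similarities such as `[0.1, 0.05]`, `softmaxScores` still forces the scores to sum to 1, whereas `sigmoidScores` with a negative bias can report both labels as unlikely. That makes per-label thresholds easier to reason about when none of the supplied labels applies.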
Infrastructure required
Inference
- ❌ Joint image + text encoder (CLIP-style). Shares the encoder-only-for-non-text rail with image-feature-extraction.
Input
- ❌ Image input pipeline.
- Runtime-supplied label list (text array).
Output
- Per-label similarity score; pick top-1 or top-K.
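The output step could look roughly like the sketch below: pair each label with its score, sort descending, and keep the first K entries (K = 1 gives the top-1 label). The `LabelScore` type and `topK` function are illustrative names, not an existing API.

```typescript
// Hypothetical result shape: one entry per candidate label.
interface LabelScore {
  label: string;
  score: number;
}

// Returns the k highest-scoring labels, best first.
function topK(labels: string[], scores: number[], k: number): LabelScore[] {
  if (labels.length !== scores.length) {
    throw new Error("labels and scores must have the same length");
  }
  return labels
    .map((label, i) => ({ label, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, Math.max(0, k));
}
```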
Storage
- ❌ Weights cache.
Interaction (IPC + SDK)
- ❌ `vision.classify_zero_shot({ image, labels })` IPC.
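A possible shape for the request and response of this IPC, sketched as TypeScript types plus a small validator. Only the call name and the `image` and `labels` fields come from this document. The `Uint8Array` image type, the optional `topK` field, the response shape, and `validateRequest` are assumptions for illustration.

```typescript
// Assumed request shape; only `image` and `labels` are named in the spec.
interface ClassifyZeroShotRequest {
  image: Uint8Array; // assumption: encoded image bytes
  labels: string[]; // runtime-supplied candidate labels
  topK?: number; // assumption: optional result limit
}

// Assumed response shape: per-label scores, best first.
interface ClassifyZeroShotResponse {
  results: { label: string; score: number }[];
}

// Hypothetical validator: returns a list of problems, empty if the request is usable.
function validateRequest(req: ClassifyZeroShotRequest): string[] {
  const problems: string[] = [];
  if (req.image.length === 0) {
    problems.push("image is empty");
  }
  if (req.labels.length === 0) {
    problems.push("labels is empty");
  }
  if (req.labels.some((label) => label.trim().length === 0)) {
    problems.push("labels contains a blank entry");
  }
  if (req.topK !== undefined && (req.topK < 1 || Math.floor(req.topK) !== req.topK)) {
    problems.push("topK must be a positive integer");
  }
  return problems;
}
```

Rejecting an empty label list before it reaches the encoder keeps the failure on the caller's side of the IPC boundary, where the error message is easier to act on.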
Capabilities (manifest)
- `capabilities.fs.user-selected` or `device.camera`.
- `capabilities.models[]` for the CLIP-class model.
Gaps
- Image encoder inference path.
- Joint text-image embedding storage / runtime.
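As one possible piece of the embedding storage gap: labels arrive at runtime, but apps tend to resend the same label list, so text embeddings could be cached per label string. The sketch below wraps an injected encoder function. `makeCachedTextEncoder` and the synchronous `encode` signature are assumptions made to keep the example short; a real encoder would likely be asynchronous, and the cache would need to be keyed per model.

```typescript
// Hypothetical wrapper: memoizes label embeddings by exact label string.
// `encode` stands in for whatever text encoder the runtime ends up exposing.
function makeCachedTextEncoder(
  encode: (label: string) => number[]
): (label: string) => number[] {
  const cache: Record<string, number[]> = {};
  return function cachedEncode(label: string): number[] {
    if (!Object.prototype.hasOwnProperty.call(cache, label)) {
      cache[label] = encode(label); // first use: run the encoder and store the result
    }
    return cache[label];
  };
}
```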