visual-document-retrieval
HF group: Multimodal · Status: ❌ not built
What it is
Search a corpus of documents by the visual similarity of rendered pages rather than by OCR’d text. A VLM embeds the query (text or image) and each page (as a rendered image) into multiple vectors, and late interaction scores each query-page pair. This approach tends to beat OCR + text retrieval on visually rich documents (charts, tables, infographics).
The ColPali class of models. Highly relevant for the DocVault reference app — replaces the OCR + text-embedding pipeline with a single visually-aware retrieval stage.
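
The late-interaction step can be sketched as ColBERT-style MaxSim scoring: for each query vector, take its best dot-product match among the page's patch vectors, then sum those maxima. This is a minimal pure-Python illustration, not any library's API. The function names and the toy 2-dimensional vectors are assumptions made for the example.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vectors: Sequence[Vector], page_vectors: Sequence[Vector]) -> float:
    """Late-interaction score: sum over query vectors of the best page-vector match.

    Assumes both inputs are non-empty (max() raises on an empty page).
    """
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


# Toy embeddings (illustrative only; real models emit many more vectors per page).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has an exact match for each query vector
page_b = [[0.5, 0.5], [0.5, 0.5]]              # only partial matches

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
print(score_a, score_b)
```

Because every query vector picks its own best patch, a page can score well when different regions of it answer different parts of the query, which a single pooled document vector cannot express.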
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| ColPali | ~3 B | 2024 | MIT | First in family | Based on PaliGemma + ColBERT-style late interaction. |
| ColQwen2 | ~2 B | 2024 | Apache-2.0 | Stronger than ColPali on MIRACL/InfographicVQA | Qwen2-VL backbone. |
| ColSmol | ~2 B | 2025 | Apache-2.0 | Smaller, still strong | SmolVLM backbone; runs on consumer hardware. |
Infrastructure required
Inference
- ❌ VLM in multi-vector mode (returns one vector per patch, not a single document vector). Different shape from current `embed.embed`.
Input
- ❌ PDF rasterizer (shared with `document-question-answering`).
- Text query + indexed images.
Output
- ❌ Top-K ranking with late-interaction scores.
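
A top-K ranking over late-interaction scores could look roughly like the sketch below. It scores every indexed page by brute force and keeps the K best with `heapq.nlargest`. The `(document id, page number)` key shape, the function names, and the in-memory `index` dictionary are assumptions made for illustration, not part of any existing API.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

MultiVector = Sequence[Sequence[float]]
PageKey = Tuple[str, int]  # (document id, page number); hypothetical key shape


def maxsim(query: MultiVector, page: MultiVector) -> float:
    """Sum over query vectors of the best dot-product match among page vectors."""
    return sum(
        max(sum(x * y for x, y in zip(q, p)) for p in page)
        for q in query
    )


def top_k_pages(
    query: MultiVector, index: Dict[PageKey, MultiVector], k: int
) -> List[Tuple[float, PageKey]]:
    """Score every page and return the k highest (score, key) pairs, best first."""
    scored = ((maxsim(query, vectors), key) for key, vectors in index.items())
    return heapq.nlargest(k, scored)


# Toy index with 2-dimensional vectors (illustrative only).
index = {
    ("report.pdf", 1): [[1.0, 0.0], [0.0, 1.0]],
    ("report.pdf", 2): [[0.5, 0.5]],
    ("slides.pdf", 1): [[0.0, 0.5]],
}
query = [[1.0, 0.0], [0.0, 1.0]]

results = top_k_pages(query, index, k=2)
for score, (document_id, page_number) in results:
    print(f"{document_id} p.{page_number}: {score}")
```

Brute-force scoring is linear in the number of pages. A larger corpus would likely need a cheaper first-stage candidate filter before this rerank, which is outside the scope of this sketch.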
Storage
- ❌ Weights cache.
- ❌ Multi-vector storage in `sqlite-vec` (supported via custom SQL but not exposed cleanly through `locara-storage`).
- App-side: page-level vectors (tens to hundreds per document).
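
To make the multi-vector storage requirement concrete, here is a minimal sketch that stores one row per patch vector in a plain SQLite table using Python's standard `sqlite3` module. It deliberately does not use the `sqlite-vec` virtual-table API or `locara-storage`. The table name, column names, and the little-endian float32 BLOB encoding are all assumptions chosen for the example.

```python
import sqlite3
import struct
from typing import List


def pack_vector(vector: List[float]) -> bytes:
    """Encode a vector as little-endian float32 bytes (assumed storage format)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def unpack_vector(blob: bytes) -> List[float]:
    """Decode a little-endian float32 BLOB back into a list of floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE page_patch_vectors (  -- hypothetical table name
        document_id TEXT    NOT NULL,
        page_number INTEGER NOT NULL,
        patch_index INTEGER NOT NULL,
        embedding   BLOB    NOT NULL,
        PRIMARY KEY (document_id, page_number, patch_index)
    )
    """
)


def store_page(conn: sqlite3.Connection, document_id: str, page_number: int,
               vectors: List[List[float]]) -> None:
    """Insert one row per patch vector for a single page."""
    rows = [(document_id, page_number, i, pack_vector(v)) for i, v in enumerate(vectors)]
    conn.executemany("INSERT INTO page_patch_vectors VALUES (?, ?, ?, ?)", rows)


def load_page(conn: sqlite3.Connection, document_id: str, page_number: int) -> List[List[float]]:
    """Read a page's patch vectors back in patch order."""
    cursor = conn.execute(
        "SELECT embedding FROM page_patch_vectors "
        "WHERE document_id = ? AND page_number = ? ORDER BY patch_index",
        (document_id, page_number),
    )
    return [unpack_vector(blob) for (blob,) in cursor]


store_page(conn, "report.pdf", 1, [[1.0, 0.0], [0.0, 1.0], [0.5, 0.25]])
loaded = load_page(conn, "report.pdf", 1)
row_count = conn.execute("SELECT COUNT(*) FROM page_patch_vectors").fetchone()[0]
print(row_count, loaded)
```

Keying rows by `(document_id, page_number, patch_index)` keeps all of a page's vectors retrievable together, which is what late-interaction scoring needs. A real storage primitive would also have to decide how deletes and re-indexing are handled.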
Interaction (IPC + SDK)
- ❌ `search.documents({ query, corpus })` IPC, OR extend `embed.embed` to return multi-vector results + a separate `search.rerank_late_interaction` op.
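
The two options imply different payload shapes, sketched below as Python dataclasses purely for illustration. Only `query` and `corpus` come from the proposed `search.documents` signature. Every other type and field name (`top_k`, `PageHit`, `MultiVectorEmbedding`, and so on) is a hypothetical placeholder, not an existing SDK type.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SearchDocumentsRequest:
    """Option 1: a single search op that hides embedding and scoring."""
    query: str       # from the proposed IPC signature
    corpus: str      # from the proposed IPC signature
    top_k: int = 10  # assumption: result limit, not in the proposed signature


@dataclass(frozen=True)
class PageHit:
    """One ranked page; field names are hypothetical."""
    document_id: str
    page_number: int
    score: float  # late-interaction score


@dataclass(frozen=True)
class SearchDocumentsResponse:
    hits: List[PageHit] = field(default_factory=list)


@dataclass(frozen=True)
class MultiVectorEmbedding:
    """Option 2: embedding result with one vector per token or patch."""
    vectors: List[List[float]]

    @property
    def dim(self) -> int:
        """Dimensionality of each vector, or 0 when there are none."""
        return len(self.vectors[0]) if self.vectors else 0


request = SearchDocumentsRequest(query="Q3 revenue chart", corpus="finance")
response = SearchDocumentsResponse(hits=[PageHit("report.pdf", 4, 17.5)])
embedding = MultiVectorEmbedding(vectors=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
print(request, response.hits[0], embedding.dim)
```

The first option keeps multi-vector data entirely behind the IPC boundary. The second exposes it to the app, which is more flexible but means the nested-list shape has to cross IPC.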
Capabilities (manifest)
- `capabilities.fs.user-selected` for input documents.
- `capabilities.models[]` for the ColPali-class model.
Gaps
- PDF rasterizer crate.
- Multi-vector embedding mode (different IPC shape than current `embed.embed`).
- Multi-vector storage primitive in `locara-storage` (`sqlite-vec` supports it via SQL but the SDK doesn’t expose it cleanly).
See also
- `document-question-answering`
- `text-to-embedding`
- `image-text-to-text`: same VLM backbones
- Crates: `locara-storage` (`sqlite-vec`), would need extension
- Index: `../modalities-and-models-survey.md`