`visual-document-retrieval`

HF group: Multimodal · Status: ❌ not built

What it is

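Search a corpus of documents by the visual similarity of rendered pages rather than by OCR’d text. A VLM embeds the query (text or image) and each page (as a rendered image) into a set of vectors; late interaction then scores each query/page pair by matching every query vector against its best-matching page vector. This approach beats OCR + text retrieval on visually rich documents (charts, tables, infographics).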

This is the ColPali class of models. It is highly relevant for the DocVault reference app, where it would replace the OCR + text-embedding pipeline with a single visually aware retrieval stage.
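The late-interaction scoring used by ColBERT-style models can be sketched as a "MaxSim" sum: for each query vector, take the highest dot product against any page vector, then add those maxima up. The snippet below is a minimal pure-Python illustration under that assumption; the function names and the toy 2-dimensional vectors are hypothetical and are not part of any existing SDK.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product; equals cosine similarity when inputs are L2-normalized."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Sum, over query vectors, of the best dot product against any page vector.

    Hypothetical helper for illustration; raises ValueError if page_vecs is empty.
    """
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy example: two query token vectors, three page patch vectors.
query = [(1.0, 0.0), (0.0, 1.0)]
page = [(0.9, 0.1), (0.2, 0.8), (0.5, 0.5)]

# First query vector matches best at 0.9, second at 0.8, so the score is about 1.7.
score = maxsim_score(query, page)
```

Because every query vector is compared against every page vector, the cost per page grows with the product of the two vector counts, which is why the multi-vector output has to be stored and scored differently from a single document embedding.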

Open-weight models

Model	Params	Released	License	Quality	Notes
ColPali	~3 B	2024	MIT (adapter weights); PaliGemma backbone under the Gemma license	First in family	Based on PaliGemma + ColBERT-style late interaction.
ColQwen2	~2 B	2024	Apache-2.0	Stronger than ColPali on ViDoRe (which includes InfographicVQA)	Qwen2-VL backbone.
ColSmol	256 M / 500 M	2025	Apache-2.0	Smaller, still strong	SmolVLM backbone; runs on consumer hardware.

Infrastructure required

Inference

❌ VLM in multi-vector mode (returns one vector per patch, not a single document vector). Different shape from current embed.embed.

Input

❌ PDF rasterizer (shared with document-question-answering).
Text query + indexed images.

Output

❌ Top-K ranking with late-interaction scores.
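As a rough sketch of what this output stage would compute, the snippet below scores every indexed page against the query with a late-interaction sum and keeps the K best. It is a brute-force illustration only: `top_k_pages`, the page-ID format, and the in-memory `corpus` dict are assumptions made for the example, not an existing API.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def late_interaction_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Sum over query vectors of the best dot product against any page vector."""
    return sum(
        max(sum(x * y for x, y in zip(q, p)) for p in page_vecs)
        for q in query_vecs
    )


def top_k_pages(
    query_vecs: Sequence[Vector],
    corpus: Dict[str, Sequence[Vector]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (page_id, score) pairs, best first."""
    scored = (
        (page_id, late_interaction_score(query_vecs, page_vecs))
        for page_id, page_vecs in corpus.items()
    )
    return heapq.nlargest(k, scored, key=lambda item: item[1])


# Hypothetical corpus: page ID -> per-patch vectors for that page.
corpus = {
    "report.pdf#1": [(1.0, 0.0), (0.0, 1.0)],
    "report.pdf#2": [(0.6, 0.8)],
    "slides.pdf#1": [(0.0, 1.0)],
}
results = top_k_pages([(1.0, 0.0)], corpus, k=2)
for page_id, page_score in results:
    print(page_id, page_score)
```

Scanning the whole corpus like this is only practical for small collections; a larger index would likely need a cheaper first-stage filter before late-interaction reranking.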

Storage

❌ Weights cache.
❌ Multi-vector storage in sqlite-vec (supported via custom SQL but not exposed cleanly through locara-storage).
App-side: one multi-vector set per page (one vector per patch), with tens to hundreds of pages per document.
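To make the storage shape concrete, here is a minimal sketch that keeps one row per patch vector, keyed by document, page, and patch index. It deliberately uses plain `sqlite3` BLOB columns from the Python standard library rather than the sqlite-vec or locara-storage APIs; the table name, column names, and helper functions are all hypothetical.

```python
import sqlite3
import struct
from typing import List, Sequence


def pack_vector(vec: Sequence[float]) -> bytes:
    """Serialize a vector as little-endian float32 values."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_vector(blob: bytes) -> List[float]:
    """Inverse of pack_vector: 4 bytes per float32 component."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def create_schema(conn: sqlite3.Connection) -> None:
    # One row per patch vector; a page's multi-vector set is all rows sharing (doc_id, page_no).
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS page_patch_vectors (
            doc_id    TEXT    NOT NULL,
            page_no   INTEGER NOT NULL,
            patch_idx INTEGER NOT NULL,
            embedding BLOB    NOT NULL,
            PRIMARY KEY (doc_id, page_no, patch_idx)
        )
        """
    )


def store_page(conn: sqlite3.Connection, doc_id: str, page_no: int,
               patch_vecs: Sequence[Sequence[float]]) -> None:
    conn.executemany(
        "INSERT INTO page_patch_vectors VALUES (?, ?, ?, ?)",
        [(doc_id, page_no, i, pack_vector(v)) for i, v in enumerate(patch_vecs)],
    )


def load_page(conn: sqlite3.Connection, doc_id: str, page_no: int) -> List[List[float]]:
    rows = conn.execute(
        "SELECT embedding FROM page_patch_vectors "
        "WHERE doc_id = ? AND page_no = ? ORDER BY patch_idx",
        (doc_id, page_no),
    )
    return [unpack_vector(blob) for (blob,) in rows]


conn = sqlite3.connect(":memory:")
create_schema(conn)
store_page(conn, "report.pdf", 1, [(0.5, 0.25), (1.0, 0.0)])
loaded = load_page(conn, "report.pdf", 1)
```

Keeping the patch index in the key preserves vector order on read-back, and loading a whole page in one query matches how late-interaction scoring consumes the vectors.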

Interaction (IPC + SDK)

❌ Either a dedicated search.documents({ query, corpus }) IPC, or an extension of embed.embed that returns multi-vector results plus a separate search.rerank_late_interaction op.

Capabilities (manifest)

capabilities.fs.user-selected for input documents.
capabilities.models[] for the ColPali-class model.

Gaps

PDF rasterizer crate.
Multi-vector embedding mode (different IPC shape than current embed.embed).
Multi-vector storage primitive in locara-storage (sqlite-vec supports it via SQL but the SDK doesn’t expose it cleanly).
