Locara

visual-document-retrieval

HF group: Multimodal · Status: ❌ not built

What it is

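Search a corpus of documents by the visual content of their pages rather than by OCR’d text. A VLM embeds the query (text or image) and each page (as a rendered image) into a set of vectors, and late interaction scores each query-page pair by matching every query vector to its most similar page vector. This approach beats OCR + text retrieval on visually rich documents (charts, tables, infographics).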

The ColPali class of models. Highly relevant for the DocVault reference app — replaces the OCR + text-embedding pipeline with a single visually-aware retrieval stage.
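The late-interaction matching described above can be sketched as a ColBERT-style MaxSim score. This is a minimal illustration on toy 2-D vectors, not the Locara implementation; the helper names (`normalize`, `dot`, `maxsim`) and the sample vectors are assumptions made for the example.

```python
from math import sqrt


def normalize(v):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, page_vecs):
    """Late-interaction score: for each query vector, take its best match
    among the page's patch vectors, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy 2-D embeddings (hypothetical values; real models emit many
# higher-dimensional vectors per page).
query = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0])]
page_a = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
page_b = [normalize(v) for v in ([1.0, 1.0], [-1.0, 0.0])]

score_a = maxsim(query, page_a)  # every query vector has an exact match
score_b = maxsim(query, page_b)  # only the diagonal patch is close
```

Here `page_a` outscores `page_b` because each query vector finds an identical patch vector on `page_a`, while on `page_b` the best match is only partially aligned.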

Open-weight models

| Model | Params | Released | License | Quality | Notes |
| --- | --- | --- | --- | --- | --- |
| ColPali | ~3 B | 2024 | MIT | First in family | Based on PaliGemma + ColBERT-style late interaction. |
| ColQwen2 | ~2 B | 2024 | Apache-2.0 | Stronger than ColPali on MIRACL/InfographicVQA | Qwen2-VL backbone. |
| ColSmol | ~2 B | 2025 | Apache-2.0 | Smaller, still strong | SmolVLM backbone; runs on consumer hardware. |

Infrastructure required

Inference

  • ❌ VLM in multi-vector mode (returns one vector per patch, not a single document vector). Different shape from current embed.embed.

Input

Output

  • ❌ Top-K ranking with late-interaction scores.
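A brute-force sketch of top-K ranking with late-interaction scores is shown below. It assumes the corpus fits in memory as a mapping from a page id to that page's patch vectors; the function name `top_k_pages`, the page-id format, and the sample vectors are hypothetical and not part of any existing Locara API.

```python
import heapq


def maxsim(query_vecs, page_vecs):
    """Sum over query vectors of the best dot product against any page vector."""
    return sum(
        max(sum(a * b for a, b in zip(q, p)) for p in page_vecs)
        for q in query_vecs
    )


def top_k_pages(query_vecs, corpus, k):
    """Return the k highest-scoring (score, page_id) pairs, best first.

    `corpus` maps a (hypothetical) page id to that page's patch vectors.
    """
    scored = (
        (maxsim(query_vecs, vecs), page_id) for page_id, vecs in corpus.items()
    )
    return heapq.nlargest(k, scored)


# Hypothetical three-page corpus with 2-D patch vectors.
corpus = {
    "report.pdf#1": [[1.0, 0.0], [0.0, 1.0]],
    "report.pdf#2": [[0.5, 0.5]],
    "slides.pdf#1": [[-1.0, 0.0], [0.0, 0.2]],
}
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
results = top_k_pages(query_vecs, corpus, k=2)
```

Scoring every page is linear in corpus size, so a larger corpus would likely need a cheaper candidate-selection step before this exact scoring pass.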

Storage

  • ❌ Weights cache.
  • Multi-vector storage in sqlite-vec (supported via custom SQL but not exposed cleanly through locara-storage).
  • App-side: page-level vectors (tens to hundreds per document).
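One possible layout for the multi-vector storage mentioned above is one row per patch vector, keyed by document, page, and patch index. The sketch uses only Python's standard `sqlite3` module with float32 BLOBs, not sqlite-vec or locara-storage; the table name, column names, and helper functions are assumptions made for illustration.

```python
import sqlite3
from array import array


def pack(vec):
    """Serialize a vector as raw float32 bytes."""
    return array("f", vec).tobytes()


def unpack(blob):
    """Inverse of pack(): raw float32 bytes back to a list of floats."""
    out = array("f")
    out.frombytes(blob)
    return list(out)


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE page_vectors (
        doc_id    TEXT    NOT NULL,
        page      INTEGER NOT NULL,
        patch_idx INTEGER NOT NULL,
        vector    BLOB    NOT NULL,
        PRIMARY KEY (doc_id, page, patch_idx)
    )
    """
)


def store_page(conn, doc_id, page, patch_vecs):
    """Insert every patch vector of one page as its own row."""
    conn.executemany(
        "INSERT INTO page_vectors VALUES (?, ?, ?, ?)",
        [(doc_id, page, i, pack(v)) for i, v in enumerate(patch_vecs)],
    )


def load_page(conn, doc_id, page):
    """Fetch a page's patch vectors in their original order."""
    rows = conn.execute(
        "SELECT vector FROM page_vectors "
        "WHERE doc_id = ? AND page = ? ORDER BY patch_idx",
        (doc_id, page),
    )
    return [unpack(blob) for (blob,) in rows]


store_page(conn, "report.pdf", 1, [[0.5, 0.25], [1.0, -2.0]])
loaded = load_page(conn, "report.pdf", 1)
```

Keeping the patch index in the primary key preserves vector order per page, which lets a late-interaction scorer load exactly one page's vectors with a single indexed query.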

Interaction (IPC + SDK)

  • search.documents({ query, corpus }) IPC, OR extend embed.embed to return multi-vector results + a separate search.rerank_late_interaction op.

Capabilities (manifest)

  • capabilities.fs.user-selected for input documents.
  • capabilities.models[] for the ColPali-class model.

Gaps

  • PDF rasterizer crate.
  • Multi-vector embedding mode (different IPC shape than current embed.embed).
  • Multi-vector storage primitive in locara-storage (sqlite-vec supports it via SQL but the SDK doesn’t expose it cleanly).

See also