document-question-answering
HF group: Multimodal · Status: ❌ not built
What it is
PDF / scanned page + question → answer. Distinct from generic `image-text-to-text` because layout, tables, and stamps matter. Highly relevant for the DocVault reference app.
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Donut | 200 M | 2022 | MIT | OCR-free, fine-tunable | Encoder-decoder; produces structured JSON. |
| LayoutLMv3 | 133-368 M | 2022 | CC BY-NC-SA 4.0 | DocVQA SOTA among open small models (ANLS 83.4) | OCR + layout multimodal pipeline. |
| Pix2Struct | 280 M | 2023 | Apache-2.0 | Strong on infographics | High-res image-to-text. |
| Idefics2 | 8 B | 2024 | Apache-2.0 | End-to-end, no OCR pipeline | VLM-class; better than LayoutLM on free-form Qs. |
| Qwen2.5-VL-7B | 7 B | 2025 | Apache-2.0 | Top end-to-end open | Same model as image-text-to-text. |
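The ANLS figure in the table is Average Normalized Levenshtein Similarity, the standard DocVQA metric. As a minimal sketch for scoring candidate models locally (function names are illustrative, and the 0.5 threshold and lowercase/trim normalization follow the usual benchmark convention rather than anything specified here):

```typescript
// Levenshtein edit distance between two strings (two-row dynamic programming).
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Per-question score: best similarity over all accepted answers.
// Similarity is 1 - normalized distance, zeroed when the distance reaches tau.
function anlsScore(prediction: string, answers: string[], tau: number = 0.5): number {
  const norm = (s: string): string => s.trim().toLowerCase();
  const p = norm(prediction);
  let best = 0;
  for (const answer of answers) {
    const g = norm(answer);
    const maxLen = Math.max(p.length, g.length);
    const nl = maxLen === 0 ? 0 : levenshtein(p, g) / maxLen;
    best = Math.max(best, nl < tau ? 1 - nl : 0);
  }
  return best;
}

// Dataset-level ANLS in [0, 1]; published tables usually report it times 100.
function anls(samples: { prediction: string; answers: string[] }[]): number {
  if (samples.length === 0) return 0;
  let total = 0;
  for (const s of samples) total += anlsScore(s.prediction, s.answers);
  return total / samples.length;
}
```

The threshold means a near-miss such as a single OCR character error still earns partial credit, while an unrelated answer scores zero.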
Infrastructure required
Inference
- ❌ Encoder-decoder runtime (Donut, Pix2Struct) OR VLM runtime (Idefics2, Qwen2.5-VL). Neither built.
- 🟡 Compose-with-OCR fallback works today: `locara-vision-ocr` → text-to-text LLM.
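The compose-with-OCR fallback can be sketched as a two-stage pipeline. The `OcrFn` and `LlmFn` types below are hypothetical stand-ins, not the actual `locara-vision-ocr` or LLM interfaces, and the prompt wording is only an assumption:

```typescript
// Hypothetical signatures: stand-ins for the OCR crate binding and a text-to-text LLM.
type OcrFn = (pageImage: Uint8Array) => Promise<string>;
type LlmFn = (prompt: string) => Promise<string>;

// Builds one prompt containing the OCR text of every page followed by the question.
function buildDocQaPrompt(pageTexts: string[], question: string): string {
  const pages = pageTexts
    .map((text, i) => `--- Page ${i + 1} ---\n${text.trim()}`)
    .join("\n\n");
  return (
    "Answer the question using only the document text below.\n\n" +
    `${pages}\n\nQuestion: ${question.trim()}\nAnswer:`
  );
}

// Runs OCR page by page (sequentially, to preserve page order), then asks the LLM.
async function askViaOcr(
  pages: Uint8Array[],
  question: string,
  ocr: OcrFn,
  llm: LlmFn
): Promise<string> {
  const pageTexts: string[] = [];
  for (const page of pages) {
    pageTexts.push(await ocr(page));
  }
  return llm(buildDocQaPrompt(pageTexts, question));
}
```

Plain OCR text drops the layout, table, and stamp cues that make this modality distinct, so this path is a stopgap and not a substitute for a dedicated runtime.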
Input
- ❌ PDF rasterizer crate (e.g. `pdfium-render`), producing one image per page. Biggest unique infrastructure ask for this modality.
- File picker via `fs.user-selected`.
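Whichever rasterizer is chosen, each page needs a target pixel size. A small sketch of that arithmetic, relying on the fact that the default PDF user-space unit is 1/72 inch; the `maxSide` cap is a hypothetical model input limit, not a value taken from any of the models above:

```typescript
const POINTS_PER_INCH = 72; // default PDF user-space unit is 1/72 inch

interface RasterSize {
  width: number;
  height: number;
}

// Pixel size of a page rendered at `dpi`. If `maxSide` is given, the page is
// scaled down uniformly so its longest side does not exceed that many pixels.
function rasterSize(
  widthPt: number,
  heightPt: number,
  dpi: number,
  maxSide?: number
): RasterSize {
  let scale = dpi / POINTS_PER_INCH;
  const longest = Math.max(widthPt, heightPt) * scale;
  if (maxSide !== undefined && longest > maxSide) {
    scale *= maxSide / longest;
  }
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}

// US Letter is 612 x 792 pt, so 150 dpi gives 1275 x 1650 px.
const letterAt150 = rasterSize(612, 792, 150);
```

Higher DPI helps with small print and stamps but increases per-page memory and token cost for VLM-class models, so a cap on the longest side is a common compromise.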
Output
- Streaming token Channel for free-form answers; structured JSON for fielded extraction (Donut returns JSON).
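One way to carry both output styles over a single channel is a tagged union of events. The event names and shapes here are assumptions for illustration; no wire format is defined yet:

```typescript
// Hypothetical event shape: streamed tokens, one structured result, or end-of-stream.
type DocAnswerEvent =
  | { kind: "token"; text: string }
  | { kind: "json"; fields: Record<string, unknown> }
  | { kind: "done" };

interface DocAnswer {
  text: string; // concatenated free-form tokens
  fields: Record<string, unknown> | null; // structured extraction, if any
}

// Folds a sequence of events into a final answer, ignoring anything after "done".
function collectAnswer(events: DocAnswerEvent[]): DocAnswer {
  let text = "";
  let fields: Record<string, unknown> | null = null;
  for (const event of events) {
    switch (event.kind) {
      case "token":
        text += event.text;
        break;
      case "json":
        fields = event.fields;
        break;
      case "done":
        return { text, fields };
    }
  }
  return { text, fields };
}
```

A UI would normally render `token` events as they arrive; the fold above is the simpler batch view, useful for persisting the final answer.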
Storage
- ❌ Weights cache.
- App-side: extracted answers + page snapshots persisted via `locara-storage`.
Interaction (IPC + SDK)
- ❌ `doc.ask({ path, question })` IPC.
- Page-level chunking + multi-page reasoning typically routed via `visual-document-retrieval` first.
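A rough sketch of the request shape and the page-routing step. Only `path` and `question` come from the proposed `doc.ask` call; `PageHit`, `validateDocAsk`, and `selectPages` are hypothetical names, and the retrieval scores are assumed to come from a separate `visual-document-retrieval` pass:

```typescript
// Request payload for the proposed doc.ask IPC call.
interface DocAskRequest {
  path: string;
  question: string;
}

// Hypothetical retrieval result: a 1-based page number and a relevance score.
interface PageHit {
  page: number;
  score: number;
}

// Returns a list of problems; an empty list means the request is acceptable.
function validateDocAsk(req: DocAskRequest): string[] {
  const problems: string[] = [];
  if (req.path.trim() === "") problems.push("path must not be empty");
  if (req.question.trim() === "") problems.push("question must not be empty");
  return problems;
}

// Keeps the k highest-scoring pages and returns them in reading order,
// so the model sees the selected pages in their original sequence.
function selectPages(hits: PageHit[], k: number): number[] {
  return hits
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b.score - a.score)
    .slice(0, Math.max(0, k))
    .map((hit) => hit.page)
    .sort((a, b) => a - b);
}
```

Routing through retrieval first keeps the number of rasterized pages sent to the model small, which matters most for long PDFs.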
Capabilities (manifest)
- `capabilities.fs.user-selected: true` for the PDF.
- `capabilities.models[]` for the DocVQA model.
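To illustrate how an app could be checked against these two capabilities, here is a sketch. The nested `Manifest` shape is an assumption inferred from the dotted key names, and `"docvqa-model"` is a placeholder identifier, not a real model ID:

```typescript
// Hypothetical manifest shape; only the two capability keys come from the list above.
interface Manifest {
  capabilities: {
    fs?: { "user-selected"?: boolean };
    models?: string[];
  };
}

// Lists the capabilities a document-QA app still needs to declare.
function missingCapabilities(manifest: Manifest, modelId: string): string[] {
  const missing: string[] = [];
  const fs = manifest.capabilities.fs;
  if (!fs || fs["user-selected"] !== true) {
    missing.push("capabilities.fs.user-selected");
  }
  const models = manifest.capabilities.models || [];
  if (models.indexOf(modelId) === -1) {
    missing.push("capabilities.models[] (" + modelId + ")");
  }
  return missing;
}

// Example manifest declaring both capabilities (placeholder model ID).
const exampleManifest: Manifest = {
  capabilities: {
    fs: { "user-selected": true },
    models: ["docvqa-model"],
  },
};
```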
Gaps
- PDF rasterizer crate (shared with `visual-document-retrieval`).
- Encoder-decoder OR VLM runtime.
- DocVault reference app would benefit directly.
See also
- `image-text-to-text`
- `visual-document-retrieval`
- Crate: `locara-vision-ocr` (OCR fallback)
- Index: ../modalities-and-models-survey.md