Locara

visual-question-answering

HF alias — see image-text-to-text.

VQA is a sub-task of image-text-to-text: it uses the same vision-language models (VLMs) and the same infrastructure. Hugging Face lists it separately for historical reasons; from a tooling perspective, Locara doesn’t need a separate modality entry.
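As a rough illustration of what treating the tag as an alias could look like in tooling, the sketch below resolves a task tag to the modality entry that handles it. The `TASK_ALIASES` table and `resolve_task` function are hypothetical names invented for this example; they are not Locara's actual implementation or part of any Hugging Face API.

```python
# Hypothetical alias table (an assumption for illustration only):
# maps a task tag to the modality entry that actually handles it.
TASK_ALIASES = {
    "visual-question-answering": "image-text-to-text",
}


def resolve_task(tag: str) -> str:
    """Return the modality entry for a task tag, following the alias if one exists."""
    # Tags without an alias entry are returned unchanged.
    return TASK_ALIASES.get(tag, tag)


if __name__ == "__main__":
    for tag in ("visual-question-answering", "image-text-to-text"):
        print(f"{tag} is handled as {resolve_task(tag)}")
```

Under this assumption, a tag with no alias (such as document-question-answering) passes through unchanged, so it can keep its own entry.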

For document-specific VQA where layout matters, see document-question-answering.