visual-question-answering
HF alias: see image-text-to-text.
VQA is a sub-task of image-text-to-text: it uses the same
vision-language models (VLMs) and the same infrastructure.
HuggingFace lists it separately for historical reasons; from a
tooling perspective, Locara doesn't need a separate modality entry.
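As a rough illustration of what treating the tag as an alias could look like in tooling, the sketch below resolves a task tag to its canonical modality before any lookup. The names TASK_ALIASES and canonical_task are hypothetical and are not taken from Locara's code; the only fact drawn from this page is that visual-question-answering maps to image-text-to-text.

```python
# Hypothetical alias table (an assumption for illustration, not Locara's
# actual data structure). Only the mapping itself comes from this page.
TASK_ALIASES = {
    "visual-question-answering": "image-text-to-text",
}


def canonical_task(tag: str) -> str:
    """Return the canonical modality for a task tag.

    Tags without an alias entry are returned unchanged, so modalities
    with their own entry (e.g. document-question-answering) pass through.
    """
    return TASK_ALIASES.get(tag, tag)


if __name__ == "__main__":
    for tag in ("visual-question-answering", "document-question-answering"):
        print(tag, "->", canonical_task(tag))
```

With a table like this, the alias costs one dictionary entry rather than a duplicated modality definition, which is the point the paragraph above makes.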
For document-specific VQA where layout matters, see
document-question-answering.