`video-text-to-text` (video Q&A / description)

HF group: Multimodal · Status: ❌ not built

What it is

Video (plus an optional text question) → text. Example questions: “What happens in this clip?”, “When does the person sit down?”

Open-weight models

Model	Params	Released	License	Quality	Notes
Qwen2.5-VL-7B	7 B	2025	Apache-2.0	Strong on short clips	1-FPS sampling typical.
MiniCPM-V 8B	8 B	2025	Apache-2.0	Lightweight video	Edge-friendly.
LLaVA-NeXT-Video-7B	7 B	2024	Apache-2.0	Reasonable	Older.
Qwen3.5-Omni	~30 B / 3 B active	2026-03	Apache-2.0	400+ s of 720p video at 1 FPS	Same model as the omni below.

Infrastructure required

Inference

❌ VLM inference for video (same model class as image-text-to-text).

Input

❌ Video input pipeline: frame sampler (typically 1 FPS), audio splitter for full-video understanding.
File picker via fs.user-selected.
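
The frame sampler could start from something like the sketch below. It only computes the timestamps at which frames would be grabbed; decoding is left out. The function name `sampleTimestamps` and the `maxFrames` cap are assumptions made for illustration, not part of any existing API.

```typescript
// Hypothetical helper: returns the timestamps (in seconds) at which to grab frames.
// Samples at `fps` frames per second starting at t = 0. If that would exceed
// `maxFrames`, it falls back to `maxFrames` evenly spaced timestamps instead.
function sampleTimestamps(
  durationSec: number,
  fps: number = 1,
  maxFrames: number = Infinity,
): number[] {
  if (!(durationSec > 0) || !(fps > 0) || !(maxFrames > 0)) return [];

  // Number of frames a plain fixed-rate sampler would produce.
  const natural = Math.ceil(durationSec * fps);
  const count = Math.min(natural, Math.floor(maxFrames));

  // Keep the fixed-rate spacing when under the cap, otherwise stretch it.
  const step = count === natural ? 1 / fps : durationSec / count;
  return Array.from({ length: count }, (_, i) => i * step);
}
```

At 1 FPS a 3.5 s clip yields frames at 0, 1, 2 and 3 s. Capping the frame count is one way to keep long clips inside a model's context budget; whether the real pipeline caps, strides, or truncates is still open.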

Output

Streaming token Channel (text answer).

Storage

❌ Weights cache.
Per-session: extracted frames + audio embeddings cached on disk.

Interaction (IPC + SDK)

❌ video.ask({ path, question }) IPC.
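
Since the IPC is not built yet, the following is only a sketch of what the call shape might look like. Only `path` and `question` come from the call above. The `TokenChannel` interface, the default question, and the `fakeVideoAsk` stub are all hypothetical names introduced for illustration.

```typescript
// Hypothetical request shape, based on the `video.ask({ path, question })` call.
interface VideoAskRequest {
  path: string;      // file chosen through the fs.user-selected picker
  question?: string; // optional; omitted for a plain description
}

// Hypothetical streaming channel: tokens arrive one at a time, then a done signal.
interface TokenChannel {
  onToken(token: string): void;
  onDone(): void;
}

// Assumed default prompt when no question is given (not specified in this document).
const DEFAULT_QUESTION = "Describe what happens in this video.";

// Fills in the default question and rejects an empty path.
function normalizeAskRequest(req: VideoAskRequest): Required<VideoAskRequest> {
  if (req.path.trim() === "") throw new Error("video.ask: path must not be empty");
  const question = req.question?.trim() || DEFAULT_QUESTION;
  return { path: req.path, question };
}

// Stand-in for the unbuilt backend: streams a canned answer through the channel.
function fakeVideoAsk(req: VideoAskRequest, channel: TokenChannel): void {
  const { question } = normalizeAskRequest(req);
  for (const token of ["stub answer to: ", question]) channel.onToken(token);
  channel.onDone();
}
```

A caller would pass a channel that appends each token to the UI as it arrives; the stub exists only so that this wiring can be exercised before VLM inference lands.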

Capabilities (manifest)

capabilities.fs.user-selected for input.
capabilities.models[] for the VLM.
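
For orientation, the two capability entries might appear in a manifest roughly as sketched here. The `Manifest` type, the boolean value for `user-selected`, and the `{ id }` shape of a model entry are assumptions; only the two capability paths are taken from this page.

```typescript
// Hypothetical manifest shape; only the two capability paths come from this document.
interface Manifest {
  capabilities: {
    fs: { "user-selected": boolean };
    models: { id: string }[];
  };
}

const manifest: Manifest = {
  capabilities: {
    fs: { "user-selected": true },     // capabilities.fs.user-selected
    models: [{ id: "Qwen2.5-VL-7B" }], // capabilities.models[]
  },
};
```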

Gaps

Video input pipeline (frame sampler, audio splitter).
VLM inference for video models.

See also