video-text-to-text (video Q&A / description)
HF group: Multimodal · Status: ❌ not built
What it is
Video (+ optional text question) → text. Example questions: “What happens in this clip?”, “When does the person sit down?”.
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 7 B | 2025 | Apache-2.0 | Strong on short clips | 1-FPS sampling typical. |
| MiniCPM-V 8B | 8 B | 2025 | Apache-2.0 | Lightweight video | Edge-friendly. |
| LLaVA-NeXT-Video-7B | 7 B | 2024 | Apache-2.0 | Reasonable | Older. |
| Qwen3.5-Omni | ~30 B / 3 B active | 2026-03 | Apache-2.0 | 400+ s of 720p video at 1 FPS | Same model as the omni below. |
Infrastructure required
Inference
- ❌ VLM inference for video (same model class as image-text-to-text).
Input
- ❌ Video input pipeline: frame sampler (typically 1 FPS), audio splitter for full-video understanding.
- File picker via fs.user-selected.
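The frame sampler above can be sketched as a function that picks the timestamps at which frames are extracted. This is a minimal sketch, not the planned implementation: sampleTimestamps and its maxFrames parameter are hypothetical names, and falling back to uniform spacing when a frame budget is exceeded is an assumption about how long clips might be handled.

```typescript
// Hypothetical helper: returns the timestamps (in seconds) at which frames
// would be extracted. Defaults to the 1 FPS sampling mentioned above.
function sampleTimestamps(
  durationSec: number,
  fps = 1,
  maxFrames = Number.POSITIVE_INFINITY, // assumed per-model frame budget
): number[] {
  if (!(durationSec > 0) || !(fps > 0) || !(maxFrames >= 1)) return [];

  // Number of frames at the requested rate, starting at t = 0.
  const natural = Math.ceil(durationSec * fps);
  const count = Math.min(natural, Math.floor(maxFrames));

  // Within budget: fixed 1/fps spacing. Over budget: spread the allowed
  // frames uniformly across the whole clip instead of truncating the end.
  const step = count === natural ? 1 / fps : durationSec / count;
  return Array.from({ length: count }, (_, i) => i * step);
}
```

For example, sampleTimestamps(3) yields [0, 1, 2], and a 400 s clip capped at 100 frames is sampled every 4 s.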
Output
- Streaming token Channel (text answer).
Storage
- ❌ Weights cache.
- Per-session: extracted frames + audio embeddings cached on disk.
Interaction (IPC + SDK)
- ❌ video.ask({ path, question }) IPC.
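One possible shape for this call, given that the output is a streaming token channel, is a method that returns an async iterable of text tokens. Everything below is an assumption for illustration: the VideoAskRequest, TokenStream, and VideoIpc types, the mockVideo stand-in backend, and the collect helper are hypothetical, and the real IPC surface does not exist yet.

```typescript
// Hypothetical request shape; the question is optional, as described above.
interface VideoAskRequest {
  path: string;
  question?: string;
}

// Assumed representation of the streaming token channel.
type TokenStream = AsyncIterable<string>;

interface VideoIpc {
  ask(req: VideoAskRequest): TokenStream;
}

// Stand-in backend so the sketch runs without a model: it echoes the
// request back one whitespace-delimited token at a time.
const mockVideo: VideoIpc = {
  async *ask({ path, question }) {
    const reply = `(${path}) ${question ?? "Describe this clip."}`;
    for (const token of reply.split(" ")) {
      yield token + " ";
    }
  },
};

// Drains a token stream into a single string.
async function collect(stream: TokenStream): Promise<string> {
  let text = "";
  for await (const token of stream) {
    text += token;
  }
  return text.trimEnd();
}
```

A caller that wants incremental rendering would iterate the stream with for await directly instead of calling collect.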
Capabilities (manifest)
- capabilities.fs.user-selected for input.
- capabilities.models[] for the VLM.
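As a rough illustration of the two capabilities named above, a manifest could be checked before the task runs. The value shapes here are assumptions: only the key paths capabilities.fs.user-selected and capabilities.models[] come from this document, while the boolean flag, the string model IDs, and the missingCapabilities helper are hypothetical.

```typescript
// Assumed manifest shape; only the key paths are taken from the text above.
interface Manifest {
  capabilities: {
    fs?: { "user-selected"?: boolean };
    models?: string[]; // assumed: a list of model identifiers
  };
}

const manifest: Manifest = {
  capabilities: {
    fs: { "user-selected": true },
    models: ["Qwen2.5-VL-7B"],
  },
};

// Hypothetical check: lists the capabilities this task needs but the
// manifest does not declare.
function missingCapabilities(m: Manifest): string[] {
  const missing: string[] = [];
  if (!m.capabilities.fs?.["user-selected"]) {
    missing.push("capabilities.fs.user-selected");
  }
  if (!m.capabilities.models?.length) {
    missing.push("capabilities.models[]");
  }
  return missing;
}
```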
Gaps
- Video input pipeline (frame sampler, audio splitter).
- VLM inference for video models.