Abstract
We present PhET‑Physics‑VideoQA, a controlled benchmark for assessing physics understanding in vision‑language models from video. The corpus comprises short clips sourced from PhET Interactive Simulations, covering four topic areas: mechanics & fluids, optics, electromagnetism & circuits, and quantum mechanics. Each clip is paired with a triad of expert‑validated questions (conceptual, numerical, and error‑detection), yielding a diverse testbed for pixel‑grounded reasoning.
Dataset
To cite the dataset, use the BibTeX in the References section or download the full entry: PHET‑PHYSICS‑VIDEOQA.bib.
All assets are released for research use. Please retain links to the dataset and code when you reference this work.
Results (LLM‑as‑a‑judge; 1–5 scale)
| Category | Question Type | GPT‑4o‑mini | Gemini‑2.5‑Flash‑Lite | Qwen‑VL‑Plus | Type Avg. |
|---|---|---|---|---|---|
| Mechanics & Fluids | Conceptual | 4.6 | 4.5 | 2.3 | 3.80 |
| Mechanics & Fluids | Error Detection | 3.0 | 3.2 | 2.5 | 2.90 |
| Mechanics & Fluids | Numerical | 4.0 | 4.2 | 2.1 | 3.43 |
| Quantum Mechanics | Conceptual | 3.7 | 3.8 | 1.6 | 3.03 |
| Quantum Mechanics | Error Detection | 2.4 | 2.5 | 1.5 | 2.13 |
| Quantum Mechanics | Numerical | 3.3 | 3.5 | 1.6 | 2.80 |
| E&M & Circuits | Conceptual | 4.7 | 4.6 | 3.3 | 4.20 |
| E&M & Circuits | Error Detection | 3.8 | 3.4 | 3.1 | 3.43 |
| E&M & Circuits | Numerical | 4.2 | 4.0 | 3.2 | 3.80 |
| Optics | Conceptual | 4.2 | 4.2 | 3.7 | 4.03 |
| Optics | Error Detection | 2.6 | 3.3 | 2.3 | 2.73 |
| Optics | Numerical | 4.6 | 4.3 | 3.9 | 4.27 |
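
The Type Avg. column is consistent with an unweighted mean of the three model scores in each row, rounded to two decimals. The page does not state this definition explicitly, so the following Python sketch treats it as an assumption and recomputes the value for three rows copied from the table. The names `scores` and `type_avg` are illustrative and are not part of any released code.

```python
from statistics import mean

# Scores copied from three rows of the results table, in column order:
# GPT-4o-mini, Gemini-2.5-Flash-Lite, Qwen-VL-Plus.
scores = {
    ("Mechanics & Fluids", "Conceptual"): [4.6, 4.5, 2.3],
    ("Quantum Mechanics", "Error Detection"): [2.4, 2.5, 1.5],
    ("Optics", "Numerical"): [4.6, 4.3, 3.9],
}


def type_avg(row_scores):
    """Unweighted mean of the per-model scores, rounded to two decimals.

    Assumption: this is how the Type Avg. column was derived.
    """
    return round(mean(row_scores), 2)


for (category, question_type), row in scores.items():
    print(f"{category} | {question_type} | {type_avg(row):.2f}")
```

Under this assumption the three rows come out to 3.80, 2.13, and 4.27, matching the table. Because the mean is unweighted, a single weak model pulls the row average down noticeably, which is worth keeping in mind when comparing question types.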
Video Samples
References & Citation
Download BibTeX: PHET‑PHYSICS‑VIDEOQA.bib
