Spotlight: Identifying and Localizing Video Generation Errors Using VLMs
As text-to-video (T2V) models become more realistic, their errors are increasingly subtle, fine-grained, and localized in space and time. Spotlight evaluates whether current vision-language models (VLMs) can detect, localize, and explain these nuanced video-generation errors. The benchmark contains 600 videos from state-of-the-art T2V models, including Veo3, Seedance, and LTX-2, with over 1,600 expert annotations spanning physics, semantics, and anatomy. Our experiments show that VLMs still lag far behind humans: human performance is nearly double that of the strongest baselines, highlighting the need for more robust perception and hallucination mitigation in automated video evaluation.
arXiv