SPIKE-RL: Video-LLMs meet Bayesian Surprise
Real-world videos often contain routine activity punctuated by surprising moments, but standard Video-LLM inference commonly samples frames uniformly and can miss the moments that define a video’s narrative. SPIKE is an inference-time framework that measures Bayesian surprise as the belief update triggered by new visual evidence, identifying when incoming frames conflict with prior beliefs. SPIKE-RL further refines the belief hypotheses behind this measurement with Group Relative Policy Optimization (GRPO), using rewards derived from video captions. Together, SPIKE and SPIKE-RL enable query-agnostic, surprise-weighted frame sampling that allocates more visual context to important moments, improving performance across five downstream video benchmarks.
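
To make the idea of surprise-weighted sampling concrete, here is a minimal sketch. It is not the authors' implementation. It assumes that beliefs are categorical distributions over a fixed set of hypotheses, that surprise is the KL divergence from prior to posterior (the classical definition of Bayesian surprise), and that a fixed frame budget is split across video windows in proportion to surprise. The names `kl_divergence` and `allocate_frames`, the `floor` parameter, and the toy belief values are all hypothetical.

```python
import math


def kl_divergence(posterior, prior, eps=1e-12):
    """KL(posterior || prior) for two categorical distributions.

    Assumption: surprise is scored as this divergence; the paper's exact
    scoring may differ.
    """
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(posterior, prior)
        if p > 0
    )


def allocate_frames(surprise, budget, floor=1):
    """Split a frame budget across windows in proportion to surprise.

    Every window keeps at least `floor` frames (hypothetical design choice)
    so that routine segments are never dropped entirely.
    """
    n = len(surprise)
    if budget < floor * n:
        raise ValueError("budget too small for the requested floor")
    remaining = budget - floor * n
    total = sum(surprise)
    # Fall back to uniform weights when nothing is surprising.
    weights = [s / total for s in surprise] if total > 0 else [1.0 / n] * n
    raw = [w * remaining for w in weights]
    alloc = [floor + math.floor(r) for r in raw]
    # Hand out frames lost to rounding, largest fractional part first.
    leftover = max(0, budget - sum(alloc))
    order = sorted(range(n), key=lambda i: raw[i] - math.floor(raw[i]), reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Toy beliefs over three hypotheses for four consecutive windows.
prior = [0.7, 0.2, 0.1]
posteriors = [
    [0.7, 0.2, 0.1],     # nothing new: belief unchanged
    [0.6, 0.3, 0.1],     # mild update
    [0.05, 0.05, 0.9],   # evidence contradicts the prior
    [0.65, 0.25, 0.1],   # mild update
]
surprise = [kl_divergence(post, prior) for post in posteriors]
alloc = allocate_frames(surprise, budget=16)
print(alloc)
```

In this toy setup the third window, where the posterior departs sharply from the prior, receives most of the frame budget, while the unchanged first window keeps only its floor allocation.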
arXiv