Can Vision-Language Models Solve the Shell Game?

Tiedong Liu, Wee Sun Lee
National University of Singapore

Cups Game

Track a ball hidden beneath visually identical opaque cups that undergo positional swaps.


Cards Game

Track a target card after it is flipped face-down and shuffled.


Visual Entity Tracking Benchmark (VET-Bench) is a synthetic diagnostic benchmark simulating the classic shell game with visually indistinguishable objects, forcing models to track entities through spatiotemporal continuity. The task is easy for humans but difficult for current VLMs: state-of-the-art VLMs perform at random chance, while our proposed Molmo2-SGCoT achieves over 90% accuracy.

Abstract

Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy above 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools.

VET-Bench

Overview of VET-Bench: a synthetic diagnostic testbed for visual entity tracking in VLMs. The benchmark features cups game and cards game tasks where visually identical objects are shuffled.

VET-Bench is a synthetic diagnostic benchmark designed to isolate spatiotemporal perception from frame-level appearance cues. Videos are rendered using three.js, supporting full synthetic variation in color, material, texture, lighting, and camera viewpoint. VET-Bench ensures that no single frame reveals either (i) the target's identity or (ii) the swap operation, forcing VLMs to rely exclusively on fine-grained spatiotemporal perception across frames.
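The videos themselves are rendered with three.js, but the underlying episode logic reduces to a sequence of pairwise position swaps applied to a hidden target. The sketch below is an illustrative Python model of that discrete logic only (it is not the paper's generation pipeline; the function name and interface are hypothetical):

```python
import random

def simulate_episode(n_objects, n_swaps, seed=None):
    """Model one shell-game episode as discrete swap logic (illustrative,
    not the paper's renderer): the target starts at a random position,
    then pairs of distinct positions are swapped n_swaps times.
    Returns the swap sequence and the target's final position."""
    rng = random.Random(seed)
    target = rng.randrange(n_objects)  # position where the target starts
    swaps = []
    for _ in range(n_swaps):
        i, j = rng.sample(range(n_objects), 2)  # two distinct positions swap
        swaps.append((i, j))
        # The target moves only if it sits at one of the swapped positions.
        if target == i:
            target = j
        elif target == j:
            target = i
    return swaps, target

# Matches the benchmark's default setting: 3 objects, 5 swaps.
swaps, final_pos = simulate_episode(n_objects=3, n_swaps=5, seed=0)
print(swaps, final_pos)
```

Because no frame reveals the target's identity, solving an episode is equivalent to composing these swaps correctly, which is exactly the state-tracking problem analyzed in the paper.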

Results


Performance of 16 state-of-the-art VLMs on VET-Bench, consisting of 50 cups-game and 50 cards-game videos featuring 3 objects and 5 swaps (~12 seconds). All existing VLMs perform near the random guessing baseline (33%). Our fine-tuned Molmo2-SGCoT achieves over 90% accuracy.


(a) Accuracy vs. Swap Count


(b) Accuracy vs. Object Count

Swap count: For non-shuffled episodes, most models achieve near-perfect accuracy. Performance drops substantially with just one swap and quickly converges to random guessing.

Object count: Even at N=2 (the simplest case), models fail to significantly outperform the random baseline. Accuracy scales as 1/N, indicating that models resort to random guessing rather than genuine entity tracking.
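The 1/N scaling is exactly the expected accuracy of guessing uniformly among N positions. A quick Monte Carlo check (illustrative; not code from the paper) confirms the baseline the models converge to:

```python
import random

def random_guess_accuracy(n_objects, trials=100_000, seed=0):
    """Estimate the accuracy of a uniform random guesser: the true
    position and the guess are independent uniform draws, so the
    expected accuracy is 1/n_objects."""
    rng = random.Random(seed)
    hits = sum(
        rng.randrange(n_objects) == rng.randrange(n_objects)
        for _ in range(trials)
    )
    return hits / trials

for n in (2, 3, 4):
    print(n, round(random_guess_accuracy(n), 3))  # ≈ 0.5, 0.333, 0.25
```

An accuracy curve that hugs 1/N across object counts is the signature of guessing rather than tracking, which is what the plot above shows for all evaluated VLMs.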

Comparison with Existing Benchmarks

Visual Shortcuts in the Perception Test

The Perception Test includes shell-game-style clips, but many contain appearance cues that bypass the need for tracking. These include distinctive cups that allow re-identification by appearance, and transparent cups that directly reveal the target.


Example frames from videos involving distinct cups in the Perception Test.


Example frames from videos involving transparent cups in the Perception Test.

Visual shortcuts in videos #7632 and #9896: unedited cut frames at the end reveal the answer by showing the cups being lifted.

Comparison with VideoReasonBench

VideoReasonBench includes cups-game-like tasks, but the swap operations are explicitly indicated by arrows overlaid on the frames. These visual annotations effectively serve as symbolic "swap tokens", allowing models to reason about state transitions from static in-frame cues. In contrast, VET-Bench has no such frame-level cues—correctly solving the task requires exploiting spatiotemporal continuity across frames.

Cups-game task in VideoReasonBench: swap operations are explicitly indicated by red arrows overlaid on the frames.

VideoReasonBench

Cups-game task in VET-Bench: no frame-level cues are provided, requiring spatiotemporal tracking.

VET-Bench

VideoReasonBench provides explicit frame-level cues (red arrows) highlighting swap operations, which are absent in VET-Bench.

Spatiotemporal Grounded Chain-of-Thought (SGCoT)

We propose SGCoT, which leverages Molmo2's native object tracking capability to generate spatiotemporal grounded object trajectories as intermediate states before providing the final answer.

Prompt: Track the Queen of Hearts and answer: where is the Queen of Hearts at the end of the video?
Response: <tracks coords="0.0 1 500 504;0.5 1 500 504;1.0 1 500 504;1.5 1 500 431;2.0 1 571 306;2.5 1 664 293;3.0 1 735 389;3.5 1 750 504;4.0 1 750 504;4.5 1 750 504;5.0 1 750 504;5.5 1 750 504;6.0 1 750 504;6.5 1 750 504;7.0 1 750 504;7.5 1 750 480;8.0 1 704 330;8.5 1 614 280;9.0 1 532 352;9.5 1 500 493;10.0 1 539 339;10.5 1 630 280;11.0 1 710 341;11.5 1 750 504;12.0 1 750 504">the Queen of Hearts</tracks> Answer: right.
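Judging from the example above, each semicolon-separated entry in the coords payload appears to encode `time id x y`. A minimal parser under that assumption (the field order is inferred from the example, not documented here):

```python
def parse_tracks(coords):
    """Parse a 'time id x y;...' coords string into
    (time, object_id, x, y) tuples, assuming that field order."""
    out = []
    for entry in coords.split(";"):
        t, obj_id, x, y = entry.split()
        out.append((float(t), int(obj_id), int(x), int(y)))
    return out

# The coords payload from the Queen of Hearts response above (abbreviated).
coords = "0.0 1 500 504;1.5 1 500 431;3.0 1 735 389;11.5 1 750 504;12.0 1 750 504"
track = parse_tracks(coords)
final_t, _, final_x, final_y = track[-1]
print(final_x, final_y)  # 750 504: the grounded final position behind "right"
```

The point of SGCoT is that the final answer ("right") is read off from this explicit trajectory rather than inferred in a single opaque step, giving the model verifiable intermediate states.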

BibTeX

@misc{liu2026visionlanguagemodelssolveshell,
      title={Can Vision-Language Models Solve the Shell Game?}, 
      author={Tiedong Liu and Wee Sun Lee},
      year={2026},
      eprint={2603.08436},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.08436}, 
}