Video World Models with Long-term Spatial Memory

* Equal contribution
1 Stanford University, 2 Shanghai Jiao Tong University, 3 The Chinese University of Hong Kong, 4 Shanghai Artificial Intelligence Laboratory, 5 S-Lab, Nanyang Technological University

TL;DR Our video generation framework relies on long-term spatial memory, working memory, and sparse episodic memory.

Abstract

Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework for enhancing the long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory, and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.
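To make the store/retrieve idea concrete, below is a minimal, hypothetical sketch of a geometry-grounded spatial memory: a persistent world-space point cloud that generated frames are written into, and that is projected back into a query camera when a region is revisited. The class name, data layout, and the lack of occlusion handling are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

class SpatialMemory:
    """Illustrative long-term memory: an accumulated static point cloud."""

    def __init__(self):
        self.points = np.empty((0, 3))   # world-space XYZ
        self.colors = np.empty((0, 3))   # per-point RGB in [0, 1]

    def store(self, points_world: np.ndarray, colors: np.ndarray) -> None:
        # Append newly generated static geometry to the memory.
        self.points = np.concatenate([self.points, points_world], axis=0)
        self.colors = np.concatenate([self.colors, colors], axis=0)

    def retrieve(self, world_to_cam: np.ndarray, K: np.ndarray,
                 hw: tuple[int, int]) -> np.ndarray:
        # Project stored points into the query camera to obtain a
        # geometry-guidance image for the frames to be generated next.
        # (No z-buffering here; a real renderer would resolve occlusions.)
        h, w = hw
        cam = (world_to_cam[:3, :3] @ self.points.T + world_to_cam[:3, 3:]).T
        in_front = cam[:, 2] > 1e-6
        proj = (K @ cam[in_front].T).T
        uv = (proj[:, :2] / proj[:, 2:3]).astype(int)
        guidance = np.zeros((h, w, 3))
        valid = ((uv[:, 0] >= 0) & (uv[:, 0] < w) &
                 (uv[:, 1] >= 0) & (uv[:, 1] < h))
        guidance[uv[valid, 1], uv[valid, 0]] = self.colors[in_front][valid]
        return guidance
```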

Data Construction

We use Mega-SaM to extract camera poses and dynamic point maps from the full video clip. For the source part, dynamic regions are erased via TSDF-Fusion, and the resulting point cloud is rendered along the target trajectory to serve as static geometry guidance for the target part. Qwen generates action annotations for the future target frames.
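The paragraph above describes a four-stage pipeline; the sketch below lays it out as plain Python under stated assumptions. The helper callables (run_megasam, erase_dynamic_tsdf, render_pointcloud, annotate_actions_qwen) are hypothetical placeholders passed in as arguments, not real APIs of Mega-SaM, TSDF-Fusion, or Qwen.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Reconstruction:
    camera_poses: np.ndarray   # (T, 4, 4) world-to-camera matrices
    point_maps: np.ndarray     # (T, H, W, 3) per-frame dynamic point maps

def build_training_pair(video_frames, target_trajectory,
                        run_megasam, erase_dynamic_tsdf,
                        render_pointcloud, annotate_actions_qwen):
    """Turn a raw clip into (geometry guidance, action captions) for training.

    All four callables are hypothetical stand-ins for the tools named
    in the text; only the data flow between them is illustrated here.
    """
    # 1. Estimate camera poses and dynamic point maps for the full clip.
    recon: Reconstruction = run_megasam(video_frames)

    # 2. Erase dynamic regions from the source part, keeping static geometry.
    static_points = erase_dynamic_tsdf(recon.point_maps, recon.camera_poses)

    # 3. Re-render the static point cloud along the target trajectory
    #    to serve as geometry guidance for the target part.
    guidance_frames = [render_pointcloud(static_points, pose)
                       for pose in target_trajectory]

    # 4. Caption the actions occurring in the future target frames.
    action_captions = annotate_actions_qwen(video_frames, target_trajectory)

    return guidance_frames, action_captions
```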

Static Geometry-grounded Video Generation

Results

A rider on horseback traverses a narrow, grassy path through a mountainous landscape under an overcast sky. The terrain is rugged, with rocky outcrops and sparse vegetation, suggesting a remote or wilderness setting.

A horseback rider traversing through various urban environments, from cobblestone streets to modern cityscapes. The journey showcases diverse architectural styles and settings, with the rider moving steadily through each scene.

In a desolate, mountainous landscape, two hands emerge from the ground, holding glowing orbs. The scene suggests a mystical or supernatural occurrence, with the hands and orbs being the focal point against the bleak, overcast sky and barren terrain.

A first-person perspective of a character navigating through a rugged, desert-like landscape with red rock formations and sparse vegetation. The sky is partly cloudy, and the sun casts dynamic shadows, enhancing the vivid colors of the environment.

A journey through a forested area, transitioning from a sunny day to a snowy landscape. The path is lined with tall trees and rocky terrain, eventually leading to a rustic wooden cabin nestled among the snow-covered ground.

An aerial view of a bustling beachside promenade lined with palm trees, shops, and restaurants. Cars are parked along the street, and people stroll along the sidewalk. The ocean is visible in the background, adding to the scenic beauty of the location.

A sleek, futuristic car drives through a vibrant cityscape with palm trees and modern architecture. The vehicle navigates smooth streets, passing by other cars and urban landmarks under a clear blue sky.

A sleek black sports car drives on a scenic mountain highway, passing other vehicles and surrounded by lush greenery and towering mountains under a clear blue sky.

BibTeX

@misc{wu2025spmem,
      title={Video World Models with Long-term Spatial Memory}, 
      author={Tong Wu and Shuai Yang and Ryan Po and Yinghao Xu and Ziwei Liu and Dahua Lin and Gordon Wetzstein},
      year={2025},
      eprint={2506.05284},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://za6ge7wbld.proxynodejs.usequeue.com/abs/2506.05284}, 
}

We thank Nerfies for providing this amazing project template.