Video World Models with Long-term Spatial Memory

* Equal contribution
1 Stanford University, 2 Shanghai Jiao Tong University, 3 The Chinese University of Hong Kong, 4 Shanghai Artificial Intelligence Laboratory, 5 S-Lab, Nanyang Technological University

TL;DR Our video generation framework relies on long-term spatial memory, working memory, and sparse episodic memory.

Abstract

Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework for enhancing the long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory, and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.
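To make the store/retrieve idea concrete, below is a minimal, hypothetical sketch of a geometry-grounded spatial memory: a persistent world-space point cloud that generated frames are written into, and that is projected back into a query camera when a region is revisited. The class name, data layout, and the lack of occlusion handling are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

class SpatialMemory:
    """Illustrative long-term memory: an accumulated static point cloud."""

    def __init__(self):
        self.points = np.empty((0, 3))   # world-space XYZ
        self.colors = np.empty((0, 3))   # per-point RGB in [0, 1]

    def store(self, points_world: np.ndarray, colors: np.ndarray) -> None:
        # Append newly generated static geometry to the memory.
        self.points = np.concatenate([self.points, points_world], axis=0)
        self.colors = np.concatenate([self.colors, colors], axis=0)

    def retrieve(self, world_to_cam: np.ndarray, K: np.ndarray,
                 hw: tuple[int, int]) -> np.ndarray:
        # Project stored points into the query camera to obtain a
        # geometry-guidance image for the frames to be generated next.
        # (No z-buffering here; a real renderer would resolve occlusions.)
        h, w = hw
        cam = (world_to_cam[:3, :3] @ self.points.T + world_to_cam[:3, 3:]).T
        in_front = cam[:, 2] > 1e-6
        proj = (K @ cam[in_front].T).T
        uv = (proj[:, :2] / proj[:, 2:3]).astype(int)
        guidance = np.zeros((h, w, 3))
        valid = ((uv[:, 0] >= 0) & (uv[:, 0] < w) &
                 (uv[:, 1] >= 0) & (uv[:, 1] < h))
        guidance[uv[valid, 1], uv[valid, 0]] = self.colors[in_front][valid]
        return guidance
```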

Data Construction

We use Mega-SaM to extract camera poses and dynamic point maps from the full video clip. For the source part, dynamic regions are erased via TSDF-Fusion, and the resulting point cloud is rendered along the target trajectory to serve as static geometry guidance for the target part. Qwen generates action annotations for the future target frames.
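The paragraph above describes a four-stage pipeline; the sketch below lays it out as plain Python under stated assumptions. The helper callables (run_megasam, erase_dynamic_tsdf, render_pointcloud, annotate_actions_qwen) are hypothetical placeholders passed in as arguments, not real APIs of Mega-SaM, TSDF-Fusion, or Qwen.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Reconstruction:
    camera_poses: np.ndarray   # (T, 4, 4) world-to-camera matrices
    point_maps: np.ndarray     # (T, H, W, 3) per-frame dynamic point maps

def build_training_pair(video_frames, target_trajectory,
                        run_megasam, erase_dynamic_tsdf,
                        render_pointcloud, annotate_actions_qwen):
    """Turn a raw clip into (geometry guidance, action captions) for training.

    All four callables are hypothetical stand-ins for the tools named
    in the text; only the data flow between them is illustrated here.
    """
    # 1. Estimate camera poses and dynamic point maps for the full clip.
    recon: Reconstruction = run_megasam(video_frames)

    # 2. Erase dynamic regions from the source part, keeping static geometry.
    static_points = erase_dynamic_tsdf(recon.point_maps, recon.camera_poses)

    # 3. Re-render the static point cloud along the target trajectory
    #    to serve as geometry guidance for the target part.
    guidance_frames = [render_pointcloud(static_points, pose)
                       for pose in target_trajectory]

    # 4. Caption the actions occurring in the future target frames.
    action_captions = annotate_actions_qwen(video_frames, target_trajectory)

    return guidance_frames, action_captions
```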

Static Geometry-grounded Video Generation

Results

A rider on horseback traverses a narrow, grassy path through a mountainous landscape under an overcast sky. The terrain is rugged, with rocky outcrops and sparse vegetation, suggesting a remote or wilderness setting.

A horseback rider traversing through various urban environments, from cobblestone streets to modern cityscapes. The journey showcases diverse architectural styles and settings, with the rider moving steadily through each scene.

In a desolate, mountainous landscape, two hands emerge from the ground, holding glowing orbs. The scene suggests a mystical or supernatural occurrence, with the hands and orbs being the focal point against the bleak, overcast sky and barren terrain.

A first-person perspective of a character navigating through a rugged, desert-like landscape with red rock formations and sparse vegetation. The sky is partly cloudy, and the sun casts dynamic shadows, enhancing the vivid colors of the environment.

A journey through a forested area, transitioning from a sunny day to a snowy landscape. The path is lined with tall trees and rocky terrain, eventually leading to a rustic wooden cabin nestled among the snow-covered ground.

An aerial view of a bustling beachside promenade lined with palm trees, shops, and restaurants. Cars are parked along the street, and people stroll along the sidewalk. The ocean is visible in the background, adding to the scenic beauty of the location.

A sleek, futuristic car drives through a vibrant cityscape with palm trees and modern architecture. The vehicle navigates smooth streets, passing by other cars and urban landmarks under a clear blue sky.

A sleek black sports car drives on a scenic mountain highway, passing other vehicles and surrounded by lush greenery and towering mountains under a clear blue sky.

BibTeX

@misc{wu2025spmem,
      title={Video World Models with Long-term Spatial Memory}, 
      author={Tong Wu and Shuai Yang and Ryan Po and Yinghao Xu and Ziwei Liu and Dahua Lin and Gordon Wetzstein},
      year={2025},
      eprint={2506.05284},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://za6ge7wbld.proxynodejs.usequeue.com/abs/2506.05284}, 
}

We thank Nerfies for providing this amazing project template.