Joint Video-Action Optimization -- learns a unified latent space for both video and action generation.
Decoupled Video-Action Decoding -- speeds up policy inference by skipping video generation.
Masked Training -- enables a single model to handle diverse tasks while reducing overfitting (a minimal sketch follows below).
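The masked-training idea can be read as one network trained under random modality masking, so the same weights cover policy, video prediction, forward dynamics, and inverse dynamics. Below is a minimal sketch assuming a hypothetical `model(obs)` that returns both video and action predictions; the actual UVA heads are diffusion-based, so the plain MSE losses here are placeholders only.

```python
# Sketch of random target masking during training (hypothetical interface,
# not the released UVA code). Dropping a modality's loss for some samples
# lets one model serve several video/action tasks.
import torch

def masked_training_step(model, obs, actions, future_frames, p_mask=0.5):
    """One training step with random target masking.

    obs:           (B, T_obs, C, H, W) past observations
    actions:       (B, T_act, A)       future action chunk (target)
    future_frames: (B, T_vid, C, H, W) future video frames (target)
    """
    B = obs.shape[0]
    drop_video = torch.rand(B, device=obs.device) < p_mask
    drop_action = torch.rand(B, device=obs.device) < p_mask
    drop_action &= ~drop_video  # keep at least one target per sample

    pred_video, pred_action = model(obs)  # assumed joint forward pass
    video_loss = ((pred_video - future_frames) ** 2).mean(dim=(1, 2, 3, 4))
    action_loss = ((pred_action - actions) ** 2).mean(dim=(1, 2))

    # Masked samples contribute no gradient for the dropped modality.
    loss = (video_loss * (~drop_video).float()
            + action_loss * (~drop_action).float()).mean()
    return loss
```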
For fair comparison between methods, all training data come from public datasets and no additional data are used. All evaluation settings are out-of-distribution, with environments, objects, backgrounds, and robot grippers unseen during training.
We compare UVA with the state-of-the-art policy model, Diffusion Policy with a pretrained vision encoder released by UMI (DP-UMI). UVA outperforms DP-UMI in multi-task settings.
Using video generation as additional supervision during training significantly boosts policy performance without slowing down policy inference.
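This works because the video head only adds cost at training time. A minimal sketch of action-only inference is shown below, using hypothetical `encode`, `action_head`, and `video_head` names that are not the released API.

```python
# Sketch of decoupled video-action decoding at inference: the shared latent is
# computed once and only the action head is queried, so no frames are rendered.
import torch

@torch.no_grad()
def act(model, obs):
    latent = model.encode(obs)          # shared video/action latent (assumed name)
    action = model.action_head(latent)  # lightweight action decoding only
    # model.video_head(latent) is skipped at test time -- video is training-only supervision
    return action
```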
Results on the validation set. UVA generates high-quality videos that closely match the ground truth. In contrast, UniPi occasionally produces blurry or mismatched images and may fail to generate some objects (Libero10: the second moka pot). In our experiments, we predict 4 future video frames, but the UVA framework can be extended to longer video predictions given more compute.
Our model can perform forward dynamics prediction, generating future observations conditioned on action inputs. We use these predictions to guide the behavior of a pretrained policy model, such as Diffusion Policy (DP). During training, the robot pushes two blocks randomly to any target. During testing, the future image generated by UVA is used to select the action that moves a specific block to a specific target.
DP alone achieves a 38% average success rate, while incorporating our model to generate future observations for trajectory selection increases the success rate to 60%. Using a ground-truth simulator provides an upper-bound success rate of 75%.
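The trajectory-selection loop can be sketched as follows; `dp.sample_actions`, `uva.predict_frames`, `goal_score`, and the candidate count are hypothetical placeholders for illustration, not the released interface.

```python
# Sketch of using UVA as a forward dynamics model to filter candidate action
# chunks sampled from a pretrained Diffusion Policy (DP).
import torch

def goal_score(pred_frame, goal_image):
    # Placeholder scoring: negative L2 distance to the goal image.
    return -torch.norm(pred_frame - goal_image).item()

@torch.no_grad()
def select_action(dp, uva, obs, goal_image, num_candidates=16):
    best_score, best_candidate = -float("inf"), None
    for _ in range(num_candidates):
        candidate = dp.sample_actions(obs)           # one sampled action chunk from DP
        frames = uva.predict_frames(obs, candidate)  # forward dynamics rollout
        score = goal_score(frames[-1], goal_image)   # how close the rollout ends to the goal
        if score > best_score:
            best_score, best_candidate = score, candidate
    return best_candidate
```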
Our model can also be used for inverse dynamics prediction on unseen data: given observations, UVA predicts the actions that caused the visual changes. ORB-SLAM3 is a well-engineered SLAM system used in UMI; while it achieves the best accuracy, it requires careful calibration and map building.
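A rough sketch of such an inverse-dynamics query, again with hypothetical `encode`/`action_head` names and no claim about the released interface: the model is given a clip of consecutive frames and asked for the actions that connect them, with no camera calibration or map building.

```python
# Sketch of inverse dynamics with UVA: recover the actions (e.g. relative
# end-effector motion) that explain the change between consecutive frames.
import torch

@torch.no_grad()
def infer_actions(uva, obs_seq):
    """obs_seq: (T, C, H, W) consecutive frames, e.g. from a wrist camera."""
    latent = uva.encode(obs_seq.unsqueeze(0))  # assumed shared latent encoder
    actions = uva.action_head(latent)          # actions explaining the frame-to-frame change
    return actions.squeeze(0)                  # (T-1, action_dim)
```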
Are there any tips for training UVA?
We found that two-stage training works better than training on both video and action tasks simultaneously. In the first stage, the model is trained on video generation only; in the second stage, it is fine-tuned jointly on video and action tasks.
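A minimal sketch of that schedule, with hypothetical `video_loss`/`action_loss` helpers standing in for the actual diffusion objectives and arbitrary step counts:

```python
# Two-stage schedule: stage 1 trains video generation only, stage 2 fine-tunes
# jointly on video and action generation.
from itertools import cycle, islice

def train_two_stage(model, video_loader, joint_loader, optimizer,
                    stage1_steps=100_000, stage2_steps=100_000):
    # Stage 1: video generation only.
    for batch in islice(cycle(video_loader), stage1_steps):
        loss = model.video_loss(batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Stage 2: joint fine-tuning on video and action tasks.
    for batch in islice(cycle(joint_loader), stage2_steps):
        loss = model.video_loss(batch) + model.action_loss(batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```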
How long does it take to train UVA?
Training time depends on both the size of the dataset and the complexity of the task. For the UMI task, we sampled 500 trajectories from each of the three datasets and trained the model on 8 H100 GPUs. The video generation stage was trained for 2 days, and the joint video and action training required an additional 2 days.
What's the next step for UVA?
We believe there is still significant potential in UVA that remains unexplored, and we leave this for future work.
Additional video data: UVA can leverage large amounts of actionless video data, which could provide valuable additional supervision. We plan to pretrain UVA on additional video data in the future.
Multi-modality: UVA can be naturally extended to predict modalities beyond video and action, such as sound and force, by incorporating additional diffusion heads, offering a more comprehensive and versatile framework.
Better architecture: The model architecture could be further improved by replacing the diffusion heads with flow matching.
Larger model size: UVA's performance may currently be limited by the model size. We plan to explore larger models in the future.
The authors would like to thank Huy Ha for his valuable advice on video recording and website design. We also thank Amber Xie, Austin Patel, Jennifer Grannen, Vincent de Bakker, John So, Max Du, and Vidhi Jain for their valuable feedback on the paper draft. We are grateful to Mengda Xu, Suneel Belkhale, Xiaomeng Xu, Fanqi Lin, Lirui Wang, and Tianhong Li for helpful discussions. We would like to express our gratitude to Chi Cheng, Zhenjia Xu, Chuer Pan, Zeyi Liu, Huy Ha, Fanqi Lin, Yingdong Hu, and Zhaxizhuoma for their contributions to the shared UMI dataset. Finally, we want to thank everyone who contributed their computing resources to help us train the models.
This work was supported in part by the Toyota Research Institute, NSF Award #1941722, #2143601, #2037101, #2132519, ONR Project #N00014-22-1-2293 and the DARPA TIMAT project. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.
@inproceedings{li2025uva,
title={Unified Video Action Model},
author={Shuang Li and Yihuai Gao and Dorsa Sadigh and Shuran Song},
year={2025},
booktitle={arXiv},
}