Implicit policies parameterized by generative models, such as Diffusion Policy, have become the standard for policy learning and Vision–Language–Action (VLA) models in robotics. However, these approaches often suffer from high computational cost, exposure bias, and unstable inference dynamics, which lead to divergence under distribution shifts.
We introduce EBT-Policy, a new energy-based architecture that addresses these core issues in both simulated and real-world robotic settings. EBT-Policy consistently outperforms diffusion-based policies across simulated and real-world tasks, while requiring significantly less training and inference computation.
Explaining Uncertainty Modeling. Twelve frames are grouped into three phases: (1) Tool Insertion, (2) Hook Hanging Attempt, and (3) Recovery & Successful Retry. The color bar beneath each frame encodes the per-frame energy predicted by the model, where lower energy indicates higher certainty. Notably, red (Step 7) marks the failure that triggers an EBT-Policy retry, while green (Step 11) marks the successful correction. Together, these steps highlight EBT-Policy's interpretability and physical reasoning: it uses energy-based uncertainty to decide whether to continue or retry, and how to adjust its actions.
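As a rough illustration of this retry mechanism, the sketch below treats the policy's scalar energy for the current plan as an uncertainty signal and re-plans when it exceeds a threshold. The names `energy_fn`, `replan`, and `RETRY_THRESHOLD` are hypothetical placeholders, not part of the released code.

```python
RETRY_THRESHOLD = 0.5  # hypothetical, task-specific threshold; higher energy = lower certainty

def step_or_retry(energy_fn, obs, action_chunk, replan):
    """Execute the planned action chunk only if its predicted energy is low enough."""
    energy = energy_fn(obs, action_chunk)  # scalar energy of the candidate plan
    if float(energy) > RETRY_THRESHOLD:
        # High energy -> low confidence: discard the plan and re-optimize
        # from a fresh initialization (Step 7 -> Step 11 in the figure).
        action_chunk = replan(obs)
    return action_chunk
```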
Explaining Energy Minimization. EBT-Policy receives inputs (RGB frames, robot proprioception, and language instructions) and assigns an energy to candidate action trajectories. Starting from a noisy initialization, the trajectory is iteratively refined by gradient descent on this energy until it becomes a final executable plan. Optimization terminates when the energy converges to a minimum, as illustrated by the energy-landscape sketch.
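A minimal sketch of this inference loop, assuming a differentiable PyTorch energy model: gradient descent is run on a noisy action trajectory for a small, fixed number of steps. Function and argument names here (`energy_fn`, `horizon`, `step_size`) are illustrative, not the authors' API.

```python
import torch

def minimize_energy(energy_fn, obs, horizon, action_dim, num_steps=2, step_size=0.1):
    """Refine a noisy action trajectory by gradient descent on its predicted energy."""
    # Noisy initialization of the candidate action trajectory.
    actions = torch.randn(horizon, action_dim, requires_grad=True)
    for _ in range(num_steps):
        energy = energy_fn(obs, actions)              # scalar energy for this trajectory
        (grad,) = torch.autograd.grad(energy, actions)
        with torch.no_grad():
            actions -= step_size * grad               # descend the energy landscape
    return actions.detach()                           # final executable plan
```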
Success Rates During Training. EBT-Policy improves rapidly, reaching 100% success by epoch 30 while using just 2 optimization iterations to predict actions. Diffusion Policy (DP), in contrast, only reaches 100% success after 90 epochs and uses 50 times more inference steps than EBT-Policy, demonstrating that EBT-Policy is more efficient during both training and inference.
| Property | EBT-Policy-S | EBT-Policy-R |
|---|---|---|
| Parameters | ~30M | ~100M |
| Task | Simulation | Real World |
| Language Encoder | N/A | T5-S |
| Vision Encoder | ResNet-18 | DINOv3-S |
Comparison of EBT-Policy variants. EBT-Policy-S is a compact Transformer used for controlled simulation studies, while EBT-Policy-R is a larger multimodal variant designed for real-world, language-conditioned, and multitask policy learning.
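For concreteness, the two variants could be expressed as configuration objects like the sketch below; the dataclass and field names are hypothetical, with only the encoder names and parameter counts taken from the table above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EBTPolicyConfig:
    name: str
    approx_params: str               # approximate parameter count
    vision_encoder: str
    language_encoder: Optional[str]  # None = no language conditioning

EBT_POLICY_S = EBTPolicyConfig("EBT-Policy-S", "~30M", "ResNet-18", None)    # simulation
EBT_POLICY_R = EBTPolicyConfig("EBT-Policy-R", "~100M", "DINOv3-S", "T5-S")  # real world
```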
```bibtex
@misc{davies2025ebtpolicyenergyunlocksemergent,
  title={EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities},
  author={Travis Davies and Yiqi Huang and Alexi Gladstone and Yunxin Liu and Xiang Chen and Heng Ji and Huxian Liu and Luhui Hu},
  year={2025},
  eprint={2510.27545},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2510.27545},
}
```