Energy-Based Transformers (EBTs) out-scale feed-forward transformers while generalizing reasoning (System 2 thinking) to any modality and problem, without requiring verifiable rewards! EBTs are the first approach to out-scale feed-forward transformers across modalities and along several axes, including data, depth, parameters, and FLOPs. EBTs can also "think" over every single prediction they make (e.g., every token in language modeling) and generalize better than existing models.
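The core inference idea is easy to sketch: instead of emitting a prediction in a single forward pass, an energy-based model treats the prediction as a variable and refines it by gradient descent on a learned energy function, so more optimization steps mean more compute spent "thinking" about one prediction. Below is a minimal, hypothetical PyTorch sketch of that loop; `TinyEnergyModel`, `think`, and all sizes are illustrative assumptions, not the paper's actual architecture or implementation:

```python
# Minimal, illustrative sketch of energy-based "thinking" at inference time.
# NOT the authors' implementation; all names and sizes here are hypothetical.
import torch
import torch.nn as nn


class TinyEnergyModel(nn.Module):
    """Scores (context, candidate-prediction) pairs with a scalar energy.

    Lower energy means the candidate is more compatible with the context.
    """

    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, 128),
            nn.GELU(),
            nn.Linear(128, 1),
        )

    def forward(self, context: torch.Tensor, candidate: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([context, candidate], dim=-1)).squeeze(-1)


def think(model: nn.Module, context: torch.Tensor,
          steps: int = 8, lr: float = 0.1) -> torch.Tensor:
    """Refine a random initial guess by gradient descent on the energy.

    Each step is one unit of "thinking"; increasing `steps` spends more
    inference-time compute on this single prediction.
    """
    candidate = torch.randn(context.shape, requires_grad=True)
    for _ in range(steps):
        energy = model(context, candidate).sum()
        # Gradient of the energy w.r.t. the candidate only (not the weights).
        grad, = torch.autograd.grad(energy, candidate)
        candidate = (candidate - lr * grad).detach().requires_grad_(True)
    return candidate.detach()


if __name__ == "__main__":
    model = TinyEnergyModel()
    context = torch.randn(4, 64)      # a batch of 4 context embeddings
    prediction = think(model, context)
    print(prediction.shape)           # torch.Size([4, 64])
```

The key design point this sketch illustrates is that "thinking" is just optimization over the prediction space, which is why it applies to any modality and needs no verifiable reward signal.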
@misc{gladstone2025energybasedtransformersscalablelearners,
      title={Energy-Based Transformers are Scalable Learners and Thinkers},
      author={Alexi Gladstone and Ganesh Nanduru and Md Mofijul Islam and Peixuan Han and Hyeonjeong Ha and Aman Chadha and Yilun Du and Heng Ji and Jundong Li and Tariq Iqbal},
      year={2025},
      eprint={2507.02092},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.02092}
}