TL;DR

We out-scale (feed-forward) transformers while generalizing reasoning/System 2 thinking to any modality and problem without requiring verifiable rewards! Energy-Based Transformers (EBTs) are the first approach to out-scale feed-forward transformers across modalities and along several axes, including data, depth, parameters, and FLOPs. EBTs can also think over every single prediction they make (e.g., every token in language modeling) and generalize better than existing models.

Energy-Based Transformer video prediction as thinking
Energy-Based Transformer language model prediction as thinking

Thinking processes visualized as energy minimization for autoregressive EBTs. Initially, a random prediction is fed into the model, which assigns it a high energy. The model then iteratively refines this prediction by minimizing the energy through gradient descent (backpropagating the gradient of the energy to the prediction). This process repeats until the energy converges, enabling the model to know "when to stop thinking."
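
As a concrete illustration, here is a minimal PyTorch-style sketch of this thinking loop. It assumes a hypothetical `energy_fn(context, prediction)` that returns a scalar energy (lower means more compatible); this is not the interface of the released code, just an illustration of the idea.

```python
import torch

def think(energy_fn, context, pred_shape, num_steps=10, step_size=0.5, tol=1e-4):
    """Sketch of EBT 'thinking': gradient descent on the prediction, not the weights."""
    # Start from a random prediction; the EBT should assign it a high energy.
    prediction = torch.randn(pred_shape, requires_grad=True)

    prev_energy = float("inf")
    for _ in range(num_steps):
        energy = energy_fn(context, prediction)           # scalar compatibility score
        grad, = torch.autograd.grad(energy, prediction)   # d(energy)/d(prediction)
        with torch.no_grad():
            prediction -= step_size * grad                # refine the prediction
        if abs(prev_energy - energy.item()) < tol:        # energy converged:
            break                                         # stop thinking
        prev_energy = energy.item()

    return prediction.detach()
```

More thinking simply corresponds to running more refinement steps, and convergence of the energy gives a natural stopping criterion.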

Abstract

For more information, please see the blog post, code, or paper!

BibTeX

TODO: add BibTeX code here