Diffusion models have emerged as effective distribution estimators in vision, language, and reinforcement learning, but their use as priors in downstream tasks poses an intractable posterior inference problem. This paper studies amortized sampling of the posterior over data, $\mathbf{x}\sim p^\text{post}(\mathbf{x})\propto p(\mathbf{x})r(\mathbf{x})$, in a model that consists of a diffusion generative model prior $p(\mathbf{x})$ and a black-box constraint or likelihood function $r(\mathbf{x})$. We state and prove the asymptotic correctness of a data-free learning objective, relative trajectory balance, for training a diffusion model that samples from this posterior, a problem that existing methods solve only approximately or in restricted cases. Relative trajectory balance arises from the generative flow network perspective on diffusion models, which allows the use of deep reinforcement learning techniques to improve mode coverage. Experiments illustrate the broad potential of unbiased inference of arbitrary posteriors under diffusion priors: in vision (classifier guidance), language (infilling under a discrete diffusion LLM), and multimodal data (text-to-image generation). Beyond generative modeling, we apply relative trajectory balance to the problem of continuous control with a score-based behavior prior, achieving state-of-the-art results on benchmarks in offline reinforcement learning.
Given a diffusion model prior \( p(\mathbf{x}) \) and a black-box likelihood function \( r(\mathbf{x}) \), our goal is to sample from the posterior \( p^{\text{post}}(\mathbf{x}) \propto p(\mathbf{x}) r(\mathbf{x}) \). Conventional approaches often rely on heuristic guidance, which introduces bias or restricts applicability. In contrast, we derive a principled objective for posterior sampling, rooted in the Generative Flow Network (GFlowNet) perspective: it is asymptotically unbiased, requires no data, and admits off-policy training for improved mode coverage.
The Relative Trajectory Balance (RTB) objective enforces that, for every denoising trajectory \( \tau = (\mathbf{x}_T, \ldots, \mathbf{x}_0) \), the ratio of the trajectory probability under the posterior model \( p_\phi^{\text{post}} \) to that under the prior model \( p_\theta \) is proportional to the constraint function \( r(\mathbf{x}_0) \). This is achieved by minimizing the loss
\[
\mathcal{L}_{\text{RTB}}(\tau; \phi) = \left( \log \frac{Z_\phi \prod_{t=1}^{T} p_\phi^{\text{post}}(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{r(\mathbf{x}_0) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)} \right)^2 .
\]
Here, \( Z_{\phi} \) is a learnable scalar that estimates the normalization constant of the posterior. Satisfying the RTB constraint (driving the loss to zero) for all diffusion trajectories guarantees unbiased sampling from the desired posterior distribution \( p^{\text{post}}(\mathbf{x}) \propto p_\theta(\mathbf{x}) r(\mathbf{x}) \).
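As a concrete illustration, here is a minimal PyTorch sketch of the RTB loss for a batch of denoising trajectories, assuming the per-step log-probabilities under both models have already been computed. The names (`rtb_loss`, `log_pf_post`, `log_pf_prior`, `log_r`, `log_Z`) are illustrative, not taken from the paper's code.

```python
import torch

def rtb_loss(log_pf_post, log_pf_prior, log_r, log_Z):
    """Relative trajectory balance loss for a batch of denoising trajectories.

    log_pf_post:  (B, T) log p^post_phi(x_{t-1} | x_t) at each denoising step
    log_pf_prior: (B, T) log p_theta(x_{t-1} | x_t) under the frozen prior
    log_r:        (B,)   log r(x_0) of the terminal samples
    log_Z:        ()     learnable scalar estimating the log-normalization constant
    """
    # log of the ratio  Z_phi * p_phi^post(tau) / (r(x_0) * p_theta(tau));
    # the RTB constraint holds exactly when this is zero for every trajectory
    delta = log_Z + log_pf_post.sum(dim=1) - log_pf_prior.sum(dim=1) - log_r
    return (delta ** 2).mean()

# log_Z is trained jointly with the parameters of the posterior denoiser
log_Z = torch.nn.Parameter(torch.zeros(()))
```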
Because RTB is an off-policy objective, the trajectories used for training need not be sampled from the current posterior model, which enables better exploration and mode coverage. Useful strategies include replay buffers and local search.
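The sketch below shows one way such off-policy training could be organized, reusing `rtb_loss` from the snippet above. The `posterior.sample_trajectory()` and `.log_prob()` methods, the buffer policy, and `p_replay` are illustrative assumptions rather than the paper's implementation.

```python
import random

def train_rtb(posterior, prior, log_reward, optimizer, log_Z,
              n_iters=10_000, p_replay=0.5):
    """Schematic off-policy RTB training loop with a replay buffer (illustrative API)."""
    buffer = []  # stores full denoising trajectories (x_T, ..., x_0)
    for _ in range(n_iters):
        if buffer and random.random() < p_replay:
            # off-policy update: replay a previously visited (e.g. high-reward) trajectory
            traj = random.choice(buffer)
        else:
            # exploratory rollout of the current posterior model
            traj = posterior.sample_trajectory()
            buffer.append(traj)

        # re-evaluate per-step log-probabilities of the trajectory under both models
        log_pf_post = posterior.log_prob(traj)    # (B, T), differentiable
        log_pf_prior = prior.log_prob(traj)       # (B, T), frozen prior
        loss = rtb_loss(log_pf_post, log_pf_prior, log_reward(traj[-1]), log_Z)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

A local-search variant could, for instance, refine high-reward terminal samples (partially re-noising and re-denoising them) before adding the resulting trajectories to the buffer.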
We finetune unconditional diffusion priors for class-conditional generation on the MNIST and CIFAR-10 datasets. Starting from pretrained unconditional models \( p_\theta(x) \), we apply the RTB objective to adapt the priors to sample from posteriors conditioned on a class label \( c \), using the classifier probability \( r(x) = p(c | x) \) as the constraint during finetuning. The figure shows a selection of results: RTB effectively balances reward maximization and sample diversity, whether conditioning on a single class or on a multimodal constraint (e.g., even digits).
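For such classifier-defined constraints, the log-reward could be computed roughly as follows; `make_log_reward` and `classifier` are illustrative placeholders. In the multimodal case (e.g., all even digits), \( r(x) \) is simply the total classifier probability of the allowed labels.

```python
import torch
import torch.nn.functional as F

def make_log_reward(classifier, target_classes):
    """Constraint log r(x) = log p(c | x) from a pretrained classifier (illustrative).

    `target_classes` may contain several labels (e.g. the even MNIST digits),
    in which case r(x) is the total probability mass of the allowed classes."""
    def log_reward(x0):
        logits = classifier(x0)                      # (B, num_classes)
        log_probs = F.log_softmax(logits, dim=-1)    # log p(c | x)
        return torch.logsumexp(log_probs[:, target_classes], dim=-1)
    return log_reward

# e.g. a posterior over even MNIST digits:
# log_r = make_log_reward(mnist_classifier, [0, 2, 4, 6, 8])
```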
We apply RTB to KL-regularized finetuning of text-to-image diffusion priors (stable-diffusion-1.5) with a reward model trained on human preferences (ImageReward). RTB-optimized posteriors achieve high reward while maintaining diversity in the generated images. In the images shown below for different text prompts, the first row shows samples from the diffusion prior, the second row shows biased posteriors finetuned with KL-regularized RL (DPOK), and the third row shows posteriors finetuned with RTB.
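In KL-regularized finetuning, the optimal policy is the prior reweighted by an exponentiated, tempered reward, so the constraint takes the form \( r(x) = \exp(\beta \cdot \text{score}(x, \text{prompt})) \), where \( \beta \) corresponds to the inverse KL-regularization strength. A minimal sketch, with `reward_model` standing in for however ImageReward scores are queried:

```python
import torch

def make_log_reward_from_preferences(reward_model, beta=1.0):
    """log r(x) = beta * score(prompt, image); `reward_model` is a placeholder callable."""
    def log_reward(prompts, images):
        with torch.no_grad():
            scores = reward_model(prompts, images)   # (B,) human-preference scores
        return beta * scores
    return log_reward
```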
RTB is generally applicable to hierarchical generative models, including discrete diffusion. We apply RTB to infilling stories with a discrete diffusion model prior, outperforming finetuned autoregressive models for this task.
An important problem in offline RL is KL-regularized policy extraction, using the behavior policy as a prior together with a Q-function trained by an off-the-shelf Q-learning algorithm. Diffusion policies are expressive and can model highly multimodal behavior policies. Given such a diffusion prior \(\mu(a|s)\) and a Q-function \(Q(s,a)\) trained with IQL, we use RTB to obtain the KL-regularized optimal policy of the form \(\pi^*(a|s) \propto \mu(a|s)e^{Q(s,a)}\). We match state-of-the-art results on the D4RL benchmark.
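Under the RTB formulation, this is a state-conditional posterior over actions with constraint \( \log r(a; s) = Q(s, a) \) (a temperature can be absorbed into \( Q \)). A minimal sketch, where `q_network`, the temperature `alpha`, and the per-state normalizer network are illustrative assumptions:

```python
import torch
import torch.nn as nn

def make_log_reward_offline_rl(q_network, alpha=1.0):
    """Constraint for pi*(a|s) ∝ mu(a|s) exp(Q(s,a) / alpha); `q_network` is a frozen IQL critic."""
    def log_reward(states, actions):
        with torch.no_grad():
            q = q_network(states, actions)   # (B,)
        return q / alpha                     # log r(a; s) = Q(s, a) / alpha
    return log_reward

# Because the posterior is conditioned on the state, the normalization constant is
# state-dependent; one option is to predict log Z_phi(s) with a small network.
state_dim = 17  # e.g. a D4RL locomotion task; illustrative
log_Z_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
```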
@inproceedings{venkatraman2024amortizing,
  title={Amortizing intractable inference in diffusion models for vision, language, and control},
  author={Siddarth Venkatraman and Moksh Jain and Luca Scimeca and Minsu Kim and Marcin Sendera and Mohsin Hasan and Luke Rowe and Sarthak Mittal and Pablo Lemos and Emmanuel Bengio and Alexandre Adam and Jarrid Rector-Brooks and Yoshua Bengio and Glen Berseth and Nikolay Malkin},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/forum?id=gVTkMsaaGI}
}