TL;DR
Diffusion models have a closed-form optimal denoiser, but sampling with it merely memorizes the training data. Deep networks instead generalize and produce novel images. What bridges this gap? Locality. Prior work attributed it to CNN architecture. We show it emerges from the statistics of the data itself.

The Problem: Memorization vs. Generalization
Diffusion models are trained to reverse a noising process — given a noisy image, predict the clean original. Remarkably, the training objective has a known closed-form minimizer: the optimal denoiser, which computes the posterior mean E[x₀|xₜ]. This makes diffusion models unique among generative models — we can write down the “perfect” denoiser analytically.
When plugged into the diffusion sampling process, however, the optimal denoiser simply memorizes the training data. Its score function points directly toward individual training images, so sampling reproduces them rather than generating anything new.
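For a finite training set, the optimal denoiser is a softmax-weighted average of the training images. Here is a minimal NumPy sketch, assuming the standard variance-preserving forward process x_t = √ᾱ·x₀ + √(1−ᾱ)·ε; the function name and toy data are our own illustration, not the paper's code:

```python
import numpy as np

def optimal_denoiser(x_t, train_data, alpha_bar):
    """Closed-form minimizer E[x0 | xt] for a finite training set,
    under the forward process x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps."""
    scale = np.sqrt(alpha_bar)
    var = 1.0 - alpha_bar
    # Squared distance from x_t to each scaled training image.
    d2 = np.sum((x_t[None, :] - scale * train_data) ** 2, axis=1)
    # Softmax weights, computed stably by subtracting the max log-weight.
    logw = -d2 / (2 * var)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Posterior mean: a weighted average of training images.
    return w @ train_data

# Toy example: two "training images" of 4 pixels each.
rng = np.random.default_rng(0)
data = np.array([[1.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 1.0]])
alpha_bar = 0.99  # low noise level
x_t = np.sqrt(alpha_bar) * data[0] + np.sqrt(1 - alpha_bar) * rng.standard_normal(4)
x0_hat = optimal_denoiser(x_t, data, alpha_bar)
```

At low noise the softmax collapses onto the nearest training image, so `x0_hat` is essentially `data[0]` — exactly the memorization behavior described above.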
Trained deep networks, on the other hand, generalize: they produce novel, sharp images that are not in the training set. What property of these networks enables this generalization?
Locality is the Key
Previous work shows that locality (i.e., a limited sensitivity field) of diffusion models is key to generalization – see the works by Kamb and Ganguli, and Niedoba et al.
Let's measure how much each input pixel affects one output pixel in the center of the image (the sensitivity field of that center pixel). The model learns to rely only on a limited neighborhood of the image, especially at low noise levels, i.e., closer to the data. In the figure, noise decreases from left to right.
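The same qualitative trend can be reproduced without any trained network. The sketch below uses a 1D linear (Gaussian-prior) denoiser on a stationary process with short-range pixel correlations; the kernel, lengths, and noise levels are our own toy choices. The sensitivity row of the center pixel narrows as the noise level drops:

```python
import numpy as np

n, ell = 64, 4.0
idx = np.arange(n)
# Stationary covariance with short-range pixel correlations,
# a toy stand-in for natural-image statistics.
Sigma = np.exp(-np.abs(idx[:, None] - idx[None, :]) / ell)

def sensitivity_row(sigma_noise, center=n // 2):
    """One row of the Jacobian d E[x0|xt] / d xt for the linear
    (Gaussian-prior) denoiser x0_hat = Sigma (Sigma + s^2 I)^-1 xt."""
    W = Sigma @ np.linalg.inv(Sigma + sigma_noise**2 * np.eye(n))
    return W[center]

def width(row):
    """Effective width of a sensitivity row (inverse participation ratio)."""
    p = np.abs(row) / np.abs(row).sum()
    return 1.0 / np.sum(p**2)

high_noise = sensitivity_row(sigma_noise=3.0)  # early in sampling
low_noise = sensitivity_row(sigma_noise=0.3)   # late, close to the data
```

Both rows peak at the center pixel, but the low-noise row is far more concentrated, mirroring the left-to-right trend in the figure.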

Since locality is central to generalization, it is crucial to understand why diffusion models are local in the first place – this is the key question of our paper. Previously, locality was attributed to the inductive bias of the architecture, i.e., convolutions.
Surprisingly, we find that architecture barely matters: completely different models exhibit almost identical locality patterns on the same data – a UNet, a DiT, and even a simple linear denoiser.

Why does locality look like this?
While locality is similar across models, it changes substantially across datasets. We show that the locality learned by diffusion models is well explained by the principal components of the dataset, which leads to some curious effects.
For natural images, the principal components of the data approximate the Fourier basis, so the learned locality is compact and roughly isotropic.
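A quick way to see the Fourier connection: if pixel correlations depend only on distance (translation invariance, with periodic boundaries for simplicity), the data covariance is circulant, and circulant matrices are diagonalized exactly by the DFT – so the principal components are Fourier modes. A small numerical check of this textbook fact (the kernel is an arbitrary toy choice):

```python
import numpy as np

n = 32
# Circulant (translation-invariant) covariance: correlation depends
# only on the circular pixel distance.
d = np.minimum(np.arange(n), n - np.arange(n))
c = np.exp(-d / 4.0)                            # first row of the covariance
Sigma = np.array([np.roll(c, k) for k in range(n)])

# The unitary DFT matrix diagonalizes any circulant matrix, so the
# eigenvectors (principal components) are exactly Fourier modes.
F = np.fft.fft(np.eye(n)) / np.sqrt(n)
D = F @ Sigma @ F.conj().T
off_diag = np.abs(D - np.diag(np.diag(D))).max()
```

`off_diag` is at machine precision: the Fourier basis diagonalizes the covariance, so PCA on stationary data recovers Fourier modes.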
For more specialized datasets, such as CelebA-HQ, faces appear in the sensitivity fields of the trained diffusion models, and the locality is no longer compact or isotropic.
Note that the locality on the right emerges in the sensitivity field of the trained diffusion model!

Moreover, we can even manipulate the learned locality field by changing the pixel correlation pattern across the dataset.
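This effect can be sketched with the linear (Gaussian-prior) denoiser: if the dataset ties a far-away pixel to the center pixel, the center pixel's sensitivity acquires a long-range lobe at that pixel. The construction below is our own 1D toy, not the paper's experiment:

```python
import numpy as np

n, c, f = 64, 32, 8          # signal length, center pixel, far pixel
idx = np.arange(n)
Sigma0 = np.exp(-np.abs(idx[:, None] - idx[None, :]) / 2.0)  # local correlations

# Build a dataset-level covariance in which pixel f is a noisy copy of
# pixel c (x[f] = x[c] + small noise), giving a strong long-range correlation.
A = np.eye(n)
A[f] = 0.0
A[f, c] = 1.0
Sigma = A @ Sigma0 @ A.T + 0.01 * np.outer(np.eye(n)[f], np.eye(n)[f])

# Sensitivity row of the linear (Gaussian-prior) denoiser at the center pixel.
sigma_noise = 0.5
W = Sigma @ np.linalg.inv(Sigma + sigma_noise**2 * np.eye(n))
row = W[c]
```

Besides the usual local peak at the center, `row` now has a large entry at pixel `f` – the learned locality follows the pixel-correlation pattern of the data, not spatial distance.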

Results
Building on these insights, we construct an analytical denoiser grounded in data statistics rather than assumed architectural biases. It matches the scores predicted by a trained UNet better than prior expert-crafted analytical alternatives, and it produces higher-quality generated samples.
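The simplest denoiser grounded purely in data statistics is the linear (Gaussian-prior) one, which needs only the data covariance; the paper's construction is more refined, but even this toy version, run inside a deterministic DDIM-style sampling loop, generates samples with the data's correlation structure. The 1D setup, schedule, and parameters below are our own illustrative choices:

```python
import numpy as np

n, T, m = 64, 50, 200
idx = np.arange(n)
Sigma = np.exp(-np.abs(idx[:, None] - idx[None, :]) / 2.0)  # data covariance

# alpha_bar schedule, increasing from pure noise toward clean data.
abar = np.linspace(1e-3, 0.9995, T)

def denoise(x_t, a):
    """Gaussian-prior posterior mean E[x0 | xt] under the VP process:
    E[x0|xt] = sqrt(a) * Sigma (a*Sigma + (1-a) I)^-1 xt."""
    return np.sqrt(a) * Sigma @ np.linalg.solve(a * Sigma + (1 - a) * np.eye(n), x_t)

rng = np.random.default_rng(0)
x = rng.standard_normal((m, n))     # start m chains from pure noise
for t in range(T - 1):              # deterministic DDIM-style updates
    a, a_next = abar[t], abar[t + 1]
    x0_hat = denoise(x.T, a).T
    eps_hat = (x - np.sqrt(a) * x0_hat) / np.sqrt(1 - a)
    x = np.sqrt(a_next) * x0_hat + np.sqrt(1 - a_next) * eps_hat

samples = x / np.sqrt(abar[-1])     # rescale to the data scale
# Generated samples inherit the data's strong positive neighbor correlation
# (about 0.6 for this kernel), unlike the white noise they started from.
neighbor_corr = np.mean(samples[:, :-1] * samples[:, 1:]) / samples.var()
```

Swapping in a richer data-statistics denoiser changes only the `denoise` function; the sampling loop stays the same.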

Explore Further
Interested in learning more? Read the full paper, explore our codebase, or try out the interactive demo.
Citation
If you find this work useful, please cite:
@inproceedings{lukoianovlocality,
title={Locality in Image Diffusion Models Emerges from Data Statistics},
author={Lukoianov, Artem and Yuan, Chenyang and Solomon, Justin and Sitzmann, Vincent},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://locality.lukoianov.com/},
}