NeurIPS 2025 Spotlight

Locality in Image Diffusion Models
Emerges from Data Statistics

Artem Lukoianov1, Chenyang Yuan2, Justin Solomon1, Vincent Sitzmann1

1MIT    2Toyota Research Institute

TL;DR

Diffusion models have a closed-form optimal denoiser, but sampling with it merely memorizes the training data. Deep networks instead generalize and produce novel images. What bridges this gap? Locality. Prior work attributed it to CNN architecture. We show it emerges from the statistics of the data itself.

Samples generated by our analytical (training-free) denoiser compared to a trained UNet given the same initial noise level.

The Problem: Memorization vs. Generalization

Diffusion models are trained to reverse a noising process — given a noisy image, predict the clean original. Remarkably, the training objective has a known closed-form minimizer: the optimal denoiser, which computes the posterior mean E[x₀|xₜ]. This makes diffusion models unique among generative models — we can write down the “perfect” denoiser analytically.

When plugged into the diffusion sampling process, however, the optimal denoiser simply memorizes the training data. Its score function points directly toward individual training images, so sampling reproduces them rather than generating anything new.
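To make the memorization claim concrete, here is a minimal sketch of the closed-form optimal denoiser for an empirical dataset, assuming the simple noising model x_t = x_0 + σ·ε (a variance-preserving schedule differs only by scaling). The posterior mean is a softmax-weighted average of the training points:

```python
import numpy as np

def optimal_denoiser(x_t, data, sigma):
    """Posterior mean E[x0 | x_t] under the empirical data distribution,
    assuming the noising model x_t = x_0 + sigma * eps.
    data: (N, D) training points; x_t: (D,) noisy input."""
    d2 = np.sum((data - x_t) ** 2, axis=1)   # squared distance to each training point
    logw = -d2 / (2.0 * sigma ** 2)
    w = np.exp(logw - logw.max())            # numerically stable softmax weights
    w /= w.sum()
    return w @ data                          # convex combination of training points
```

As σ shrinks, the weights collapse onto the single nearest training image, so the reverse process is pulled straight into the training set — exactly the memorization behavior described above.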

Trained deep networks, on the other hand, generalize: they produce novel, sharp images that are not in the training set. What property of these networks enables this generalization?

Locality is the Key

Previous work shows that locality (i.e., a limited effective receptive field) of diffusion models is key to generalization – see the works by Kamb and Ganguli, and Niedoba et al.

Let's measure how much each input pixel affects one output pixel at the center of the image (the sensitivity of the center pixel). The model learns to rely only on a limited neighborhood of the input, especially at lower noise levels (closer to the data).

Sensitivity of the trained UNet (norm of the gradient of a center pixel with respect to all input pixels) reveals local effective receptive fields — each output pixel depends mainly on nearby inputs. Here images on the left correspond to more noise, on the right less noise.
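The sensitivity map in the figure can be computed with a few lines of autograd. A minimal sketch (the function name and the toy conv model below are illustrative, not from the paper's codebase): backpropagate from the center output pixel and take the gradient norm over channels at each input location.

```python
import torch

def center_pixel_sensitivity(model, x):
    """Gradient norm of the center output pixel w.r.t. every input pixel.
    x: (1, C, H, W) tensor; returns an (H, W) sensitivity map."""
    x = x.clone().requires_grad_(True)
    out = model(x)
    _, _, h, w = out.shape
    out[0, :, h // 2, w // 2].sum().backward()  # grad of the center pixel only
    return x.grad[0].norm(dim=0)                # norm over input channels
```

For a single 3×3 convolution the map is nonzero only in a 3×3 neighborhood of the center; for a trained denoiser it reveals the effective receptive field shown above.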

Since locality is central to generalization, it is crucial to understand why diffusion models are local in the first place – this is the key question of our paper. Previously, locality was attributed to the inductive bias of the architecture, i.e., convolutions.

Surprisingly, we find that architecture barely matters: completely different models – a UNet, a DiT, and even a simple linear denoiser – exhibit almost identical locality patterns on the same data.

Locality patterns are the same for different models: UNet, DiT, and even a simple linear denoiser.

Why does locality look like this?

While locality is similar across models, the locality patterns change substantially with the dataset. We show that the locality learned by diffusion models is well explained by the principal components of the dataset. This leads to some curious effects.

For natural images, the principal components of the data approximate the Fourier basis, so the learned locality is compact and roughly isotropic.
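The Fourier connection follows from (approximate) translation invariance: the covariance of translation-invariant data is circulant, and the eigenvectors of a circulant matrix are exactly the Fourier modes. A toy 1D check with an assumed Gaussian correlation kernel:

```python
import numpy as np

# For (roughly) translation-invariant data, the covariance is circulant,
# and the eigen-decomposition of a circulant matrix is the Fourier basis.
n = 64
lags = np.minimum(np.arange(n), n - np.arange(n))
k = np.exp(-0.5 * (lags / 3.0) ** 2)             # symmetric correlation kernel
C = np.array([np.roll(k, i) for i in range(n)])  # circulant covariance matrix
# Eigenvalues of a circulant matrix are the DFT of its first row:
fft_eigs = np.sort(np.fft.fft(k).real)
np_eigs = np.sort(np.linalg.eigvalsh(C))
```

The DFT of the kernel and the numerical eigenvalues of the covariance coincide, which is why PCA on natural images recovers Fourier-like components.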

For more specialized datasets, like CelebA-HQ, we can see faces appearing in the sensitivity fields of the trained diffusion models. Now, locality is no longer compact, nor is it isotropic.

Note that the pattern on the right emerges in the sensitivity of the trained diffusion model!


Moreover, we can even manipulate the learned locality field by changing the pixel correlation pattern across the dataset.

Here we add a 'W'-shaped correlation pattern to the dataset, train a diffusion model, and visualize its learned sensitivity. The learned locality is no longer compact, nor is it isotropic.
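Why does the injected pattern show up in the sensitivity? Adding a shared pattern with random per-image strength inserts a rank-one term into the data covariance, making the pattern a dominant principal component — which, by the argument above, the learned locality then reflects. A minimal sketch (the sparse 1D mask below is a stand-in for the paper's 'W' shape):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 64
pattern = np.zeros(d)
pattern[[5, 20, 35, 50]] = 1.0                  # stand-in for the 'W' mask
pattern /= np.linalg.norm(pattern)
amp = 5.0 * rng.normal(size=(n, 1))             # per-image strength of the pattern
data = rng.normal(size=(n, d)) + amp * pattern  # i.i.d. pixels + shared correlation
cov = np.cov(data, rowvar=False)                # approx. I + 25 * pattern pattern^T
top_pc = np.linalg.eigh(cov)[1][:, -1]          # eigenvector of largest eigenvalue
```

The top principal component aligns almost perfectly with the injected pattern, so any denoiser that follows the data statistics inherits it.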

Results

Building on these insights, we construct an analytical denoiser grounded in data statistics rather than assumed architectural biases. It better matches the scores predicted by a trained UNet than the prior expert-crafted analytical alternatives and produces higher-quality generated samples.

Our data-statistics-based analytical denoiser better matches the scores of a trained deep diffusion model and produces higher-quality samples compared to prior analytical baselines.
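As a minimal illustration of a denoiser built from data statistics (not the paper's full construction, which goes beyond second-order statistics), here is the classical Gaussian posterior-mean (Wiener) denoiser, which uses only the dataset mean and covariance:

```python
import numpy as np

def gaussian_denoiser(x_t, mean, cov, sigma):
    """Linear (Wiener) posterior-mean denoiser for x_t = x_0 + sigma * eps
    when x_0 ~ N(mean, cov). A second-order-statistics baseline only,
    not the paper's full analytical denoiser."""
    d = cov.shape[0]
    gain = cov @ np.linalg.inv(cov + sigma ** 2 * np.eye(d))  # Wiener gain
    return mean + gain @ (x_t - mean)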

Explore Further

Interested in learning more? Read the full paper, explore our codebase, or try out the interactive demo.

Citation

If you find this work useful, please cite:

@inproceedings{lukoianovlocality,
    title={Locality in Image Diffusion Models Emerges from Data Statistics},
    author={Lukoianov, Artem and Yuan, Chenyang and Solomon, Justin and Sitzmann, Vincent},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
    year={2025},
    primaryClass={cs.CV},
    url={https://locality.lukoianov.com/},
}