## Inspiration

Real patient data is locked behind HIPAA. Synthetic data promises a privacy-safe
stand-in — but only if it's actually statistically faithful. We wanted to put that
to a rigorous test: train four generative models on 124k patient records and measure
exactly how much utility is lost when you swap real data for synthetic.

## What we built

Synthara — an end-to-end benchmark pipeline that trains four generators (Gaussian Copula, CTGAN, TVAE, and a custom diffusion model) on Synthea EHR data, then evaluates each on realism, downstream ML utility, and privacy leakage.

The diffusion model is a 3.34M parameter MLP denoiser with sinusoidal timestep
embeddings, trained via a 200-step linear noise schedule:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I\right)$$
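As a sketch, this forward process takes only a few lines of NumPy. The 200 steps and linear schedule come from the description above; the schedule endpoints (1e-4 to 0.02) are the common DDPM defaults, assumed here rather than taken from the project's code:

```python
import numpy as np

T = 200                                    # diffusion steps, per the write-up
betas = np.linspace(1e-4, 0.02, T)         # linear noise schedule (endpoints assumed)
alphas_bar = np.cumprod(1.0 - betas)       # composing the per-step Gaussians gives
                                           # q(x_t | x_0) = N(sqrt(abar_t) x_0, (1-abar_t) I)

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = np.zeros((4, 8))                      # toy batch: 4 records, 8 features
xT = q_sample(x0, T - 1, rng)              # by step T the sample is close to N(0, I)
```

The denoiser MLP is then trained to predict the added noise `eps` from `x_t` and the (sinusoidally embedded) timestep `t`.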

## Results

A mortality classifier trained only on TabDDPM's synthetic data reached AUROC
0.9906 on real patients, vs. 0.9926 for a classifier trained on real data. On correlation distance (how well joint feature relationships are preserved), diffusion scored 0.503 vs. 1.65–3.25 for every other
generator. All four models passed the privacy checks cleanly.
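One common way to define a correlation-distance metric is the elementwise distance between the real and synthetic Pearson correlation matrices; the project's exact formula may differ, but a minimal sketch looks like this (all data below is toy data for illustration):

```python
import numpy as np

def correlation_distance(real, synth):
    """Sum of absolute differences between pairwise Pearson correlation
    matrices (one common definition; the project's exact metric may differ)."""
    c_real = np.corrcoef(real, rowvar=False)
    c_synth = np.corrcoef(synth, rowvar=False)
    return float(np.abs(c_real - c_synth).sum())

rng = np.random.default_rng(0)
real = rng.standard_normal((1000, 3))
real[:, 2] = real[:, 0] + 0.1 * rng.standard_normal(1000)   # two co-varying features
faithful = real + 0.05 * rng.standard_normal(real.shape)    # keeps the joint structure
independent = rng.standard_normal(real.shape)               # destroys it

print(correlation_distance(real, faithful), correlation_distance(real, independent))
```

A generator that matches every marginal but samples features independently still scores badly here, which is exactly what separates diffusion from the other three generators.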

## Challenges

Zero-inflated distributions: over 90% of patients have zero inpatient visits,
a spike-at-zero that no generator could replicate. That one failure kept discriminator AUROC pinned at 1.0 no matter what else we improved.
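The failure mode is easy to quantify: measure the exact-zero mass in a column and compare real vs. synthetic. A toy illustration (column and helper names are ours, not the project's):

```python
import numpy as np

def zero_mass(col, tol=1e-9):
    """Fraction of (near-)exact zeros in a column."""
    col = np.asarray(col, dtype=float)
    return float(np.mean(np.abs(col) < tol))

rng = np.random.default_rng(0)
# Zero-inflated count column: ~90% structural zeros, Poisson counts otherwise.
real = np.where(rng.random(10_000) < 0.9, 0, rng.poisson(3, 10_000))
# A smooth generator matching only mean/variance smears the spike away.
synth = rng.normal(real.mean(), real.std(), 10_000)

print(zero_mass(real), zero_mass(synth))
```

A discriminator only needs to threshold on "is this value exactly zero?" to separate real from synthetic, which is why the spike alone pins its AUROC at 1.0.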

The diffusion model outputs continuous values for binary columns, including the
DECEASED target. Passing fractional targets to XGBoost silently breaks classification, so we had to identify every binary column and snap its values back to {0, 1} post-generation.
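The post-generation fix can be sketched as a simple thresholding pass. The function name, the 0.5 threshold, and the `bmi` column are illustrative assumptions; DECEASED is the real target column from the dataset:

```python
import pandas as pd

def snap_binary_columns(df, binary_cols, threshold=0.5):
    """Round continuous generator outputs back to {0, 1} for binary columns."""
    out = df.copy()
    for col in binary_cols:
        out[col] = (out[col].astype(float) >= threshold).astype(int)
    return out

# Toy generator output: fractional values where a 0/1 label is expected.
synth = pd.DataFrame({"DECEASED": [0.03, 0.91, 0.48], "bmi": [24.1, 31.7, 27.9]})
clean = snap_binary_columns(synth, ["DECEASED"])
print(clean["DECEASED"].tolist())   # [0, 1, 0]
```

Continuous columns like `bmi` pass through untouched; only the declared binary columns are snapped.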

## What we learned

Diffusion models don't just win on one metric: they preserve the joint distribution of features. If the generator learns which features co-vary (BMI and diabetes, age and inpatient visits), downstream classifiers can exploit those
relationships just as they would with real data. That's why the AUROC gap narrows to just 0.002.
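The evaluation protocol behind that gap is "train on synthetic, test on real" (TSTR). The project used XGBoost; the sketch below uses scikit-learn's logistic regression to stay dependency-light, and all data is toy data built to mimic two co-varying features driving a binary outcome:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_cohort(n):
    """Toy cohort: two correlated features (think BMI and diabetes) drive the label."""
    X = rng.standard_normal((n, 2))
    X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]              # co-varying features
    y = (X.sum(axis=1) + 0.5 * rng.standard_normal(n) > 0).astype(int)
    return X, y

X_real, y_real = make_cohort(5_000)
X_synth, y_synth = make_cohort(5_000)    # stand-in for a generator that kept the joint structure

clf = LogisticRegression().fit(X_synth, y_synth)          # train on synthetic only
auroc = roc_auc_score(y_real, clf.predict_proba(X_real)[:, 1])
print(f"TSTR AUROC: {auroc:.3f}")
```

Because the "synthetic" cohort here shares the real cohort's joint feature distribution, the classifier transfers cleanly; a generator that only matched marginals would not.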
