## Inspiration
Real patient data is locked behind HIPAA. Synthetic data promises a privacy-safe
stand-in — but only if it's actually statistically faithful. We wanted to put that
to a rigorous test: train four generative models on 124k patient records and measure
exactly how much utility is lost when you swap real data for synthetic.
## What we built
Synthara is an end-to-end benchmark pipeline that trains four generators (Gaussian Copula, CTGAN, TVAE, and a custom diffusion model) on Synthea EHR data, then evaluates each on realism, downstream ML utility, and privacy leakage.
The diffusion model is a 3.34M-parameter MLP denoiser with sinusoidal timestep
embeddings, trained with a 200-step linear noise schedule. Each forward step adds
Gaussian noise with variance $\beta_t$:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\, \beta_t I\right)$$
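The forward process above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's training code: the schedule endpoints `beta_start` and `beta_end` are assumed values (the write-up only says the schedule is linear with 200 steps), and the function names are hypothetical. `q_sample` uses the standard closed form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\bar\alpha_t = \prod_{s \le t}(1-\beta_s)$, which follows from chaining the single-step Gaussians.

```python
import numpy as np

T = 200  # number of diffusion steps, as stated in the write-up


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (endpoints are assumed values)."""
    return np.linspace(beta_start, beta_end, num_steps)


def q_step(x_prev, t, betas, rng):
    """One forward step: x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I).

    t is a 0-based index into betas.
    """
    beta_t = betas[t]
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise


def q_sample(x0, t, betas, rng):
    """Jump from x_0 directly to step t using the closed form."""
    alpha_bar = np.cumprod(1.0 - betas)  # running product of (1 - beta_s)
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise


betas = linear_beta_schedule(T)
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))           # stand-in for 4 normalized rows
x_noisy = q_sample(x0, T - 1, betas, rng)  # fully noised sample at the last step
```

During training, a denoiser would typically be shown `x_noisy` together with an embedding of `t` and asked to predict the noise that was added.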
## Results
A downstream classifier trained only on synthetic data from the diffusion model
(TabDDPM) reached AUROC 0.9906 on real-patient mortality prediction, vs. 0.9926
when trained on real data. On correlation distance (how far the joint feature
relationships drift from the real data; lower is better), diffusion scored 0.503
vs. 1.65–3.25 for the other three generators. All four models passed the
privacy-leakage checks.
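For readers who want a concrete picture of a correlation-distance metric, here is one common formulation. It is an assumption, not necessarily the exact definition behind the numbers above: the Frobenius norm of the difference between the Pearson correlation matrices of the real and synthetic tables. The function name and the toy data are hypothetical.

```python
import numpy as np


def correlation_distance(real, synth):
    """Frobenius norm of the gap between the two Pearson correlation matrices.

    Both inputs are 2-D arrays with rows as records and columns as features.
    Constant columns would produce NaNs and should be dropped beforehand.
    """
    corr_real = np.corrcoef(real, rowvar=False)
    corr_synth = np.corrcoef(synth, rowvar=False)
    return float(np.linalg.norm(corr_real - corr_synth, ord="fro"))


rng = np.random.default_rng(0)
cov = [[1.0, 0.8], [0.8, 1.0]]  # two strongly correlated toy features

real = rng.multivariate_normal([0.0, 0.0], cov, size=5000)
# A faithful "generator": a fresh draw from the same joint distribution.
synth_good = rng.multivariate_normal([0.0, 0.0], cov, size=5000)
# A poor "generator": correct marginals, but each column shuffled independently,
# which destroys the relationship between the features.
synth_bad = np.column_stack(
    [rng.permutation(real[:, j]) for j in range(real.shape[1])]
)

d_good = correlation_distance(real, synth_good)
d_bad = correlation_distance(real, synth_bad)
```

The shuffled table matches every single-column distribution yet scores far worse than the fresh draw, which is the failure mode a joint-structure metric is meant to expose.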
## Challenges
Zero-inflated distributions: 90%+ of patients have zero inpatient visits,
a spike at zero that no generator could replicate. That mismatch kept the
real-vs-synthetic discriminator AUROC pinned at 1.0 regardless of every other
improvement.
Diffusion outputs continuous values for binary columns, including the
DECEASED target. Passing fractional targets to XGBoost silently breaks
classification, so we had to identify every binary column and snap it to 0 or 1
after generation.
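The snapping step described above might look roughly like this. It is a sketch under assumptions: binary columns are detected from the real table (any column whose only values are 0 and 1), and synthetic values are thresholded at 0.5. The helper names, the threshold, and the toy arrays are all hypothetical, not taken from the project's code.

```python
import numpy as np


def find_binary_columns(real):
    """Indices of columns in the real data whose only values are 0 and 1."""
    return [
        j for j in range(real.shape[1])
        if np.isin(np.unique(real[:, j]), (0, 1)).all()
    ]


def snap_binary_columns(synth, binary_cols, threshold=0.5):
    """Return a copy of synth with the given columns thresholded to {0, 1}."""
    snapped = synth.copy()
    for j in binary_cols:
        snapped[:, j] = (synth[:, j] >= threshold).astype(synth.dtype)
    return snapped


# Toy example: column 0 plays the role of a binary target, column 1 is continuous.
real = np.array([[0.0, 23.5], [1.0, 31.0], [0.0, 27.2]])
synth = np.array([[0.93, 24.1], [0.08, 29.7], [0.51, 26.0]])

binary_cols = find_binary_columns(real)            # -> [0]
snapped = snap_binary_columns(synth, binary_cols)  # column 0 becomes 1, 0, 1
```

Detecting the columns from the real table rather than the synthetic one matters here, because the generated values are fractional and would never pass a 0/1 check themselves.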
## What we learned
The diffusion model didn't just win on one metric; it preserved the joint
distribution of features. If a generator learns which features co-vary (BMI and
diabetes, age and inpatient visits), downstream classifiers can exploit those
relationships just as they would with real data. That's why the AUROC gap to
real data shrinks to just 0.002.
## Built With
- amazon-web-services
- matplotlib
- numpy
- pandas
- python
- pytorch
- sagemaker
- scikit-learn
- scipy
- sdv
- xgboost