About SynthForge
Inspiration
SynthForge was born from a recurring problem I faced while building machine learning projects: access to usable data. Real-world datasets are often scarce, sensitive, or locked behind privacy and compliance constraints. At the same time, many early-stage developers and researchers don’t have the infrastructure or budget to rely on large-scale data collection or annotation pipelines. I wanted to explore whether careful statistics, heuristics, and lightweight AI could meaningfully reduce this barrier and make AI experimentation more accessible.
What I Learned
This project deepened my understanding of data-centric AI—specifically, how model performance is often constrained more by data quality than by model architecture. I learned how statistical distributions, variance control, and noise mechanisms (e.g., Laplace noise for differential privacy) affect downstream learning. I also gained hands-on experience in balancing privacy and utility, handling messy real-world data (NaNs, skewed distributions), and designing systems that work under real resource constraints rather than ideal conditions.
How I Built It
SynthForge is implemented as a lightweight Streamlit application with a modular data pipeline:
- Data ingestion: CSV/Excel uploads with sampling to maintain responsiveness.
- Synthetic generation: Column-wise synthesis using probabilistic sampling for categorical data and Gaussian modeling for numerical data, with variance control.
- Privacy layer: PII detection via regex and optional differential privacy using Laplace noise:
\[ x' = x + \text{Laplace}\left(0, \frac{\Delta f}{\epsilon}\right) \]
- Auto-labeling: Rule-based sentiment analysis, binning, and clustering, with optional LLM-assisted labeling.
- Evaluation: Side-by-side statistical comparisons and downloadable reports.
The focus was on correctness, transparency, and fast iteration rather than black-box generation. The sketches below illustrate roughly how each stage works.
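For ingestion, a minimal Streamlit sketch might look like the following; the row cap and widget label are illustrative placeholders, not SynthForge's actual values:

```python
import pandas as pd
import streamlit as st

MAX_ROWS = 5_000  # illustrative cap, not the app's actual limit

uploaded = st.file_uploader("Upload a CSV or Excel file", type=["csv", "xlsx"])
if uploaded is not None:
    df = (pd.read_csv(uploaded) if uploaded.name.endswith(".csv")
          else pd.read_excel(uploaded))
    if len(df) > MAX_ROWS:
        # downsample large uploads so the UI stays responsive
        df = df.sample(MAX_ROWS, random_state=0)
```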
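The column-wise synthesis could be sketched roughly as below, assuming pandas/NumPy and treating columns independently; the function names and the `variance_scale` knob are illustrative, not the exact implementation:

```python
import numpy as np
import pandas as pd

def synthesize_column(col: pd.Series, n: int, variance_scale: float = 1.0,
                      rng: np.random.Generator | None = None) -> pd.Series:
    """One column: empirical sampling for categoricals, a fitted
    Gaussian with adjustable variance for numerics."""
    rng = rng or np.random.default_rng()
    col = col.dropna()  # drop NaNs so they don't corrupt the fit
    if pd.api.types.is_numeric_dtype(col):
        mu, sigma = col.mean(), col.std(ddof=0)
        return pd.Series(rng.normal(mu, sigma * variance_scale, size=n))
    probs = col.value_counts(normalize=True)  # empirical category frequencies
    return pd.Series(rng.choice(probs.index.to_numpy(), size=n, p=probs.to_numpy()))

def synthesize(df: pd.DataFrame, n: int, variance_scale: float = 1.0) -> pd.DataFrame:
    """Build a synthetic frame column by column (columns assumed independent)."""
    return pd.DataFrame({c: synthesize_column(df[c], n, variance_scale)
                         for c in df.columns})
```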
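The privacy layer, in sketch form, pairs a regex check with the Laplace mechanism from the formula above; the email pattern and the sensitivity value supplied by the caller are placeholders, not SynthForge's exact detection rules:

```python
import re
import numpy as np

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # simple illustrative pattern

def looks_like_pii(value: str) -> bool:
    """Flag values matching a PII pattern (only emails here, for brevity)."""
    return bool(EMAIL_RE.search(str(value)))

def add_laplace_noise(values: np.ndarray, epsilon: float, sensitivity: float,
                      rng: np.random.Generator | None = None) -> np.ndarray:
    """Laplace mechanism: x' = x + Laplace(0, Δf / ε)."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon  # stricter privacy (smaller ε) => more noise
    return values + rng.laplace(loc=0.0, scale=scale, size=values.shape)
```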
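A toy version of the rule-based labeling is below; the lexicons and bucket names are stand-ins for the app's real rules, and the clustering and LLM-assisted paths are omitted:

```python
import pandas as pd

POSITIVE = {"good", "great", "love", "excellent"}   # illustrative lexicons,
NEGATIVE = {"bad", "poor", "hate", "terrible"}      # not the app's real word lists

def label_sentiment(text: str) -> str:
    """Rule-based sentiment: compare lexicon hit counts."""
    words = set(str(text).lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def label_bins(col: pd.Series, names=("low", "medium", "high")) -> pd.Series:
    """Numeric auto-labeling via equal-width binning."""
    return pd.cut(col, bins=len(names), labels=list(names))
```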
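Finally, a sketch of the side-by-side evaluation; the two-sample Kolmogorov-Smirnov test is one reasonable fidelity check here, not necessarily the exact metrics in the downloadable reports:

```python
import pandas as pd
from scipy import stats

def compare_numeric(real: pd.DataFrame, synth: pd.DataFrame) -> pd.DataFrame:
    """Side-by-side moments plus a two-sample KS test per numeric column."""
    rows = []
    for c in real.columns:
        if pd.api.types.is_numeric_dtype(real[c]):
            ks = stats.ks_2samp(real[c].dropna(), synth[c].dropna())
            rows.append({
                "column": c,
                "real_mean": real[c].mean(),  "synth_mean": synth[c].mean(),
                "real_std":  real[c].std(),   "synth_std":  synth[c].std(),
                "ks_stat": ks.statistic,      "ks_pvalue": ks.pvalue,
            })
    return pd.DataFrame(rows)
```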
Challenges
The biggest challenge was maintaining statistical fidelity while ensuring privacy guarantees, especially with small or noisy datasets. Handling missing values and edge cases without corrupting distributions required careful design. Another challenge was deciding what not to build—keeping the MVP focused while resisting feature creep. Finally, designing for constrained environments forced me to optimize for simplicity, robustness, and clarity rather than brute-force computing.
SynthForge is an ongoing exploration into how far thoughtful engineering can go in democratizing access to high-quality training data for AI systems.