Inspiration
In machine learning and data science, the most important factor is not the model; it's the data. A simple model trained on clean, well-understood data will almost always outperform a sophisticated model trained on noisy, inconsistent, or poorly prepared datasets. Despite this, data cleaning remains one of the most painful and inaccessible parts of the ML workflow.
The problem isn't a lack of motivation; it's complexity. To properly prepare data, you're expected to:
- Be fluent in tools like pandas, NumPy, and scikit-learn
- Understand statistical concepts such as outliers, distributions, and imputation strategies
- Know how to diagnose issues like leakage, cardinality explosions, or schema inconsistencies
- Make decisions that directly affect model performance, often without feedback
For many developers, students, and even experienced engineers, this becomes a bottleneck. People either:
- Skip proper validation entirely
- Rely on copy-pasted cleaning scripts
- Blindly apply transformations without understanding their impact
We wanted to change that. DataSmith was built to make high-quality data preparation easy, transparent, and accessible to anyone. Our goal was to create a developer tool that:
- Ensures users are working with good datasets
- Explains why certain cleaning decisions are made
- Can be dropped into any workflow with a single command
Whether it's a website backend, a research notebook, a production pipeline, or a small script, DataSmith is designed to be flexible for any use case while remaining simple enough for anyone to use.
What It Does
DataSmith is an agentic data cleaning pipeline that prepares datasets for machine learning using coordinated decision-making agents. With a single command:
python main.py --kaggle "ethereum-dataset" --ai \
--output clean.csv \
--instructions "Remove duplicate rows"
From that one command, DataSmith will:
- Download the dataset from Kaggle
- Profile the data (a short profiling sketch appears below):
  - Missing values
  - Duplicates
  - Outliers
  - Cardinality
  - Data types
- Coordinate specialized agents to analyze the dataset
- Generate a deterministic execution plan
- Clean the data using reproducible transformations
- Compute a composite quality score (0–100%)
- Generate an interactive dashboard
- Save both the cleaned dataset and a full pipeline audit JSON
The result is not just clean data, but understanding of and confidence in the data you are using.
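To make the profiling step concrete, here is a minimal sketch of the kind of per-column statistics DataSmith gathers with pandas. It is illustrative only; the real profiler does more (including outlier and type checks), and the file name is a placeholder:

import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    # Basic health statistics for the dataset
    return {
        "missing": df.isna().mean().round(3).to_dict(),   # fraction of missing values per column
        "duplicates": int(df.duplicated().sum()),         # number of exact duplicate rows
        "cardinality": df.nunique().to_dict(),            # distinct values per column
        "dtypes": df.dtypes.astype(str).to_dict(),        # inferred data types
    }

df = pd.read_csv("raw.csv")   # placeholder input file
print(profile(df))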
Under the hood, that CLI command calls the following function, which exposes many more parameters than the flags shown above:
run_datasmith(
input_file=input_file,
output_file=output_file,
pipeline_file=pipeline_file,
agent=agent,
rows=rows,
cols=cols,
instructions=instructions,
verbose=verbose,
download_kaggle=download_kaggle,
kaggle_dataset=kaggle_dataset,
show_dashboard=dashboard,
dev_mode=dev,
)
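For context, a Typer entry point along these lines could translate the flags from the example command into that call. The flag-to-parameter mapping below is an assumption for illustration (run_datasmith is replaced by a stub here), not DataSmith's actual wiring:

import typer

app = typer.Typer()

def run_datasmith(**kwargs):
    # Stand-in for DataSmith's real pipeline entry point (not shown in this write-up)
    print(kwargs)

@app.command()
def main(
    kaggle: str = typer.Option("", "--kaggle", help="Kaggle dataset slug to download"),
    ai: bool = typer.Option(False, "--ai", help="Enable the AI (agent) mode"),
    output: str = typer.Option("clean.csv", "--output", help="Path for the cleaned CSV"),
    instructions: str = typer.Option("", "--instructions", help="Free-form cleaning instructions"),
    dashboard: bool = typer.Option(False, "--dashboard", help="Generate the interactive dashboard"),
    dev: bool = typer.Option(False, "--dev", help="Enable observability"),
):
    # Illustrative mapping of CLI flags to pipeline parameters
    run_datasmith(
        kaggle_dataset=kaggle,
        download_kaggle=bool(kaggle),
        agent=ai,
        output_file=output,
        instructions=instructions,
        show_dashboard=dashboard,
        dev_mode=dev,
    )

if __name__ == "__main__":
    app()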
How We Built It
Core Stack
- Python — pipeline orchestration
- LangChain — multi-agent coordination
- pandas / NumPy / scikit-learn — statistical operations
- Typer — clean and composable CLI
- @dataclass — structured agent outputs
- Chart.js — interactive dashboards
DataSmith supports OpenAI, Anthropic, and xAI models with no code changes.
Agent Architecture
DataSmith uses specialized agents, each responsible for a single aspect of data quality.
Missing Value Agent
- Drop rows
- Impute values (mean, median, mode, constant)
- Flag columns as high-risk
Runs in:
- Rule-based mode (fast, deterministic)
- AI mode (context-aware reasoning)
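In rule-based mode, the decisions look roughly like this sketch. The thresholds (5% and 50%) and the median/mode choices are illustrative assumptions, not DataSmith's exact rules:

import pandas as pd

def handle_missing(df: pd.DataFrame) -> tuple[pd.DataFrame, list[str]]:
    high_risk = []
    for col in df.columns:
        frac = df[col].isna().mean()
        if frac == 0:
            continue
        if frac < 0.05:
            df = df.dropna(subset=[col])                          # few gaps: drop the affected rows
        elif frac < 0.5:
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())        # numeric: impute with the median
            else:
                df[col] = df[col].fillna(df[col].mode().iloc[0])  # categorical: impute with the mode
        else:
            high_risk.append(col)                                 # mostly missing: flag as high-risk instead
    return df, high_risk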
Outlier Agent
- Uses the IQR method (flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR)
- Detects statistically anomalous values
- Flags risk without blindly removing data
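The IQR rule itself is standard; a minimal sketch of the detection (flagging, not removing) looks like this:

import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    # True wherever the value falls outside [Q1 - k*IQR, Q3 + k*IQR]
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

prices = pd.Series([10, 11, 12, 10, 11, 400])
print(iqr_outliers(prices))   # flags 400 without dropping it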
Validation Agent
Checks for:
- Data leakage risks
- High-cardinality features
- Severe imbalance
- Schema inconsistencies
- Duplicate contamination
Produces:
- A readiness score
- Warnings
- Actionable recommendations
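A few of these checks can be sketched directly in pandas. The thresholds and the toy readiness formula below are assumptions; DataSmith's agent applies its own rules and, in AI mode, LLM reasoning:

from typing import Optional
import pandas as pd

def validate(df: pd.DataFrame, target: Optional[str] = None) -> dict:
    warnings = []
    if df.duplicated().mean() > 0.01:
        warnings.append("duplicate contamination")
    for col in df.select_dtypes(include="object"):
        if df[col].nunique() > 0.5 * len(df):
            warnings.append(f"high-cardinality feature: {col}")
    if target and df[target].value_counts(normalize=True).max() > 0.9:
        warnings.append("severe class imbalance")
    readiness = max(0.0, 1.0 - 0.2 * len(warnings))   # toy readiness score
    return {"readiness": readiness, "warnings": warnings}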
Composite Quality Scoring
Single "ready / not ready" scores were misleading, so we replaced them with a composite metric:
$$ \text{Overall Quality} = (\text{Missing Data} \times 0.25) + (\text{Duplicates} \times 0.15) + (\text{Outliers} \times 0.20) + (\text{Data Types} \times 0.20) + (\text{Risk Flags} \times 0.20) $$
This gives users a realistic, interpretable view of dataset health and highlights exactly where improvements are needed.
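Computing the score is a straightforward weighted sum; the sketch below assumes each component sub-score has already been normalised to the 0-100 range:

WEIGHTS = {
    "missing_data": 0.25,
    "duplicates": 0.15,
    "outliers": 0.20,
    "data_types": 0.20,
    "risk_flags": 0.20,
}

def overall_quality(scores: dict[str, float]) -> float:
    # Weighted sum of the five component scores, each on a 0-100 scale
    return sum(scores[name] * weight for name, weight in WEIGHTS.items())

print(overall_quality({
    "missing_data": 92, "duplicates": 100, "outliers": 78,
    "data_types": 85, "risk_flags": 70,
}))  # ~84.6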
Observability (Optional)
When run with --dev, DataSmith integrates Arize Phoenix to track:
- Per-agent latency
- Token usage
- LLM costs
- Decision traces
Observability is disabled by default, ensuring production performance is never impacted.
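As a rough sketch of the wiring (assuming the arize-phoenix and OpenInference LangChain instrumentation packages; DataSmith's actual setup may differ):

import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

session = px.launch_app()                # local Phoenix UI for traces
tracer_provider = register()             # OpenTelemetry provider pointed at Phoenix
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)  # trace each LangChain agent call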
Dashboard
DataSmith generates an interactive HTML dashboard showing:
- Quality score breakdowns
- Outlier distributions per column
- Agent decision summaries
- Observability metrics (dev mode)
This transforms data cleaning from a hidden preprocessing step into a transparent, explainable process.
Challenges We Faced
- Scoring inconsistency: the CLI showed a different score than the dashboard.
- Noisy observability logs
- Token costs: running LLM agents on every decision was expensive.
- Inconsistent performance across LLM providers (OpenAI, Anthropic, xAI)
What We Learned
- Clean data is a systems problem, not just a scripting task
- Agent orchestration requires strict boundaries
- Observability is essential for trust
- Structured outputs are more reliable than raw LLM text
- Composite metrics provide better insight than single scores
What’s Next
- Turn DataSmith into a production-level tool, to be published after NexHack
- Advanced statistical methods (z-score, DBSCAN, distribution fitting)
- MCP server integration for agent coordination and debugging
- PyPI release (pip install datasmith)
- REST API with FastAPI
- Expanded CLI flags (aggressive cleaning, outlier preservation)
- Fully interactive web UI
- Multi-format support (images, JSON, Parquet, time-series)
- Custom agent marketplace
- TypeScript support for web-first pipelines
Built With
- anthropic
- api
- kaggle
- langchain
- numpy
- openai
- pandas
- python
- scikit-learn
- token
- typer
- xai