Inspiration
In machine learning and data science, the most important factor is not the model; it's the data. A simple model trained on clean, well-understood data will almost always outperform a sophisticated model trained on noisy, inconsistent, or poorly prepared datasets. Despite this, data cleaning remains one of the most painful and inaccessible parts of the ML workflow.
The problem isn't a lack of motivation; it's complexity. To properly prepare data, you're expected to:
- Be fluent in tools like pandas, NumPy, and scikit-learn
- Understand statistical concepts such as outliers, distributions, and imputation strategies
- Know how to diagnose issues like leakage, cardinality explosions, or schema inconsistencies
- Make decisions that directly affect model performance, often without feedback
For many developers, students, and even experienced engineers, this becomes a bottleneck. People either:
- Skip proper validation entirely
- Rely on copy-pasted cleaning scripts
- Blindly apply transformations without understanding their impact
We wanted to change that. DataSmith was built to make high-quality data preparation easy, transparent, and accessible to anyone. Our goal was to create a developer tool that:
- Ensures users are working with good datasets
- Explains why certain cleaning decisions are made
- Can be dropped into any workflow with a single command
Whether it's a website backend, a research notebook, a production pipeline, or a small script, DataSmith is designed to be flexible for any use case while remaining simple enough for anyone to use.
What It Does
DataSmith is an agentic data cleaning pipeline that prepares datasets for machine learning using coordinated decision-making agents. With a single command:
python main.py --kaggle "ethereum-dataset" --ai \
--output clean.csv \
--instructions "Remove duplicate rows"
From that one command, DataSmith will:
- Download the dataset from Kaggle
- Profile the data (a short profiling sketch appears below):
  - Missing values
  - Duplicates
  - Outliers
  - Cardinality
  - Data types
- Coordinate specialized agents to analyze the dataset
- Generate a deterministic execution plan
- Clean the data using reproducible transformations
- Compute a composite quality score (0–100%)
- Generate an interactive dashboard
- Save both the cleaned dataset and a full pipeline audit JSON
The result is not just clean data, but understanding of and confidence in the data you are using.
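To make the profiling step concrete, here is a minimal sketch of the kind of per-column statistics DataSmith gathers with pandas. It is illustrative only; the real profiler does more (including outlier and type checks), and the file name is a placeholder:

import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    # Basic health statistics for the dataset
    return {
        "missing": df.isna().mean().round(3).to_dict(),   # fraction of missing values per column
        "duplicates": int(df.duplicated().sum()),         # number of exact duplicate rows
        "cardinality": df.nunique().to_dict(),            # distinct values per column
        "dtypes": df.dtypes.astype(str).to_dict(),        # inferred data types
    }

df = pd.read_csv("raw.csv")   # placeholder input file
print(profile(df))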
Under the hood, that CLI command calls the following function, which exposes many more parameters than the flags shown above:
run_datasmith(
input_file=input_file,
output_file=output_file,
pipeline_file=pipeline_file,
agent=agent,
rows=rows,
cols=cols,
instructions=instructions,
verbose=verbose,
download_kaggle=download_kaggle,
kaggle_dataset=kaggle_dataset,
show_dashboard=dashboard,
dev_mode=dev,
)
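For context, a Typer entry point along these lines could translate the flags from the example command into that call. The flag-to-parameter mapping below is an assumption for illustration (run_datasmith is replaced by a stub here), not DataSmith's actual wiring:

import typer

app = typer.Typer()

def run_datasmith(**kwargs):
    # Stand-in for DataSmith's real pipeline entry point (not shown in this write-up)
    print(kwargs)

@app.command()
def main(
    kaggle: str = typer.Option("", "--kaggle", help="Kaggle dataset slug to download"),
    ai: bool = typer.Option(False, "--ai", help="Enable the AI (agent) mode"),
    output: str = typer.Option("clean.csv", "--output", help="Path for the cleaned CSV"),
    instructions: str = typer.Option("", "--instructions", help="Free-form cleaning instructions"),
    dashboard: bool = typer.Option(False, "--dashboard", help="Generate the interactive dashboard"),
    dev: bool = typer.Option(False, "--dev", help="Enable observability"),
):
    # Illustrative mapping of CLI flags to pipeline parameters
    run_datasmith(
        kaggle_dataset=kaggle,
        download_kaggle=bool(kaggle),
        agent=ai,
        output_file=output,
        instructions=instructions,
        show_dashboard=dashboard,
        dev_mode=dev,
    )

if __name__ == "__main__":
    app()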
How We Built It
Core Stack
- Python — pipeline orchestration
- LangChain — multi-agent coordination
- pandas / NumPy / scikit-learn — statistical operations
- Typer — clean and composable CLI
- @dataclass — structured agent outputs
- Chart.js — interactive dashboards
DataSmith supports OpenAI, Anthropic, and xAI models with no code changes.
Agent Architecture
DataSmith uses specialized agents, each responsible for a single aspect of data quality.
Missing Value Agent
- Drop rows
- Impute values (mean, median, mode, constant)
- Flag columns as high-risk
Runs in:
- Rule-based mode (fast, deterministic)
- AI mode (context-aware reasoning)
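In rule-based mode, the decisions look roughly like this sketch. The thresholds (5% and 50%) and the median/mode choices are illustrative assumptions, not DataSmith's exact rules:

import pandas as pd

def handle_missing(df: pd.DataFrame) -> tuple[pd.DataFrame, list[str]]:
    high_risk = []
    for col in df.columns:
        frac = df[col].isna().mean()
        if frac == 0:
            continue
        if frac < 0.05:
            df = df.dropna(subset=[col])                          # few gaps: drop the affected rows
        elif frac < 0.5:
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())        # numeric: impute with the median
            else:
                df[col] = df[col].fillna(df[col].mode().iloc[0])  # categorical: impute with the mode
        else:
            high_risk.append(col)                                 # mostly missing: flag as high-risk instead
    return df, high_risk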
Outlier Agent
- Uses the IQR method (flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR)
- Detects statistically anomalous values
- Flags risk without blindly removing data
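The IQR rule itself is standard; a minimal sketch of the detection (flagging, not removing) looks like this:

import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    # True wherever the value falls outside [Q1 - k*IQR, Q3 + k*IQR]
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

prices = pd.Series([10, 11, 12, 10, 11, 400])
print(iqr_outliers(prices))   # flags 400 without dropping it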
Validation Agent
Checks for:
- Data leakage risks
- High-cardinality features
- Severe imbalance
- Schema inconsistencies
- Duplicate contamination
Produces:
- A readiness score
- Warnings
- Actionable recommendations
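A few of these checks can be sketched directly in pandas. The thresholds and the toy readiness formula below are assumptions; DataSmith's agent applies its own rules and, in AI mode, LLM reasoning:

from typing import Optional
import pandas as pd

def validate(df: pd.DataFrame, target: Optional[str] = None) -> dict:
    warnings = []
    if df.duplicated().mean() > 0.01:
        warnings.append("duplicate contamination")
    for col in df.select_dtypes(include="object"):
        if df[col].nunique() > 0.5 * len(df):
            warnings.append(f"high-cardinality feature: {col}")
    if target and df[target].value_counts(normalize=True).max() > 0.9:
        warnings.append("severe class imbalance")
    readiness = max(0.0, 1.0 - 0.2 * len(warnings))   # toy readiness score
    return {"readiness": readiness, "warnings": warnings}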
Composite Quality Scoring
Single "ready / not ready" scores were misleading, so we replaced them with a composite metric:
$$ \text{Overall Quality} = (\text{Missing Data} \times 0.25) + (\text{Duplicates} \times 0.15) + (\text{Outliers} \times 0.20) + (\text{Data Types} \times 0.20) + (\text{Risk Flags} \times 0.20) $$
This gives users a realistic, interpretable view of dataset health and highlights exactly where improvements are needed.
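Computing the score is a straightforward weighted sum; the sketch below assumes each component sub-score has already been normalised to the 0-100 range:

WEIGHTS = {
    "missing_data": 0.25,
    "duplicates": 0.15,
    "outliers": 0.20,
    "data_types": 0.20,
    "risk_flags": 0.20,
}

def overall_quality(scores: dict[str, float]) -> float:
    # Weighted sum of the five component scores, each on a 0-100 scale
    return sum(scores[name] * weight for name, weight in WEIGHTS.items())

print(overall_quality({
    "missing_data": 92, "duplicates": 100, "outliers": 78,
    "data_types": 85, "risk_flags": 70,
}))  # ~84.6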
Observability (Optional)
When run with --dev, DataSmith integrates Arize Phoenix to track:
- Per-agent latency
- Token usage
- LLM costs
- Decision traces
Observability is disabled by default, ensuring production performance is never impacted.
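As a rough sketch of the wiring (assuming the arize-phoenix and OpenInference LangChain instrumentation packages; DataSmith's actual setup may differ):

import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

session = px.launch_app()                # local Phoenix UI for traces
tracer_provider = register()             # OpenTelemetry provider pointed at Phoenix
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)  # trace each LangChain agent call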
Dashboard
DataSmith generates an interactive HTML dashboard showing:
- Quality score breakdowns
- Outlier distributions per column
- Agent decision summaries
- Observability metrics (dev mode)
This transforms data cleaning from a hidden preprocessing step into a transparent, explainable process.
Challenges We Faced
- Scoring inconsistency: the CLI showed a different score than the dashboard.
- Noisy observability logs
- Token costs: running LLM agents on every decision was expensive.
- Inconsistent performance across LLM providers (OpenAI, Anthropic, xAI)
What We Learned
- Clean data is a systems problem, not just a scripting task
- Agent orchestration requires strict boundaries
- Observability is essential for trust
- Structured outputs are more reliable than raw LLM text
- Composite metrics provide better insight than single scores
What’s Next
- Turn DataSmith into a production-level tool, to be published after NexHack
- Advanced statistical methods (z-score, DBSCAN, distribution fitting)
- MCP server integration for agent coordination and debugging
- PyPI release (pip install datasmith)
- REST API with FastAPI
- Expanded CLI flags (aggressive cleaning, outlier preservation)
- Fully interactive web UI
- Multi-format support (images, JSON, Parquet, time-series)
- Custom agent marketplace
- TypeScript support for web-first pipelines
Built With
- anthropic
- api
- kaggle
- langchain
- numpy
- openai
- pandas
- python
- scikit-learn
- token
- typer
- xai