Inspiration

In machine learning and data science, the most important factor is not the model, it’s the data. A simple model trained on clean, well-understood data will almost always outperform a sophisticated model trained on noisy, inconsistent, or poorly prepared datasets. Despite this, data cleaning remains one of the most painful and inaccessible parts of the ML workflow.

The problem isn’t a lack of motivation, it’s complexity. To properly prepare data, you’re expected to:

  • Be fluent in tools like pandas, NumPy, and scikit-learn
  • Understand statistical concepts such as outliers, distributions, and imputation strategies
  • Know how to diagnose issues like leakage, cardinality explosions, or schema inconsistencies
  • Make decisions that directly affect model performance, often without feedback

For many developers, students, and even experienced engineers, this becomes a bottleneck. People either:

  • Skip proper validation entirely
  • Rely on copy-pasted cleaning scripts
  • Blindly apply transformations without understanding their impact

We wanted to change that. DataSmith was built to make high-quality data preparation easy, transparent, and accessible to anyone. Our goal was to create a developer tool that:

  • Ensures users are working with good datasets
  • Explains why certain cleaning decisions are made
  • Can be dropped into any workflow with a single, linear command

Whether it’s a website backend, a research notebook, a production pipeline, or a small script, DataSmith is designed to be flexible for any use case while remaining simple enough for anyone to use.

What It Does

DataSmith is an Agentic-powered data cleaning pipeline that prepares datasets for machine learning using coordinated decision-making agents. With a single command:

python main.py --kaggle "ethereum-dataset" --ai \
               --output clean.csv \
               --instructions "Remove duplicate rows"

From this single command, DataSmith will:

  • Download the dataset from Kaggle
  • Profile the data: missing values, duplicates, outliers, cardinality, and data types
  • Coordinate specialized agents to analyze the dataset
  • Generate a deterministic execution plan
  • Clean the data using reproducible transformations
  • Compute a composite quality score (0–100%)
  • Generate an interactive dashboard
  • Save both the cleaned dataset and a full pipeline audit JSON

The result is not just clean data, but understanding and confidence in the data you are using.

Under the hood, the CLI command calls a single entry point, which exposes many more parameters than the flags shown above:

run_datasmith(
    input_file=input_file,
    output_file=output_file,
    pipeline_file=pipeline_file,
    agent=agent,
    rows=rows,
    cols=cols,
    instructions=instructions,
    verbose=verbose,
    download_kaggle=download_kaggle,
    kaggle_dataset=kaggle_dataset,
    show_dashboard=dashboard,
    dev_mode=dev,
)
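For programmatic use, the same entry point can be invoked directly from Python rather than through the CLI. A minimal sketch, assuming run_datasmith is importable from the main module; the import path and argument values here are illustrative, not the project’s exact API:

# Hypothetical import path; adjust to wherever run_datasmith actually lives.
from main import run_datasmith

run_datasmith(
    input_file="raw.csv",               # local CSV instead of a Kaggle download
    output_file="clean.csv",            # where the cleaned dataset is written
    pipeline_file="pipeline.json",      # full pipeline audit JSON
    agent=True,                         # assumed to toggle AI mode vs. rule-based mode
    rows=None,                          # assumed: no row sampling
    cols=None,                          # assumed: keep all columns
    instructions="Remove duplicate rows",
    verbose=True,
    download_kaggle=False,              # use the local file above
    kaggle_dataset=None,
    show_dashboard=True,
    dev_mode=False,                     # keep observability off outside --dev
)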

How We Built It

Core Stack

  • Python — pipeline orchestration
  • LangChain — multi-agent coordination
  • pandas / NumPy / scikit-learn — statistical operations
  • Typer — clean and composable CLI
  • @dataclass — structured agent outputs
  • Chart.js — interactive dashboards

DataSmith supports OpenAI, Anthropic, and xAI as LLM providers with no code changes.
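One common way to get that provider flexibility with LangChain is to build the chat model behind a small factory so agent code never references a specific vendor. A hedged sketch of that pattern; the environment variable and default model names are assumptions, not DataSmith’s actual configuration:

import os

from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_xai import ChatXAI

def make_llm(provider: str | None = None):
    """Return a chat model for the configured provider without touching agent code."""
    provider = provider or os.getenv("DATASMITH_LLM_PROVIDER", "openai")  # hypothetical env var
    if provider == "openai":
        return ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model names are illustrative
    if provider == "anthropic":
        return ChatAnthropic(model="claude-3-5-sonnet-latest", temperature=0)
    if provider == "xai":
        return ChatXAI(model="grok-2-latest", temperature=0)
    raise ValueError(f"Unknown LLM provider: {provider}")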

Agent Architecture

DataSmith uses specialized agents, each responsible for a single aspect of data quality.

Missing Value Agent

Decides, per column, whether to:

  • Drop rows
  • Impute values (mean, median, mode, constant)
  • Flag columns as high-risk

Runs in:

  • Rule-based mode (fast, deterministic)
  • AI mode (context-aware reasoning)
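In rule-based mode, the per-column decision can reduce to simple thresholds on the missing fraction and the column dtype. A rough sketch of that kind of logic; the thresholds and function name are illustrative, not DataSmith’s exact rules:

import pandas as pd

def plan_missing_value_action(series: pd.Series, drop_threshold: float = 0.5) -> str:
    """Choose an action for one column based on how much of it is missing."""
    missing_frac = series.isna().mean()
    if missing_frac == 0:
        return "keep"                    # nothing to do
    if missing_frac > drop_threshold:
        return "flag_high_risk"          # too sparse to impute safely
    if pd.api.types.is_numeric_dtype(series):
        return "impute_median"           # robust to outliers
    return "impute_mode"                 # categorical / text columns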

Outlier Agent

  • Uses the IQR method (±1.5 × IQR)
  • Detects statistically anomalous values
  • Flags risk without blindly removing data
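The ±1.5 × IQR rule translates directly into a few lines of pandas. A sketch of the bound computation; the function name is ours, not from the codebase:

import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)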

Validation Agent

Checks for:

  • Data leakage risks
  • High-cardinality features
  • Severe imbalance
  • Schema inconsistencies
  • Duplicate contamination

And reports:

  • A readiness score
  • Warnings
  • Actionable recommendations
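Because agent outputs are structured with @dataclass rather than parsed out of raw LLM text, the validation result can be a plain typed record. A hedged sketch of what such a report could look like; the field names are assumptions:

from dataclasses import dataclass, field

@dataclass
class ValidationReport:
    """Structured output of the Validation Agent (illustrative fields only)."""
    readiness_score: float                            # 0-100, higher means more ML-ready
    leakage_risks: list[str] = field(default_factory=list)
    high_cardinality_columns: list[str] = field(default_factory=list)
    imbalance_warnings: list[str] = field(default_factory=list)
    schema_issues: list[str] = field(default_factory=list)
    duplicate_fraction: float = 0.0
    recommendations: list[str] = field(default_factory=list)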

Composite Quality Scoring

Single “ready / not ready” scores were misleading. We replaced them with a composite metric:

$$ \text{Overall Quality} = 0.25 \times \text{Missing Data} + 0.15 \times \text{Duplicates} + 0.20 \times \text{Outliers} + 0.20 \times \text{Data Types} + 0.20 \times \text{Risk Flags} $$

This gives users a realistic, interpretable view of dataset health and highlights exactly where improvements are needed.
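Concretely, the composite score is a weighted average of the per-dimension sub-scores. A minimal sketch, assuming each sub-score is already normalized to 0–100; the dictionary keys are illustrative:

# Weights mirror the formula above; sub-score names are illustrative.
QUALITY_WEIGHTS = {
    "missing_data": 0.25,
    "duplicates": 0.15,
    "outliers": 0.20,
    "data_types": 0.20,
    "risk_flags": 0.20,
}

def overall_quality(subscores: dict[str, float]) -> float:
    """Combine 0-100 sub-scores into the composite quality score."""
    return sum(weight * subscores[name] for name, weight in QUALITY_WEIGHTS.items())

# Example: {"missing_data": 90, "duplicates": 100, "outliers": 70,
#           "data_types": 95, "risk_flags": 80} -> 86.5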

Observability (Optional)

When run with --dev, DataSmith integrates Arize Phoenix to track:

  • Per-agent latency
  • Token usage
  • LLM costs
  • Decision traces

Observability is disabled by default, ensuring production performance is never impacted.
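One way the --dev hook can be wired is to launch Phoenix locally and auto-instrument LangChain via OpenTelemetry. A hedged sketch using Phoenix’s launch_app entry point and the OpenInference LangChain instrumentor; whether DataSmith wires it exactly this way is an assumption:

def enable_observability() -> None:
    """Start a local Phoenix UI and trace LangChain calls; only invoked under --dev."""
    import phoenix as px
    from phoenix.otel import register
    from openinference.instrumentation.langchain import LangChainInstrumentor

    px.launch_app()                       # local Phoenix UI for traces
    tracer_provider = register()          # route OpenTelemetry spans to Phoenix
    LangChainInstrumentor().instrument(tracer_provider=tracer_provider)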

Dashboard

DataSmith generates an interactive HTML dashboard showing:

  • Quality score breakdowns
  • Outlier distributions per column
  • Agent decision summaries
  • Observability metrics (dev mode)

This transforms data cleaning from a hidden preprocessing step into a transparent, explainable process.

Challenges We Faced

  • Scoring inconsistency: the CLI showed a different score than the dashboard.
  • Noisy observability logs
  • Token costs: running LLM agents on every decision was expensive.
  • Different LLM performance across providers (OpenAI, Anthropic, xAI)

What We Learned

  • Clean data is a systems problem, not just a scripting task
  • Agent orchestration requires strict boundaries
  • Observability is essential for trust
  • Structured outputs are more reliable than raw LLM text
  • Composite metrics provide better insight than single scores

What’s Next

  • Make DataSmith a production-level tool, to be published after NexHack
  • Advanced statistical methods (z-score, DBSCAN, distribution fitting)
  • MCP server integration for agent coordination and debugging
  • PyPI release (pip install datasmith)
  • REST API with FastAPI
  • Expanded CLI flags (aggressive cleaning, outlier preservation)
  • Fully interactive web UI
  • Multi-format support (images, JSON, Parquet, time-series)
  • Custom agent marketplace
  • TypeScript support for web-first pipelines
