<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Shanding P. G on Medium]]></title>
        <description><![CDATA[Stories by Shanding P. G on Medium]]></description>
        <link>https://medium.com/@pgshanding?source=rss-83d30594ec28------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*TKN6r8Rwf6Bm4nLwbOey-g.jpeg</url>
            <title>Stories by Shanding P. G on Medium</title>
            <link>https://medium.com/@pgshanding?source=rss-83d30594ec28------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 01:52:51 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@pgshanding/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[A Practical Framework for Explaining Visuals That Actually Drive Insight: TOPT]]></title>
            <link>https://medium.com/@pgshanding/a-practical-framework-for-explaining-visuals-that-actually-drive-insight-topt-a3e754614f6f?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/a3e754614f6f</guid>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[analytics]]></category>
            <category><![CDATA[topt]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Mon, 04 May 2026 14:01:01 GMT</pubDate>
            <atom:updated>2026-05-04T14:01:01.347Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="TOPT transform data into insight" src="https://cdn-images-1.medium.com/max/871/1*0gmOZYdjaDnqpwu2EYIhMQ.png" /><figcaption>TOPT: Transform data into insight.</figcaption></figure><p>In data analytics, most people focus heavily on <strong>building dashboards, charts, and models</strong>. But there’s a quieter, more critical skill that often gets overlooked:</p><blockquote><em>Explaining what the data actually means.</em></blockquote><p>You can build the most visually appealing dashboard in Power BI or Tableau, but if you cannot clearly communicate the insight behind it, the value of your analysis drops significantly. This is where <strong>TOPT</strong> comes in.</p><p>TOPT is a <strong>highly effective structure for interpreting and communicating data visualizations</strong>. It is widely used in training environments and by practitioners who care about clarity and decision-making.</p><h3>What is TOPT?</h3><p>TOPT is a simple framework that helps you explain any data visualization in a structured, logical way. It stands for:</p><ul><li><strong>T — Title (or Key Message)</strong></li><li><strong>O — Overview</strong></li><li><strong>P — Pattern</strong></li><li><strong>T — Takeaway</strong></li></ul><p>At its core, TOPT forces you to move from <strong>“what is this chart showing?”</strong> to <strong>“why does this matter?”</strong></p><figure><img alt="TOPT step by step overview" src="https://cdn-images-1.medium.com/max/1024/1*euaozVoheq13v11K3qr2JQ.png" /><figcaption>TOPT step by step overview</figcaption></figure><h3>Why Most Analysts Struggle with Explaining Visuals</h3><p>Before diving deeper into TOPT, it’s worth understanding the common failure modes in analytics communication:</p><h3>1. Description without insight</h3><blockquote>“This chart shows sales by month…”</blockquote><blockquote><strong>That’s not analysis. That’s narration.</strong></blockquote><h3>2. Unstructured observations</h3><blockquote>“Sales increased here, dropped there, and also something happened in March…”</blockquote><blockquote><strong>This creates cognitive overload and weakens your message.</strong></blockquote><h3>3. No clear takeaway</h3><p>The audience is left wondering:</p><blockquote>“So what should I do with this?”</blockquote><h3>4. Over-reliance on visuals</h3><p>Many assume the chart “speaks for itself.” It doesn’t.</p><blockquote><strong><em>Charts support thinking — they don’t replace explanation.</em></strong></blockquote><h3>The TOPT Framework Explained</h3><p>Let’s break down each component in a way that reflects real analytical thinking.</p><h3>1. T — Title (Key Message)</h3><p>This is the <strong>headline insight</strong>.</p><blockquote>Not the chart title like: <strong><em>“Monthly Sales Data”</em></strong></blockquote><blockquote>But the actual message: <strong><em>“Sales grew consistently after Q1 due to improved conversion rates.”</em></strong></blockquote><p>This does two things:</p><ul><li>Anchors your audience immediately</li><li>Forces you to clarify your own thinking</li></ul><blockquote><em>If you can’t write the title clearly, you probably don’t understand the insight yet.</em></blockquote><h3>2. O — Overview</h3><p>Now you provide context.</p><ul><li>What data is being shown?</li><li>What variables are involved?</li><li>What time frame or segmentation exists?</li></ul><p>Example:</p><blockquote>“This chart shows monthly revenue from January to June 2026 across all product categories.”</blockquote><p>Keep this concise. 
The goal is orientation, not analysis.</p><h3>3. P — Pattern</h3><p>This is where analytical thinking becomes visible.</p><p>You identify:</p><ul><li>Trends (upward/downward)</li><li>Variations (spikes, dips)</li><li>Comparisons (categories, segments)</li><li>Anomalies (outliers, unexpected behavior)</li></ul><p>Example:</p><blockquote>“Revenue declined slightly in March but increased steadily from April through June, with the steepest growth between May and June.”</blockquote><p>At this stage, you are answering: <strong><em>What is happening in the data?</em></strong></p><h3>4. T — Takeaway</h3><p>This is the most important step.</p><p>You interpret the pattern and connect it to meaning:</p><ul><li>Why is this happening?</li><li>What does it imply?</li><li>What decision or action does it inform?</li></ul><p>Example:</p><blockquote>“The post-March growth suggests the new marketing campaign improved customer acquisition and conversion rates.”</blockquote><p>This is where you transition from <strong>analysis → insight → value</strong>.</p><h3>A Full Example Using TOPT</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/636/1*9orUdaO0_DNaFsfYjDS6hw.png" /><figcaption>Sample Chart</figcaption></figure><p>Let’s combine everything into a single narrative:</p><blockquote><strong><em>Title:</em></strong><em> Sales growth accelerated after Q1 due to improved conversion strategies</em></blockquote><blockquote><strong><em>Overview:</em></strong><em> This chart shows monthly revenue from January to June 2026</em></blockquote><blockquote><strong><em>Pattern:</em></strong><em> Revenue dipped slightly in March but increased consistently from April onward, with the highest jump in June</em></blockquote><blockquote><strong><em>Takeaway:</em></strong><em> The upward trend suggests that the new campaign launched in April significantly improved conversion rates</em></blockquote><p>Notice how structured, clear, and decision-ready this is.</p><h3>Why TOPT Works</h3><h3>1. It enforces clarity</h3><p>You cannot hide behind vague explanations.</p><h3>2. It improves stakeholder communication</h3><p>Executives and non-technical audiences don’t want raw data — they want <strong>meaning</strong>.</p><h3>3. It builds analytical discipline</h3><p>You move through a logical chain:</p><blockquote><em>Context → Observation → Interpretation → Insight</em></blockquote><h3>4. It scales across tools</h3><p>Whether you’re using:</p><ul><li>Excel</li><li>Power BI</li><li>Tableau</li><li>Python (Matplotlib/Seaborn)</li></ul><p>TOPT remains applicable.</p><h3>When to Use TOPT</h3><p>TOPT is especially useful in:</p><ul><li>Dashboard walkthroughs</li><li>Business presentations</li><li>Stakeholder reports</li><li>Data storytelling sessions</li><li>Training and teaching analytics</li></ul><p>If you are presenting <strong>any chart</strong>, you should be thinking in TOPT.</p><h3>Common Mistakes When Using TOPT</h3><h3>1. Weak or generic titles</h3><blockquote><em>“Sales Data Overview” → not useful</em></blockquote><h3>2. Skipping the takeaway</h3><p>This turns your explanation into commentary instead of insight.</p><h3>3. Mixing overview and pattern</h3><p>Keep context separate from analysis.</p><h3>4. 
Overcomplicating the explanation</h3><p>TOPT is meant to simplify, not add jargon.</p><h3>Practical Tip: Use TOPT in Dashboards</h3><p>If you’re building dashboards:</p><ul><li>Use the <strong>Title</strong> as your insight headline</li><li>Add a short <strong>Overview</strong> in tooltips or subtitles</li><li>Highlight <strong>Patterns</strong> with annotations</li><li>Include <strong>Takeaways</strong> in summary cards or notes</li></ul><p>This turns dashboards from passive visuals into <strong>decision tools</strong>.</p>
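<p>The same idea can be wired directly into a chart. Here is a minimal Matplotlib sketch (the figures and the campaign detail are illustrative, echoing the example above) that puts the key message in the title, the overview in an axis label, the pattern in an annotation, and the takeaway in a caption:</p><pre>import matplotlib.pyplot as plt<br><br>months = [&quot;Jan&quot;, &quot;Feb&quot;, &quot;Mar&quot;, &quot;Apr&quot;, &quot;May&quot;, &quot;Jun&quot;]<br>revenue = [120, 125, 118, 130, 142, 160]  # illustrative figures<br><br>fig, ax = plt.subplots()<br>ax.plot(months, revenue, marker=&quot;o&quot;)<br><br># T: the title carries the key message, not a generic label<br>ax.set_title(&quot;Sales growth accelerated after Q1&quot;)<br><br># O: a short overview fits in an axis label or subtitle<br>ax.set_ylabel(&quot;Monthly revenue, Jan to Jun 2026&quot;)<br><br># P: annotate the pattern on the chart (month index 2 = March)<br>ax.annotate(&quot;Dip in March&quot;, xy=(2, 118), xytext=(1, 135),<br>            arrowprops={&quot;arrowstyle&quot;: &quot;-&gt;&quot;})<br><br># T: the takeaway sits under the figure as a caption<br>fig.text(0.5, 0.01, &quot;April campaign likely drove the post-Q1 lift&quot;, ha=&quot;center&quot;)<br><br>plt.show()</pre><p>Rendered this way, the chart walks a reader through TOPT even when no presenter is in the room.</p><h3>Final Thoughts</h3><p>TOPT is simple, but it addresses a fundamental gap in analytics:</p><blockquote><em>The gap between </em><strong><em>seeing data</em></strong><em> and </em><strong><em>understanding it</em></strong><em>.</em></blockquote><p>In a world filled with dashboards and metrics, the real differentiator is not who can build charts — it’s who can <strong>explain them clearly and extract meaning</strong>.</p><p>If you adopt TOPT consistently, you’ll notice a shift:</p><ul><li>Your explanations become sharper</li><li>Your insights become clearer</li><li>Your impact becomes more tangible</li></ul><p>Because at the end of the day, analytics is not about charts.</p><p>It’s about <strong>decisions informed by insight</strong>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a3e754614f6f" width="1" height="1" alt="">]]></content:encoded>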
        </item>
        <item>
            <title><![CDATA[Clean Code for Career Changers: How to Write Code Professionals Are Proud Of]]></title>
            <link>https://medium.com/@pgshanding/clean-code-for-career-changers-how-to-write-code-professionals-are-proud-of-8aea221d944b?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/8aea221d944b</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[clean-code]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Thu, 08 Jan 2026 15:57:28 GMT</pubDate>
            <atom:updated>2026-01-08T15:57:28.943Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*MrGbyJRRcEViXz01LDebNA.jpeg" /></figure><p>Entering the world of software development can feel overwhelming. You learn syntax, frameworks, tools, and then someone drops a term like “clean code” on you and suddenly you’re questioning everything you just wrote.</p><blockquote>At its core, clean code isn’t about perfection; it’s about clarity and maintainability. You want code that doesn’t just work, but that the next engineer (or future you) can easily read, understand, and improve.</blockquote><h3>What Is Clean Code and Why It Matters</h3><p>Clean code refers to code that is easy to read, understand, and modify. It’s foundational to scalable software development.</p><p>Clean code becomes crucial when:</p><ul><li>You’re collaborating with other developers.</li><li>You’re onboarding onto an existing project.</li><li>You need to maintain, extend, or debug software long after it was first written.</li></ul><h3>Practical Principles of Clean Code</h3><h4>1. Use Meaningful Names</h4><p>Using descriptive names helps anyone reading the code quickly understand what each variable or function represents, reducing confusion and errors.</p><p><strong>Bad:</strong></p><pre>x = 10<br>y = 20<br>z = x + y</pre><p><strong>Good:</strong></p><pre>apples_count = 10<br>oranges_count = 20<br>total_fruits = apples_count + oranges_count</pre><h4>2. Keep Functions Short and Focused</h4><p>Small, focused functions are easier to test, maintain, and reuse. Each function should perform a single, well-defined task.</p><p><strong>Bad:</strong></p><pre>def process_order(order):<br>    # validation, discounting, and invoicing all tangled in one function<br>    if not order.items:<br>        raise ValueError(&quot;Order has no items&quot;)<br>    total = sum(item.price for item in order.items)<br>    if order.customer.is_premium:<br>        total = total * 0.9<br>    send_email(order.customer.email, f&quot;Invoice total: {total}&quot;)</pre><p><strong>Good:</strong></p><pre>def validate_order(order):<br>    # validation logic<br>    pass<br><br>def calculate_discount(order):<br>    # discount logic<br>    pass<br><br>def apply_discount(order):<br>    # apply discount logic<br>    pass<br><br>def send_invoice(order):<br>    # send invoice logic<br>    pass<br><br>def process_order(order):<br>    # orchestrate the single-purpose steps<br>    validate_order(order)<br>    calculate_discount(order)<br>    apply_discount(order)<br>    send_invoice(order)</pre><h4>3. Avoid Hard-Coded Values</h4><p>Using named constants makes your code more readable and easier to maintain, and prevents errors when values change.</p><p><strong>Bad:</strong></p><pre>discount = subtotal * 0.1</pre><p><strong>Good:</strong></p><pre>DISCOUNT_RATE = 0.1<br>discount = subtotal * DISCOUNT_RATE</pre><h4>4. Write Comments That Add Value</h4><p>Comments should explain the reasoning behind the code, not repeat what the code already shows. This helps future readers understand the “why”.</p><p><strong>Bad:</strong></p><pre># increment x by 1<br>x += 1</pre><p><strong>Good:</strong></p><pre># increment score for correct answer<br>score += 1</pre><h4>5. Follow Established Style Guidelines</h4><p>Consistent formatting improves readability and collaboration across teams, making it easier to maintain a shared codebase.</p><p><strong>Bad:</strong></p><pre>function myFunc(){console.log(&quot;hello&quot;)}</pre><p><strong>Good:</strong></p><pre>function myFunc() {<br>    console.log(&quot;hello&quot;);<br>}</pre>
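<p>Style guides are easiest to follow when a tool enforces them for you. As a quick illustration (assuming a Python codebase; the filename here is hypothetical), an auto-formatter such as Black applies a consistent style automatically:</p><pre>pip install black<br>black order_processing.py  # rewrites the file with consistent formatting</pre><p>Formatters take style debates out of code review entirely, which is exactly what a shared codebase needs.</p><h4>6. 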
Refactor Continuously</h4><p>Regularly revisiting and improving code reduces complexity and prevents technical debt, keeping the codebase healthy over time.</p><p><strong>Bad:</strong></p><pre>if user_type == &#39;admin&#39;:<br>    permissions = [&#39;read&#39;,&#39;write&#39;,&#39;delete&#39;]<br>elif user_type == &#39;editor&#39;:<br>    permissions = [&#39;read&#39;,&#39;write&#39;]<br>elif user_type == &#39;viewer&#39;:<br>    permissions = [&#39;read&#39;]</pre><p><strong>Good:</strong></p><pre>USER_PERMISSIONS = {<br>    &#39;admin&#39;: [&#39;read&#39;,&#39;write&#39;,&#39;delete&#39;],<br>    &#39;editor&#39;: [&#39;read&#39;,&#39;write&#39;],<br>    &#39;viewer&#39;: [&#39;read&#39;]<br>}<br>permissions = USER_PERMISSIONS.get(user_type, [])</pre><h3>How AI Can Help You Write Cleaner Code</h3><p><strong>Code Suggestions and Autocomplete:</strong> AI tools like GitHub Copilot suggest idiomatic code as you type.</p><p><strong>AI-Assisted Code Review and Quality Checks:</strong> Platforms can automatically detect duplication, complexity, or style violations.</p><p><strong>Automated Test Generation:</strong> AI can generate unit tests that enforce good code structure.</p><p><strong>Real-Time Coding Feedback:</strong> AI-augmented IDEs provide instant suggestions to improve readability and maintainability.</p><p><strong>Balancing AI with Human Judgment:</strong> Always review AI-generated code to ensure correctness, security, and alignment with project goals.</p><p>Clean code is a hallmark of professionalism. For anyone upskilling or pivoting into tech, mastering these practices early positions you as a reliable, thoughtful developer. Pair these principles with AI tools strategically, and you’ll not only write code that works — you’ll write code that’s good.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8aea221d944b" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Dashboardification of the Data Analyst]]></title>
            <link>https://medium.com/@pgshanding/the-dashboardification-of-the-data-analyst-28e282db601d?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/28e282db601d</guid>
            <category><![CDATA[dashboard]]></category>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[business-intelligence]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Fri, 05 Dec 2025 12:28:56 GMT</pubDate>
            <atom:updated>2025-12-05T18:20:49.845Z</atom:updated>
            <content:encoded><![CDATA[<p>A reflection on how analytics drifted from rigorous inquiry to dashboard assembly lines.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/680/0*QXR0CZkECe2GRs6J" /></figure><p>Over the past decade, data analysis has undergone an identity crisis — one shaped by bootcamps, social media hype, and the mass migration of people “breaking into tech.” Somewhere between the endless carousel of <em>“Top 10 Power BI Interview Questions”</em> and the explosion of two-week certificate courses, the data analyst morphed from an investigator, a thinker, a translator of ambiguity… into a dashboard factory.</p><p>It didn’t happen overnight. But today, the profession sits in a strange place where many analysts are celebrated not for the quality of their thinking, but for how visually pleasing their business intelligence tool of choice can make a bar chart look.</p><p>And so we arrive at the <strong>Dashboardification Era</strong>.</p><h3>When Everyone Became a Data Analyst</h3><p>The promise was seductive:</p><p><em>“Learn SQL, Excel, and Power BI. Earn in dollars. Work from anywhere.”</em></p><p>Influencers and training programs told a generation that data analysis was the new gold rush — low barrier to entry, high pay, and a straightforward skillset. And in some ways, they weren’t wrong. Tools did become easier. Visualization platforms became drag-and-drop. Cloud analytics made storage trivial. Even Python and SQL felt less intimidating as the ecosystem matured.</p><p>But with that wave came something else: <strong>mass saturation</strong>.</p><p>Suddenly, thousands of aspiring analysts flooded the market, each armed with the same dashboards, the same resume template, the same “projects” built from the same publicly available datasets. In many portfolios, one could barely find a problem statement or a question worth answering — just charts wrapped in vibrant color palettes.</p><p><strong><em>The field was no longer about thinking. It was about tool usage.</em></strong></p><h3>The Erosion of First Principles</h3><p>Before the boom, data analysis was fundamentally about <strong>reasoning</strong>:</p><ul><li>What question are we answering?</li><li>Why does the data look like this?</li><li>What assumptions underlie these numbers?</li><li>What does the business need to understand?</li><li>What is the story behind the anomaly?</li></ul><p>Today, many new analysts can build a dashboard but cannot explain whether an observed trend is statistically significant, or whether a spike represents signal or noise. Some can join tables but cannot articulate <em>why</em> the tables should be joined that way. Many can calculate KPIs but cannot describe whether those KPIs actually matter to the business.</p><p>It’s not their fault — not entirely.<br>The industry incentivized speed over depth, aesthetics over accuracy, and tool proficiency over conceptual mastery.</p><p>In hiring pipelines, recruiters ask:</p><blockquote>“Can they use Power BI?” <br>instead of <br>“Can they think?”</blockquote><p>The result is predictable:<br><strong><em>A generation of analysts optimized to build dashboards but not to build insight.</em></strong></p><h3>BI Tools as a Crutch, Not a Craft</h3><p>Power BI, Tableau, Looker, Metabase — these tools are incredible. They democratized analytics. They empowered teams. 
They reduced friction.</p><p>But somewhere along the way, the profession mistook the tool for the work.</p><p>We forgot:</p><ul><li>A dashboard is not an analysis.</li><li>A chart is not a conclusion.</li><li>A metric is not an insight.</li><li>A beautiful report is not a business impact.</li></ul><p>In the worst cases, teams become addicted to <strong>“dashboard theatre”</strong> — the production of visually stimulating but strategically empty artifacts that create the illusion of intelligence.</p><blockquote>Executives say, “Show me the dashboard,”<br>when they should be saying,<br>“Help me understand what matters.”</blockquote><h3>A Saturated Field Creates Shallow Incentives</h3><p>With so many people trying to enter the profession, the path of least resistance has become the default:</p><ul><li>Learn a few DAX functions</li><li>Build a portfolio with three dashboards</li><li>Upload to GitHub</li><li>Share on LinkedIn with the caption “My first Power BI project!”</li><li>Repeat</li></ul><p>But this approach strips the soul from the craft.<br>It replaces curiosity with templates.<br>It reduces analysis to decoration.</p><p>And worst of all, it creates the illusion that data work is easy — that it begins and ends with a dashboard.</p><h3>What We Lost Along the Way</h3><p>The dashboardification of analytics has quietly eroded:</p><h4>1. Problem-solving discipline</h4><p>Asking “why” five times. Challenging assumptions. Understanding causality. Testing hypotheses.</p><h4>2. Statistical thinking</h4><p>Sample bias, confidence intervals, distributions, outliers — too often ignored.</p><h4>3. Business acumen</h4><p>Knowing what decisions matter. Understanding incentives. Connecting insights to actions.</p><h4>4. Narrative clarity</h4><p>Data storytelling beyond color gradients — true communication, not decoration.</p><h4>5. Intellectual craftsmanship</h4><p>Treating analysis as an act of thinking, not an act of clicking.</p><h3>But It’s Not All Doom</h3><p>The field is not dead — it’s simply noisy.</p><p>True analysts still exist.<br>They are the ones who:</p><ul><li>Start with a question, not a dashboard.</li><li>Refuse to visualize until they understand.</li><li>Can defend every assumption in their SQL query.</li><li>See data not as pixels, but as decisions.</li><li>Know when NOT to build a dashboard.</li></ul><p>These analysts are not threatened by the saturation.<br>They rise above it.</p><h3>The Way Back: A Rebellious Proposal</h3><p>If the profession wants to reclaim its identity, we need a new ethos:</p><h4>1. Less tooling, more thinking</h4><blockquote>The tool should serve the mind, not replace it.</blockquote><h4>2. Prioritize problem statements, not “projects”</h4><p>We don’t need more dashboards; we need more questions.</p><h4>3. Make dashboards the end, not the beginning</h4><p>A dashboard is a <em>delivery mechanism,</em> not the analysis itself.</p><h4>4. Celebrate insight, not aesthetics</h4><p>The most beautiful chart is the one that changes a decision.</p><h3>The Real Call to Action</h3><p>This is not a critique of “new analysts.”<br>It is a critique of what the industry — intentionally or not — told them the profession should be.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=28e282db601d" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Compliance Wake-Up Call for Data-Driven Organizations]]></title>
            <link>https://medium.com/@pgshanding/the-compliance-wake-up-call-for-data-driven-organizations-a1865fa70691?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/a1865fa70691</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[gdpr]]></category>
            <category><![CDATA[data-protection]]></category>
            <category><![CDATA[data-governance]]></category>
            <category><![CDATA[compliance]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Mon, 20 Oct 2025 09:22:12 GMT</pubDate>
            <atom:updated>2025-10-20T09:22:12.588Z</atom:updated>
            <content:encoded><![CDATA[<p>Why Data Governance Must Become Part of Your Business DNA</p><blockquote>“In today’s world, data governance isn’t optional — it’s survival.”</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SVo6QZ9R2-dm-ArO.png" /></figure><p>For a long time, data governance sat quietly in the background — a technical chore for IT teams, filed somewhere between system backups and password policies. But the world has changed.</p><p>Today, every click, campaign, and customer conversation is powered by data. And when data drives everything, governance can’t live in a corner anymore.</p><p>Welcome to the new reality — where compliance is a boardroom conversation, and data responsibility is everyone’s job.</p><p>Regulations like GDPR (Europe), CCPA (California), POPIA (South Africa), and NDPR (Nigeria) have redrawn the lines of accountability. A single mistake — a misplaced file, an unprotected record, an unauthorized access — can lead to crippling fines, reputational damage, or loss of public trust.</p><p>The message is clear:</p><blockquote>“Data governance is no longer about where data lives — it’s about how it’s used, shared, and protected by everyone in the organization.”</blockquote><p>From HR managing employee information to marketing teams running personalized campaigns, every department now handles sensitive data. And that means every department shares liability.</p><p>The organizations that lead in this era are the ones that don’t wait for compliance deadlines — they build governance into their DNA, treating it as a strategic advantage, not a bureaucratic burden.</p><h3>Why Governance Frameworks Fail — And What to Do About It</h3><p><em>It’s easy to design a policy. It’s hard to make people live by it.</em></p><p>Most governance frameworks fail not because they’re poorly written, but because they’re poorly adopted. Employees see them as obstacles, not enablers — as red tape, not responsibility.</p><p>And when culture resists, compliance collapses.</p><blockquote>“A governance framework without cultural buy-in is like a security system everyone ignores.”</blockquote><p>Technology can help, but it can’t fix culture. Yes, tools like eDiscovery, data classification platforms, and cloud compliance monitors can automatically detect and secure sensitive information. But they can’t replace human accountability.</p><p>That’s why the most compliant organizations don’t start with software — they start with people.</p><p>They:<br>1. Train teams to understand the “why,” not just the “what.”<br>2. Communicate clearly from leadership down to every desk.<br>3. Reward responsible data behavior the same way they reward innovation.</p><p>When employees see that governance <strong>protects their work, reputation, and the company’s integrity</strong>, it stops being a box to tick — it becomes <strong>part of the culture</strong>.</p><h3>Making Governance Everyone’s Responsibility</h3><p>In too many companies, governance is treated like a legal problem or a technical checklist. But data doesn’t care about your org chart — it flows across systems, teams, and continents.</p><p>That means governance must, too.</p><p>The smartest organizations are tearing down silos by appointing <strong>data stewards — </strong>champions within departments who act as local custodians of good data practices. 
These stewards bridge the gap between <strong>policy</strong> and <strong>execution</strong>, making governance visible in day-to-day workflows.</p><p>Meanwhile, cross-functional collaboration is non-negotiable.</p><p>When <strong>marketing aligns with IT</strong>, <strong>HR collaborates with compliance</strong>, and <strong>leadership sets the tone</strong>, governance stops being an afterthought and becomes an <strong>operational standard</strong>.</p><blockquote><em>“You can’t govern what you can’t see. Visibility across teams is the foundation of accountability.”</em></blockquote><p>Unified dashboards, clear definitions, and transparent ownership turn governance from a concept into a living, breathing process.</p><h3>Governance, Trust, and the NDPR Advantage</h3><p>Let’s be clear: compliance isn’t just about avoiding fines — it’s about <strong>earning trust</strong>.</p><p>Customers and regulators are becoming more vigilant, and in Nigeria, the <strong>NDPR (Nigeria Data Protection Regulation)</strong> has raised the bar. The NDPR isn’t a local inconvenience — it’s a <strong>signal</strong> that Nigeria’s digital economy is maturing, demanding accountability from every data-driven organization.</p><p>Companies that proactively implement NDPR-aligned governance aren’t just protecting themselves — they’re positioning for <strong>global competitiveness</strong>.</p><blockquote><em>“Trust is the new currency. And data governance is how you mint it.”</em></blockquote><p>When you handle data ethically and transparently, you don’t just comply — you <strong>differentiate</strong>.</p><h3>From Policy to Power: The Leadership Imperative</h3><p>Data governance has evolved from a technical necessity into a <strong>strategic weapon</strong>.</p><p>Boards are realizing that <strong>poor data management isn’t an IT risk — it’s a business risk</strong>. Data breaches destroy trust. Misuse of personal information drives customers away. And regulatory fines can cripple growth.</p><p>Strong governance, on the other hand, builds <strong>resilience</strong>, <strong>efficiency</strong>, and <strong>credibility</strong>.</p><blockquote><em>“Good governance doesn’t slow innovation — it accelerates it by removing uncertainty.”</em></blockquote><p>Nigeria’s business landscape is digital, connected, and accelerating fast. Leaders who ignore governance are gambling with their organization’s future. Those who embrace it will lead industries — not react to them.</p><h3>Final Thought</h3><p>Compliance isn’t paperwork.<br><strong>It’s protection.<br>It’s reputation.<br>It’s leadership.</strong></p><p>Whether you’re a <strong>CTO building systems</strong>, a <strong>data analyst managing insights</strong>, or a <strong>CEO shaping strategy</strong>, the message is the same:<br><strong>Governance is everyone’s job.</strong></p><p>When organizations embed trust, transparency, and accountability into how they handle data, they don’t just meet regulations — they <strong>set the standard</strong>.</p><blockquote><em>“In the data economy, trust is your greatest asset — and governance is how you earn it.”</em></blockquote><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a1865fa70691" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building Trust in Your Data: A Practical Guide to Data Quality Tests]]></title>
            <link>https://python.plainenglish.io/building-trust-in-your-data-a-practical-guide-to-data-quality-tests-a326eef3bb95?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/a326eef3bb95</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[great-expectations]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-quality]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Mon, 06 Oct 2025 15:15:05 GMT</pubDate>
            <atom:updated>2025-10-15T14:55:01.803Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*R5rh2hMljPLTpCUI.png" /></figure><p>In today’s data-driven world, businesses are only as good as the data they rely on. But here’s the hard truth: not all data can be trusted. Bad data sneaks in quietly, leading to poor insights, flawed decisions, and expensive mistakes.</p><p>That’s why data quality testing is no longer optional — it’s a necessity. If you’re just getting started, this guide walks you through the most important types of data quality tests, why they matter, and — most importantly — how to implement them in Python using Great Expectations.</p><h3>Why Data Quality Testing Matters</h3><p>Imagine building a house on sand. That’s what it’s like to make strategic decisions without checking the reliability of your data.</p><p>A good data quality process ensures:</p><ul><li><strong>Confidence</strong> — You know your data is accurate, complete, and usable.</li><li><strong>Efficiency</strong> — Your team spends less time firefighting data issues.</li><li><strong>Better decisions</strong> — Analytics, AI models, and dashboards give real insight, not noise.</li></ul><h3>The Pitfalls of Anomaly Detection</h3><blockquote>Anomaly detection is often the first step teams take. It’s appealing because it’s automated, standardized, and easy to apply across lots of data. But here’s the catch: anomaly detection is low information<strong>.</strong></blockquote><p><strong>What anomaly detection can tell you</strong></p><ul><li>When and where something unusual happened</li><li>The characteristics of the anomaly</li></ul><p><strong>What it <em>can’t</em> tell you</strong></p><ul><li>The business significance of the anomaly</li><li>Whether it’s urgent</li><li>Which stakeholders are affected</li></ul><p>This often leads to alert fatigue — hundreds of signals but little actionable insight.</p><p><strong>Use anomaly detection sparingly. It’s best for</strong></p><ul><li>Monitoring <strong>new or unfamiliar data sources</strong></li><li>Establishing a <strong>baseline</strong> when you lack stakeholder knowledge</li></ul><p>Over time, replace anomaly detection with <strong>specific, high-information tests</strong>.</p><h3>The Basics: Four Practical Dimensions of Data Quality</h3><p>When starting out, focus on these <strong>four core test dimensions</strong>. They cover the most common data issues and give you strong confidence in your pipelines.</p><h4>1. 
Missingness</h4><p><strong>Goal:</strong> Ensure data isn’t unexpectedly missing — or appearing when it shouldn’t.</p><p><strong>Example:</strong> In a customer dataset, email should never be null.</p><pre>import great_expectations as gx<br>import pandas as pd<br><br># Sample dataset<br>data = pd.DataFrame({<br>    &quot;customer_id&quot;: [1, 2, 3, 4],<br>    &quot;email&quot;: [&quot;a@test.com&quot;, &quot;b@test.com&quot;, None, &quot;d@test.com&quot;]<br>})<br><br># Create a GE context from dataframe<br># (method names vary across GX releases; this follows the 0.18 fluent API)<br>context = gx.get_context()<br>datasource = context.sources.add_pandas(&quot;my_datasource&quot;)<br>data_asset = datasource.add_dataframe_asset(&quot;customers&quot;)<br>batch_request = data_asset.build_batch_request(dataframe=data)<br><br># Create Expectation Suite<br>context.add_expectation_suite(&quot;missingness_suite&quot;)<br><br># Null check: email should not be null<br>validator = context.get_validator(<br>    batch_request=batch_request,<br>    expectation_suite_name=&quot;missingness_suite&quot;<br>)<br>validator.expect_column_values_to_not_be_null(&quot;email&quot;)<br><br># Run and view results<br>results = validator.validate()<br>print(results)</pre><p>In the example above, the test will flag the missing email value in row 3.</p><h4>2. Schema</h4><p><strong>Goal:</strong> Verify that expected columns are present, in the right order, and of the right type.</p><p><strong>Example:</strong> You expect customer_id (int) and email (string) to exist.</p><pre># Expect specific schema<br>validator.expect_table_columns_to_match_set(<br>    column_set=[&quot;customer_id&quot;, &quot;email&quot;]<br>)<br><br># Expect correct data types<br>validator.expect_column_values_to_be_of_type(&quot;customer_id&quot;, &quot;int64&quot;)<br>validator.expect_column_values_to_be_of_type(&quot;email&quot;, &quot;object&quot;)</pre><h4>3. Volume</h4><p><strong>Goal:</strong> Check that the number of rows is within expected bounds.</p><p><strong>Example:</strong> Yesterday’s transactions table usually has 10k–15k rows.</p><pre># Expect row count to be within range<br>validator.expect_table_row_count_to_be_between(min_value=10000, max_value=15000)</pre><h4>4. Ranges</h4><p><strong>Goal:</strong> Ensure numbers and dates fall within acceptable ranges.</p><p><strong>Example:</strong> order_amount should always be &gt; 0</p><pre># Expect order_amount column values to be strictly positive<br>validator.expect_column_values_to_be_between(&quot;order_amount&quot;, min_value=0, strict_min=True)</pre><h3>Taking It to the Next Level</h3><p>Once the basics are in place, you can implement <strong>advanced tests</strong>. These go beyond simple presence and ranges to ensure your data is <strong>valid, unique, and relationally consistent</strong>.</p><h4>Validity Tests</h4><p><strong>Goal:</strong> Check that values are plausible, not just present. 
Validity ensures that the data falls within logical or business-defined sets, ranges, or patterns.</p><p><strong>Example:</strong> status must always be one of {Pending, Active, Closed}.</p><pre># Expect status column values to belong to a fixed set<br>validator.expect_column_values_to_be_in_set(<br>    &quot;status&quot;, [&quot;Pending&quot;, &quot;Active&quot;, &quot;Closed&quot;]<br>)</pre><p>You can also apply <strong>numeric validity</strong> checks (e.g., salaries, dates) or <strong>pattern validity</strong> (e.g., emails match regex).</p><h4>Uniqueness Tests</h4><p><strong>Goal:</strong> Ensure that values that should be unique actually are unique (e.g., primary keys), and that values expected to be diverse aren’t unexpectedly duplicated.</p><p><strong>Example:</strong> Each customer_id must be unique.</p><pre># Expect customer_id column values to be unique<br>validator.expect_column_values_to_be_unique(&quot;customer_id&quot;)</pre><p>This prevents duplicate records from creeping into your dataset.</p><h4>Referential Integrity Tests</h4><p><strong>Goal:</strong> Validate relationships between columns or tables. Referential integrity ensures that related data is properly connected.</p><p><strong>Example 1:</strong> Cross-column integrity → A day=31 should never pair with month=February.</p><pre># Check cross-column integrity: only valid (month, day) pairs pass<br># (illustrative; extend the set with the valid pairs for the other months)<br>validator.expect_column_pair_values_to_be_in_set(<br>    &quot;month&quot;, &quot;day&quot;,<br>    value_pairs_set=[(&quot;February&quot;, day) for day in range(1, 29)]<br>)</pre><p><strong>Example 2:</strong> Cross-table integrity → Every customer_id in the orders table must exist in the customers table.</p><pre># Check referential integrity between orders and customers<br># (GX core has no direct foreign-key expectation for pandas, so<br># validate membership against the parent table&#39;s keys; customers_df<br># here stands for the customers table loaded as a DataFrame)<br>validator.expect_column_values_to_be_in_set(<br>    &quot;customer_id&quot;, customers_df[&quot;customer_id&quot;].tolist()<br>)</pre><p>This ensures that your <strong>foreign keys</strong> and relationships remain consistent.</p><h3>Final Thoughts</h3><blockquote>Data quality isn’t about perfection — it’s about trust.</blockquote><p>Start small with the basics: <strong>missingness, schema, volume, and ranges.</strong> Then iterate with more advanced tests as your pipelines mature.</p><p>By replacing vague anomaly alerts with <strong>explicit, expressive tests</strong>, you’ll create a system that not only detects problems but also tells you exactly what action to take.</p><p>The result?</p><ul><li>Fewer surprises</li><li>Faster fixes</li><li>And ultimately, <strong>data you can rely on</strong></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/210/0*FGJ_kDIR_4zNv3cZ" /><figcaption>Great Expectations</figcaption></figure><p><strong>Want to dive deeper?</strong> Check out <a href="https://greatexpectations.io">Great Expectations</a>, the open-source framework powering these examples.</p>
<img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a326eef3bb95" width="1" height="1" alt=""><hr><p><a href="https://python.plainenglish.io/building-trust-in-your-data-a-practical-guide-to-data-quality-tests-a326eef3bb95">Building Trust in Your Data: A Practical Guide to Data Quality Tests</a> was originally published in <a href="https://python.plainenglish.io">Python in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a Free & Open-Source Starter Toolkit for Data Scientists]]></title>
            <link>https://medium.com/@pgshanding/building-a-free-open-source-starter-toolkit-for-data-scientists-b317985e8ee0?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/b317985e8ee0</guid>
            <category><![CDATA[api]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Mon, 06 Oct 2025 15:01:56 GMT</pubDate>
            <atom:updated>2025-10-06T15:01:56.260Z</atom:updated>
            <content:encoded><![CDATA[<h3>The Practical Data Scientist’s Free and Open-Source Toolkit</h3><p>If you’re a data scientist or an aspiring one, you know the challenge: juggling datasets, cleaning pipelines, running experiments, sharing results, and monitoring models in production. Paid tools can help, but what if you could build a complete workflow with free and open-source software?</p><p>In this post, we’ll walk through a practical starter stack that covers the entire lifecycle — from data management to monitoring your APIs. The best part? All of these tools are free, open source, and production-ready.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Oblnwqc8KTo5isJCRJdbig.png" /><figcaption>Data Science workflow with tools</figcaption></figure><blockquote>You don’t need enterprise subscriptions to achieve professional‑grade workflows. The combination of api‑analytics, MLflow, Streamlit, and the others forms a powerful, modular stack you can customize for your own projects.</blockquote><h4><strong>High-Level Workflow Diagram</strong></h4><p>This flow shows how the tools complement each other: versioning, validation, experiments, sharing, and monitoring.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SPCgglNPW1bc0-mpd7r42w.png" /></figure><h4>1. Monitoring API Usage with api-analytics</h4><p><a href="https://www.apianalytics.dev/">api-analytics</a> is a lightweight open-source tool for monitoring your APIs. It tracks request counts, response times, and error rates.</p><p>Setting up api-analytics is quick and straightforward. Below is a sample code snippet that shows how to set up api-analytics for a Flask API. It also supports Django, FastAPI, and Tornado, and client libraries are available for other languages such as JavaScript and Go.</p><pre>from flask import Flask, jsonify<br>from api_analytics.flask import add_middleware<br><br>app = Flask(__name__)<br>add_middleware(app, &lt;API-KEY&gt;)  # Add middleware with your API key<br><br>@app.get(&#39;/&#39;)<br>def root():<br>    return jsonify(<br>        {&#39;message&#39;: &#39;Hello, World!&#39;}<br>    )<br><br>@app.get(&quot;/predict&quot;)<br>def predict():<br>    return jsonify(<br>        {&quot;prediction&quot;: 42}<br>    )<br><br>if __name__ == &quot;__main__&quot;:<br>    app.run(debug=True)</pre><h4>2. Tracking Experiments with MLflow + DVC</h4><blockquote>Managing experiments can get messy. MLflow keeps track of runs, parameters, and results, while DVC handles dataset and model versioning.</blockquote><p><a href="https://mlflow.org/"><strong>MLflow</strong></a></p><p>MLflow is an open-source platform, purpose-built to assist machine learning practitioners and teams in handling the complexities of the machine learning process. 
MLflow focuses on the full lifecycle for machine learning projects, ensuring that each phase is manageable, traceable, and reproducible.</p><pre>import mlflow<br>import mlflow.sklearn<br>from sklearn.linear_model import LogisticRegression<br>from sklearn.datasets import load_iris<br>from sklearn.model_selection import train_test_split<br><br>X, y = load_iris(return_X_y=True)<br>X_train, X_test, y_train, y_test = train_test_split(X, y)<br><br># Track params, metrics, and the model inside a run<br>with mlflow.start_run():<br>    model = LogisticRegression(max_iter=200)<br>    model.fit(X_train, y_train)<br><br>    acc = model.score(X_test, y_test)<br>    mlflow.log_param(&quot;max_iter&quot;, 200)<br>    mlflow.log_metric(&quot;accuracy&quot;, acc)<br>    mlflow.sklearn.log_model(model, &quot;logreg_model&quot;)</pre><p><a href="https://dvc.org/"><strong>DVC</strong></a></p><p>Data Version Control (DVC) lets you capture the versions of your data and models in Git commits, while storing them on-premises or in cloud storage. It also provides a mechanism to switch between these different data contents. The result is a single history for data, code, and ML models that you can traverse — a proper journal of your work!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*hx8wjo_H6XLDO3ay.png" /><figcaption><em>DVC matches the right versions of data, code, and models for you 💘</em></figcaption></figure><pre># Initialize DVC in your project<br>dvc init<br><br># Track a dataset<br>dvc add data/raw_dataset.csv<br>git add data/raw_dataset.csv.dvc .gitignore<br>git commit -m &quot;Track raw dataset with DVC&quot;</pre><p>This way, you can reproduce any experiment with exact data + parameters.</p><h4>3. Ensuring Data Quality with Great Expectations</h4><p>Great Expectations (GE) is an open-source Python library designed for data validation, profiling, and quality control. It allows you to define, test, and document your data expectations in a structured and automated way. It integrates seamlessly with popular data tools like Pandas, SQL databases, and Spark.</p><pre># Legacy PandasDataset API; newer GX releases use a context/validator workflow<br>from great_expectations.dataset import PandasDataset<br>import pandas as pd<br><br># Load data<br>df = pd.read_csv(&quot;data/raw_dataset.csv&quot;)<br>validated_df = PandasDataset(df)<br><br># Add expectations<br>validated_df.expect_column_values_to_not_be_null(&quot;age&quot;)<br>validated_df.expect_column_values_to_be_between(&quot;age&quot;, 18, 90)<br>results = validated_df.validate()<br>print(results)</pre><h4>4. Sharing Results with Streamlit</h4><p><a href="https://streamlit.io/"><strong>Streamlit </strong></a>is an open-source framework designed to create and share beautiful web applications for data science and machine learning projects. It is specifically built for Python, making it easy for data scientists and machine learning engineers to deploy their models and visualize data without needing extensive knowledge of web development. Streamlit makes it easy to turn scripts into apps.</p><pre>import streamlit as st<br>import pandas as pd<br>import joblib<br><br># Load the trained model (saved locally; see the note below)<br>model = joblib.load(&quot;logreg_model.pkl&quot;)<br><br>st.title(&quot;Iris Classifier&quot;)<br>sepal_length = st.slider(&quot;Sepal Length&quot;, 4.0, 8.0)<br>sepal_width = st.slider(&quot;Sepal Width&quot;, 2.0, 5.0)<br><br># Petal values are fixed for this demo; the model expects four features<br>X_new = pd.DataFrame([[sepal_length, sepal_width, 3.5, 1.4]])<br>prediction = model.predict(X_new)<br><br>st.write(f&quot;Prediction: {prediction[0]}&quot;)<br></pre><pre>streamlit run app.py</pre>
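<p>One gap worth closing: the Streamlit app loads logreg_model.pkl from disk, but the MLflow snippet above only logged the model to the tracking store. A minimal bridge, assuming you append it to the end of the training script, is to also save the fitted model locally:</p><pre>import joblib<br><br># Save the fitted model next to app.py so Streamlit can load it<br>joblib.dump(model, &quot;logreg_model.pkl&quot;)</pre><h4>5. 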
BI Dashboards with Metabase or Superset</h4><ul><li><a href="https://www.metabase.com/"><strong>Metabase </strong></a>→ Metabase is an open-source business intelligence (BI) tool that allows users to create interactive dashboards, visualize data, and analyze insights without requiring advanced technical skills. Dashboards in Metabase aggregate multiple data visualizations, metrics, and interactive elements into a single interface, enabling users to monitor and analyze key performance indicators (KPIs) and trends effectively.</li><li><a href="https://superset.apache.org/"><strong>Superset </strong></a>→ Apache Superset is an open-source data exploration and visualization platform designed for creating interactive dashboards and analyzing data efficiently. It is a modern alternative to proprietary business intelligence tools, offering a wide range of features for users of all skill levels.</li></ul><p>Both connect to databases and let you build dashboards without coding.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6b8VQ25QhTbaDNu3QC_WjQ.png" /></figure><h4>6. Synthetic Data &amp; Annotation with Faker + Label Studio</h4><p>Sometimes you need test data or labeled datasets.</p><p><a href="https://faker.readthedocs.io/en/master/"><strong>Faker</strong></a></p><p>Faker is a Python package that generates fake data for you. Whether you need to bootstrap your database, create good-looking XML documents, fill-in your persistence to stress test it, or anonymize data taken from a production service, Faker is for you.</p><pre>from faker import Faker<br><br>fake = Faker()<br>for _ in range(5):<br>    print(<br>        {<br>            &quot;name&quot;: fake.name(),<br>            &quot;email&quot;: fake.email(),<br>            &quot;transaction&quot;: fake.random_number(digits=6)<br>        }<br>    )</pre><p><a href="https://labelstud.io/"><strong>Label Studio</strong></a></p><p>Label Studio is an open source data labeling tool. It lets you label data types like audio, text, images, videos, and time series with a simple and straightforward UI and export to various model formats.</p><pre>pip install label-studio<br>label-studio start</pre><p>This launches a UI for annotating images, text, or tabular data.</p><h4>Putting It All Together</h4><p>With this toolkit:</p><ul><li><strong>api-analytics</strong> → Monitor your API usage</li><li><strong>MLflow + DVC</strong> → Track &amp; reproduce experiments</li><li><strong>Great Expectations</strong> → Validate data quality</li><li><strong>Streamlit</strong> → Build interactive apps</li><li><strong>Metabase / Superset</strong> → Share dashboards</li><li><strong>Faker + Label Studio</strong> → Generate &amp; annotate datasets</li></ul><h4>End-to-End Workflow Visualization</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Bt_XlsQ70l-kKEDtzOFfAw.png" /></figure><p>We have explored the entire data science journey — from sourcing and validating data to experimenting, visualizing, and monitoring real-world deployments. Every stage can be handled with free, open‑source tools that scale with your growth and creativity.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b317985e8ee0" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Practical Guide to Building a Winning Data Strategy]]></title>
            <link>https://medium.com/@pgshanding/a-practical-guide-to-building-a-winning-data-strategy-c02180021573?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/c02180021573</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[data-governance]]></category>
            <category><![CDATA[data-analytics]]></category>
            <category><![CDATA[data-strategy]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Sat, 06 Sep 2025 11:01:32 GMT</pubDate>
            <atom:updated>2025-09-06T11:01:32.560Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rBy9Wh3ll4xtVcwNBXN_vQ.png" /></figure><p>In today’s data-driven world, organizations are collecting more data than ever before. But having data is not the same as using it effectively. To turn raw information into a strategic asset, businesses need a well-crafted data strategy — a roadmap that guides how data is collected, stored, managed, governed, and leveraged to achieve business goals.</p><h3>What is a Data Strategy?</h3><blockquote>A data strategy is a comprehensive roadmap that defines how an organization will collect, store, manage, govern, and leverage its data assets to achieve specific business goals.</blockquote><p>It is not just about technology — it’s about aligning people, processes, and tools to unlock the value of data. Simply put, a good data strategy ensures that every data initiative, whether it’s analytics, reporting, AI, or compliance, directly contributes to the organization’s success.</p><p>A strong data strategy typically:</p><ul><li><strong>Aligns with business goals</strong> to ensure data is used for measurable impact.</li><li><strong>Outlines the technical architecture</strong> needed to store and process data.</li><li><strong>Defines governance frameworks</strong> for quality, privacy, and security.</li><li><strong>Establishes clear roles and responsibilities</strong> for data management.</li><li><strong>Prioritizes investments in tools and talent</strong> to maximize value.</li><li><strong>Provides measurement mechanisms</strong> to track progress and drive continuous improvement.</li></ul><h4>Defining Vision and Goals</h4><p>Every strategy begins with a vision, and data strategy is no different. The vision should be <strong><em>clear</em></strong>, <strong><em>aspirational</em></strong>, and <strong><em>aligned with broader organizational goals</em></strong>.</p><blockquote>For instance, an e-commerce company might <em>aim to increase customer retention by 15% through personalized recommendations based on user behavior data.</em></blockquote><p>But vision must be paired with measurable goals. Defining specific data objectives — such as improving reporting accuracy, reducing time to insights, or enhancing compliance monitoring — helps ensure that the strategy moves beyond theory into action. Equally important is communicating this vision across the organization to foster a shared data-driven culture.</p><h3>The Three Phases of Data Strategy</h3><p>A practical framework for data strategy involves three interconnected phases: <strong>Plan</strong>, <strong>Build</strong>, and <strong>Operate</strong>.</p><h4>1. Plan: Laying the Foundation</h4><p>Planning sets the groundwork by defining governance, architecture, and skill requirements.</p><ul><li><strong>Data Governance</strong>: Establish clear roles (data owners, stewards, users), enforce policies around data quality, access, and security, and create governance mechanisms to ensure compliance and resolve conflicts.</li><li><strong>Data Architecture</strong>: Design a flexible, scalable architecture that addresses current and future needs. Choose appropriate storage and processing systems (e.g., data lakes, warehouses, cloud) and implement ETL pipelines for smooth data flow.</li><li><strong>Talents &amp; Skills</strong>: Identify skill gaps, attract and retain data talent, and invest in upskilling existing employees in analytics and literacy to foster organization-wide competence.</li></ul><h4>2. 
Build: Creating Capabilities</h4><p>Once the foundation is set, the focus shifts to building capabilities that transform raw data into meaningful insights.</p><ul><li><strong>Data Quality</strong>: Define key quality dimensions such as accuracy, timeliness, and completeness. Implement data profiling and cleansing processes to fix errors and set up monitoring to ensure ongoing reliability.</li><li><strong>Data Analytics</strong>: Invest in a robust analytics stack that supports diagnostic, predictive, and prescriptive analytics. Develop semantic models, algorithms, and visualization tools accessible to both technical experts and business users.</li><li><strong>Data Security &amp; Privacy</strong>: Protect data through security controls, encryption, and monitoring. Ensure compliance with regulations like GDPR or CCPA and provide regular training to employees to strengthen data security culture.</li></ul><h4>3. Operate: Driving Continuous Improvement</h4><p>A data strategy is not static — it evolves with business priorities, regulatory requirements, and technological advancements.</p><ul><li><strong>Change Management</strong>: Treat data strategy as an ongoing journey. Continuously communicate its benefits, engage stakeholders, and foster adaptability in response to evolving data landscapes.</li><li><strong>Technology &amp; Infrastructure</strong>: Regularly evaluate and update tools to meet organizational needs. Maintain reliable infrastructure while experimenting with emerging technologies like machine learning, generative AI, and real-time analytics.</li><li><strong>Metrics &amp; Measurements</strong>: Define KPIs to measure progress (e.g., revenue uplift, customer satisfaction, cost reduction). Build dashboards to track performance, and use these insights to refine and optimize the strategy.</li></ul><h4>Why Data Strategy Matters</h4><p>Organizations that fail to manage data strategically often face challenges such as:</p><ul><li>Siloed and inconsistent data.</li><li>Poor decision-making due to lack of insights.</li><li>Higher risks of security breaches and compliance violations.</li><li>Wasted resources on duplicate or inefficient data projects.</li></ul><p>By contrast, organizations with a strong data strategy benefit from:</p><ul><li><strong>Operational efficiency</strong>: streamlined processes, fewer delays, and better productivity.</li><li><strong>Smarter decision-making</strong>: high-quality insights powering executive decisions.</li><li><strong>Customer satisfaction</strong>: personalized, data-driven experiences.</li><li><strong>Regulatory compliance</strong>: stronger protection against legal and reputational risks.</li><li><strong>Higher ROI</strong>: reduced oversight needs and better use of human and technical resources.</li></ul><p>A well-crafted data strategy is not a one-off project — it’s a living framework that evolves as business needs and technologies change. The ultimate goal is to make data a trusted, secure, and strategic asset that drives innovation, growth, and resilience.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c02180021573" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Agentic AI Workflows: Design Patterns, Examples, and What to Watch in 2025]]></title>
            <link>https://medium.com/codex/agentic-ai-workflows-design-patterns-examples-and-what-to-watch-in-2025-a3602b19b7e8?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/a3602b19b7e8</guid>
            <category><![CDATA[design-patterns]]></category>
            <category><![CDATA[ai-agent]]></category>
            <category><![CDATA[automation]]></category>
            <category><![CDATA[ai-workflow]]></category>
            <category><![CDATA[agentic-workflow]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Sat, 30 Aug 2025 11:01:36 GMT</pubDate>
            <atom:updated>2025-09-02T13:41:38.312Z</atom:updated>
            <content:encoded><![CDATA[<p>As Large Language Models (LLMs) evolve into autonomous agents, understanding agentic workflow design patterns has become essential for building robust agentic AI systems.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/0*VLB--vHlgXmLGJrx.png" /></figure><p>As AI systems mature, agentic workflows are emerging as the backbone of how LLMs and intelligent agents function in practice. These workflows define how agents perceive, plan, act, collaborate, and improve over time. In this article, we explore how design patterns and enterprise applications intersect to form the foundation of agentic AI.</p><p>With major enterprises already investing in these systems to scale faster and cut manual effort, the question isn’t <em>if</em> your organization will adopt agentic workflows, but <em>how soon</em>.</p><p>But what are agentic workflows, anyway?</p><p><strong>What is an agentic AI workflow?</strong></p><p>Agentic workflows are automated systems that combine artificial intelligence (AI) and machine learning (ML) to manage and execute tasks. These workflows are designed to take over repetitive and routine activities, allowing human workers to focus on high-value tasks that require creativity, strategic thinking, and decision-making.</p><blockquote>Unlike generative AI, which produces content when prompted, agentic AI proactively manages tasks, coordinates steps, and executes workflows toward goals. This makes it more suitable for complex, real-world environments where context, adaptability, and human collaboration are essential.</blockquote><p><strong>What makes a workflow “agentic?”</strong></p><p>To qualify as an agentic workflow, the AI system must operate autonomously, adapt its behavior based on outcomes, and work across multiple apps or environments while aligning with defined goals.</p><p>Characteristics of an agentic workflow:</p><ul><li><strong>Autonomous decision-making: </strong>the agent analyzes inputs, weighs options, and makes the next move on its own.</li><li><strong>Contextual awareness: </strong>the agent pulls from real-time data and historical inputs to guide decisions.</li><li><strong>Goal orientation: </strong>the agent is designed to work toward specific goals, such as meeting a delivery deadline or reducing resource strain.</li><li><strong>Real-time adaptation: </strong>the agent continuously recalculates based on changes and reconfigures its actions to stay aligned.</li></ul><p><strong>Components of agentic systems</strong></p><ol><li><strong>AI agents:</strong> These are the autonomous workhorses of agentic systems. AI agents perform tasks, make real-time decisions, and adapt based on data, goals, and feedback. They’re digital teammates that can triage tickets, reschedule tasks, or reroute priorities without waiting on a human handoff. AI agents learn from feedback and stored memory so they can continuously improve their performance over time.</li><li><strong>Large language models (LLMs): </strong>This is the reasoning layer that gives agents their brains. LLMs like GPT-5 or Claude allow agents to interpret goals, follow instructions, and communicate in natural language. This ability to understand context, plan next steps, and troubleshoot issues is what separates agentic AI from generative AI.</li><li><strong>Tools: </strong>Agents don’t live in isolation. Agentic workflows are often integrated with multiple data sources, such as project management software, CRM tools, and communication platforms. This integration ensures that agents have access to relevant, up-to-date information to perform tasks efficiently. Standards like the Model Context Protocol (MCP) come into play here, giving AI agents a plug-and-play interface to connect with real-world apps and services.</li><li><strong>Prompt engineering: </strong>This is how we tell agents <em>what</em> to do and <em>how</em> to do it. Effective prompt engineering gives AI agents the guidance they need to interpret complex tasks and execute them accurately. The better the prompt, the better the output, especially in multistep workflows where nuance matters.</li><li><strong>Multi-agent collaboration: </strong>One agent is helpful. Many agents working in sync? That’s powerful. Multi-agent collaboration allows several specialized AI agents, like a scheduling agent, a data analysis agent, and a compliance agent, to work together on larger goals.</li></ol>
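<p>To make these components concrete, here is a minimal, illustrative sketch of an agent loop in Python. Everything in it is a stand-in: llm() fakes the reasoning layer, lookup_order() fakes an external tool, and a real agent would parse tool arguments from the model’s output rather than hardcoding them. The shape of the loop (decide, act, observe, repeat) is the part that matters.</p><pre>def llm(prompt):<br>    # placeholder &#39;reasoning layer&#39;: a real agent would call a hosted LLM here<br>    return &#39;FINISH&#39; if &#39;ships&#39; in prompt else &#39;ACTION: lookup_order&#39;<br><br>def lookup_order(order_id):<br>    # placeholder tool: in practice this would query a CRM or order API<br>    return &#39;Order &#39; + order_id + &#39; ships tomorrow.&#39;<br><br>TOOLS = {&#39;lookup_order&#39;: lookup_order}<br><br>def run_agent(goal, max_steps=5):<br>    context = goal<br>    for _ in range(max_steps):<br>        decision = llm(context)                   # decide<br>        if decision.startswith(&#39;FINISH&#39;):<br>            return context<br>        tool = TOOLS[decision.split(&#39;: &#39;)[1]]     # act<br>        # argument hardcoded for brevity; a real loop would parse it from the LLM output<br>        context = context + &#39; &#39; + tool(&#39;A-1042&#39;)  # observe, then loop<br>    return context<br><br>print(run_agent(&#39;When does order A-1042 ship?&#39;))</pre>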
<h3>Agentic design patterns</h3><p>Agentic workflows aren’t built from scratch every time. They rely on strategic, repeatable execution modes called design patterns. These patterns serve as architectural blueprints for how AI agents behave in complex workflows. Whether it’s triaging customer tickets or managing cross-functional launches, each pattern represents a reliable way to deploy agentic workflows that scale.</p><p>Let’s break down some of the most effective agentic design patterns.</p><h3>Policy-Only Workflows</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qUmfNHGx1MCiYVadAnVZsQ.png" /></figure><h4>How it works</h4><p>The LLM acts as a direct policy model, generating actions or plans without search or feedback loops.</p><blockquote>ReAct gives AI agents the ability to talk themselves through problems, then do something about them. Instead of frontloading a plan or waiting for a complete reflection, this pattern has the AI agent think step by step and take action as needed, toggling between analysis and execution in real time.</blockquote><h4>When to use</h4><p>Best for well-defined tasks with clear action sequences.</p><h4>When not to use</h4><p>Avoid it for tasks with static, predefined answers, where step-by-step reasoning only adds latency, and for tasks that depend heavily on external tools or feedback, which this pattern does not use.</p><h4>Example</h4><p>Typical examples are ReAct for question answering and Plan-and-Solve for math problems.</p><h3>Feedback-Learning Workflows</h3><p>Iteratively improves responses through feedback from self-reflection, tools, the environment, or humans. Incorporates learning loops where agents analyze their performance and refine future attempts.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QSGmGbq5UmN0HzrptyN4wg.png" /></figure><h4>When to use</h4><p>Ideal for tasks requiring continuous improvement and error correction capabilities, like generating code or complex problem solving. It’s also useful when there’s a high cost of mistakes or risk of compounding errors.</p><h4>When not to use</h4><p>Avoid this pattern for simple, single-pass tasks: the extra rounds of critique and revision add cost and latency without improving the result.</p>
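<p>The feedback loop itself is simple to sketch. In the toy Python below, generate() and critique() are placeholders for an LLM call and a critic (self-reflection, a test suite, or a human reviewer); the point is the iterate-until-the-critic-is-satisfied structure, not the stub logic.</p><pre>def generate(task, feedback):<br>    # placeholder generator: a real agent would prompt an LLM with the task plus feedback<br>    if feedback:<br>        return &#39;draft v2 (fixed: &#39; + feedback[-1] + &#39;)&#39;<br>    return &#39;draft v1&#39;<br><br>def critique(draft):<br>    # placeholder critic: could be self-reflection, a test suite, a tool, or a human<br>    return None if &#39;fixed&#39; in draft else &#39;missing error handling&#39;<br><br>def refine(task, max_rounds=3):<br>    feedback = []<br>    draft = generate(task, feedback)<br>    for _ in range(max_rounds):<br>        issue = critique(draft)<br>        if issue is None:                 # nothing left to fix<br>            break<br>        feedback.append(issue)            # learn from the mistake<br>        draft = generate(task, feedback)  # and try again<br>    return draft<br><br>print(refine(&#39;write a CSV parser&#39;))</pre>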
<h3>Workflow Orchestration/Agentic Process Automation</h3><p>Automates complex business processes by orchestrating APIs and tools through LLM-driven workflow generation. This pattern is often enabled via protocols like MCP, which give agents a standardized way to understand and interact with tools in their environment.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*knE6fskiZck-dgv14RnV4w.png" /></figure><h4>When to use</h4><p>Use it to shift from manual Robotic Process Automation to intelligent Agentic Process Automation. It is ideal for enterprise automation, API integration, and dynamic workflow adaptation.</p><h4>When not to use</h4><p>This pattern isn’t necessary when the agent’s internal capabilities are enough to complete a task. Gathering external data may introduce an unnecessary layer of complexity.</p><h4>Example</h4><p>Travel assistants might use this design pattern to call external tools like flight booking APIs when they cannot access live flight information.</p><h3>Multi-Agent Workflow</h3><p>Multi-agent workflows involve multiple specialized agents collaborating with defined roles and coordination mechanisms such as voting, debate, or role-based responsibilities. These workflows can be organized in hierarchical structures or peer-to-peer networks, enabling complex task decomposition, parallel processing, and the integration of diverse agent capabilities for comprehensive problem-solving.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Q9NMOUUDFY76JuKOXrXLeQ.png" /></figure><h4>When to use</h4><p>A key application of this paradigm is Agentic Retrieval-Augmented Generation (RAG), where multiple retrieval agents operate in parallel, each optimized for specific data sources, while a central LLM synthesizes the retrieved information. This approach enhances accuracy, mitigates hallucinations, and effectively manages diverse knowledge bases.</p><h4>When not to use</h4><p>Multi-agent systems are very computationally intensive. In resource-constrained environments, a single-agent architecture may be more pragmatic.</p><h4>Example</h4><p>AI agents in project management can collaborate to accomplish various tasks. One agent could handle task assignments based on team availability, another could monitor timelines and flag risks in real time, and a third agent could generate daily progress summaries.</p>
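<p>That project-management example can be sketched in a few lines of Python. The three agents below are trivial placeholders standing in for LLM-backed specialists; what the sketch shows is the coordination structure, with specialized roles passing shared state through a simple sequential coordinator.</p><pre>def scheduler_agent(state):<br>    # placeholder specialist: assigns tasks based on team availability<br>    state[&#39;assignments&#39;] = &#39;tasks assigned by availability&#39;<br>    return state<br><br>def risk_agent(state):<br>    # placeholder specialist: monitors timelines and flags risks<br>    state[&#39;risks&#39;] = &#39;two deadlines flagged as at risk&#39;<br>    return state<br><br>def reporter_agent(state):<br>    # placeholder specialist: summarizes what the other agents produced<br>    state[&#39;summary&#39;] = state[&#39;assignments&#39;] + &#39;; &#39; + state[&#39;risks&#39;]<br>    return state<br><br># a deliberately simple sequential coordinator; frameworks such as CrewAI<br># or AutoGen layer messaging, voting, and role negotiation on top of this idea<br>CREW = [scheduler_agent, risk_agent, reporter_agent]<br><br>def run_crew(state):<br>    for agent in CREW:<br>        state = agent(state)  # each specialist reads and updates shared state<br>    return state<br><br>print(run_crew({})[&#39;summary&#39;])</pre>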
<h3>Hierarchical Agent Workflows</h3><p>Organizes agents in hierarchical structures with planner agents coordinating specialized executor agents. Enables complex task decomposition, error isolation, and specialized agent optimization while maintaining overall coordination and control.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LOs_kMikzwK-3xf46QRcGQ.png" /></figure><h4>When to use</h4><p>It’s especially useful for agentic workflows that require sequencing, coordination, and adaptability across long-running tasks.</p><h4>When not to use</h4><p>Using it for simple tasks that don’t require detailed planning would be overkill, and because results can vary from run to run, it is not advised for tasks that demand predictable outputs.</p><h4>Example</h4><p>For software development teams, an AI agent might break up a product launch into subtasks like design, development, testing, deployment, and monitoring.</p><h3>Benefits of agentic workflows</h3><p>Agentic workflows come with measurable benefits for teams under pressure to move fast without dropping the ball.</p><p>Benefits include:</p><ul><li>Improved task automation</li><li>Scalability across complex workflows</li><li>Faster decision-making</li><li>Less manual oversight across repetitive tasks</li><li>Better performance tracking via feedback loops</li></ul><h3>Top agentic frameworks to watch in 2025</h3><p><a href="https://www.microsoft.com/en-us/research/project/autogen/"><strong>Microsoft AutoGen</strong></a><strong>: </strong>A framework that makes it easy to design, manage, and observe collaborative agents working in conversation.</p><p><strong>Best for: </strong>Orchestrating multi-agent systems with structured dialogues</p><p><a href="https://www.langchain.com/"><strong>LangChain</strong></a><strong>: </strong>LangChain remains the go-to for building composable, agentic workflows using LLMs and external tools. Its ecosystem supports fast prototyping and real-world deployment.</p><p><strong>Best for:</strong> Developer-friendly, modular AI workflow construction</p><p><a href="https://www.langchain.com/langgraph"><strong>LangGraph</strong></a>: Built on LangChain, it is designed for agents that need persistent context and multistep planning. It is great for use cases like document processing, customer support, or multiturn project workflows that require checkpoints and fallbacks.</p><p><strong>Best for:</strong> Stateful, branching workflows with long-term memory</p><p><a href="https://www.crewai.com/"><strong>CrewAI</strong></a><strong>: </strong>CrewAI lets you assign “roles” to different agents and coordinate their work like a team.</p><p><strong>Best for:</strong> Managing multi-agent collaboration with minimal overhead</p><h3>Final thoughts</h3><p>In conclusion, agentic workflows play a transformative role by automating routine tasks such as routing customer inquiries, updating timelines, and managing administrative processes. This not only reduces the burden of repetitive work but also improves overall efficiency, minimizes delays, and enhances personal productivity. 
As a result, organizations adopting agentic workflows experience greater customer satisfaction and retention, while also benefiting from higher returns on investment due to reduced reliance on constant human oversight.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a3602b19b7e8" width="1" height="1" alt=""><hr><p><a href="https://medium.com/codex/agentic-ai-workflows-design-patterns-examples-and-what-to-watch-in-2025-a3602b19b7e8">Agentic AI Workflows: Design Patterns, Examples, and What to Watch in 2025</a> was originally published in <a href="https://medium.com/codex">CodeX</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[4 Techniques to Handle Imbalanced Datasets]]></title>
            <link>https://medium.com/codex/4-techniques-to-handle-imbalanced-datasets-f0eab38eee3d?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/f0eab38eee3d</guid>
            <category><![CDATA[imbalanced-data]]></category>
            <category><![CDATA[data-processing]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Fri, 08 Aug 2025 11:49:47 GMT</pubDate>
            <atom:updated>2025-09-04T12:00:50.942Z</atom:updated>
            <content:encoded><![CDATA[<p>Plus, how to recognize fool’s gold when you see it</p><figure><img alt="Illustration of Python methods to handle imbalanced datasets." src="https://cdn-images-1.medium.com/max/1024/1*LaWP9kyffgJ-K4Xa-Rc6FA.png" /><figcaption>Techniques to fix imbalanced data using Python.</figcaption></figure><p>Identifying fraud, diagnosing a rare disease, and predicting customer churn with machine learning are all difficult because you need to identify the outlier: the fraudulent credit card purchase, the rare disease, the customer who churns. These cases are challenging to predict correctly due to the lack of examples: Most credit card transactions are normal. Only a few are fraudulent. This uneven mix is known as class imbalance.</p><p>In most real-world datasets, one class appears much more often than the other. Class imbalance is a common challenge in machine learning and predictive modeling: it can lead to misleading model performance and poor predictions — especially for the less frequent class.</p><p>This article discusses the methods you can use to overcome the challenges of imbalanced data and shows how to apply each of them in Python.</p><figure><img alt="A graphical depiction of the class imbalance problem." src="https://cdn-images-1.medium.com/max/1024/0*aTTlqvZHtL_iIvkV.png" /><figcaption>A graphical depiction of the class imbalance problem.</figcaption></figure><h3>Observations after assessing the model</h3><p>As the right side of the graphic shows, the model trained on the imbalanced dataset was highly accurate at predicting the majority class but performed poorly for the minority class. In contrast, when we rebalanced the data, performance for both classes improved significantly.</p><p>Although the overall accuracy on the imbalanced dataset appears significantly higher, this is misleading — a phenomenon often described as “<strong><em>fool’s gold</em></strong>” in the data mining literature.</p><p>In conclusion, we can say that:</p><ul><li>Balanced datasets enable machine learning techniques to yield reasonably high prediction accuracy for both minority and majority classes.</li><li>Imbalanced datasets can cause machine learning models to make poor predictions.</li></ul>
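<p>You can reproduce this “fool’s gold” effect in a few lines of scikit-learn. On a synthetic dataset with a roughly 95:5 class split, the overall accuracy looks impressive while the per-class report exposes the weak minority-class recall:</p><pre>from sklearn.datasets import make_classification<br>from sklearn.linear_model import LogisticRegression<br>from sklearn.metrics import accuracy_score, classification_report<br>from sklearn.model_selection import train_test_split<br><br># synthetic dataset: roughly 95% majority class, 5% minority class<br>X, y = make_classification(n_samples=5000, weights=[0.95], random_state=42)<br>X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)<br><br>model = LogisticRegression(max_iter=1000).fit(X_train, y_train)<br>pred = model.predict(X_test)<br><br>print(accuracy_score(y_test, pred))         # high overall accuracy...<br>print(classification_report(y_test, pred))  # ...but check the minority-class recall</pre>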
<h3>Why do imbalanced datasets cause ML models to make poor predictions?</h3><p>This happens because machine learning algorithms learn from training data in a way that’s similar to how people learn from experience.</p><p>In humans (and likely in many animals), memory is influenced by repetition — experiences we see often create more permanent, vivid memories. That makes them easier to remember and recognize later. Rare experiences, on the other hand, may be overlooked or ignored.</p><p>In the same way, the learning process in machine learning can produce biased predictive models because it focuses primarily on patterns from the majority class while neglecting the specifics of the minority class.</p><h3>Which methods help overcome the imbalanced data problem?</h3><p>There are several ways to deal with imbalanced data and help predictive models perform better. Here we want to look at four.</p><figure><img alt="Taxonomy of methods used to handle the imbalanced data problem" src="https://cdn-images-1.medium.com/max/1024/0*HYvWQcLN2BOlVRxf.png" /><figcaption>Taxonomy of methods used to handle the imbalanced data problem</figcaption></figure><p>While some approaches are more complex than others, they all aim to ensure that the prediction algorithm pays equal attention to the patterns presented by all classes — majority, minority, and everything in between.</p><p>Some methods work by changing the distribution of classes in the training data, while others adjust the learning process by modifying the algorithm or changing the importance (or cost) of mistakes to prioritize accurate prediction of the minority class. (See cost-sensitive methods below.)</p><blockquote>Note: Misclassification cost refers to the consequences assigned to different types of prediction errors. In an imbalanced dataset, misclassifying a rare event (like disease or fraud) can be more serious than misclassifying a common event.</blockquote><p>Let’s have a look at these methods in more detail.</p><h3>#1. Data sampling methods</h3><p>Data sampling methods are among the most widely used techniques in data science and machine learning because they are simple to understand, formulate, and implement.</p><p>These methods can be categorized into two main classes: oversampling and undersampling.</p><h4>Oversampling methods</h4><p>Oversampling methods increase the number of minority class examples in two main ways:</p><ol><li>By replicating existing examples until the number is equal to the majority class. This can be done by simply copying the data or through bootstrapping techniques.</li><li>By synthetically generating new examples that are similar but not identical to the existing minority class samples.</li></ol><p>One well-known technique is SMOTE (Synthetic Minority Oversampling Technique), which uses the k-nearest neighbor algorithm to generate new examples.</p><pre>from imblearn.over_sampling import SMOTE<br><br>smote = SMOTE()<br>X_resampled, y_resampled = smote.fit_resample(X_train, y_train)</pre><p>SMOTE works well with datasets that consist primarily of numerical features, but it performs less well when the dataset has mostly categorical or nominal variables. Various variants of the SMOTE algorithm have been developed to address the shortcomings of the original method.</p><h4>Undersampling methods</h4><p>Undersampling methods keep all the minority class examples and randomly select an equal number of examples from the majority class. This means some of the majority class examples are removed from the training data.</p><p>The random selection can be done with replacement (where the same example might be picked more than once, known as bootstrapping) or without replacement (where each example is only picked once).</p><pre>from imblearn.under_sampling import RandomUnderSampler<br><br>rus = RandomUnderSampler()<br>X_resampled, y_resampled = rus.fit_resample(X_train, y_train)</pre><h3>#2. Cost-sensitive methods</h3><p>One way to handle class imbalance is by focusing on the cost of making mistakes — the cost of misclassifications.</p><p>While data sampling methods try to balance the dataset before the training process, cost-sensitive methods change how the model treats different types of classification errors by adjusting their costs.</p><p>Cost-sensitive methods assign different misclassification costs to various classes based on the degree of imbalance. 
For example, they assign higher costs to misclassification errors involving the minority class, since those errors are often more important.</p><pre>from sklearn.ensemble import RandomForestClassifier<br><br>clf = RandomForestClassifier(class_weight=&#39;balanced&#39;)<br>clf.fit(X_train, y_train)<br><br># or manually define custom weights and pass them in<br>weights = {0: 1, 1: 5}  # higher weight for the minority class<br>clf = RandomForestClassifier(class_weight=weights).fit(X_train, y_train)</pre><p>The goal is to either adjust the classification threshold or assign disproportionate costs to enhance the model’s focus on the minority class.</p><h3>#3. Algorithmic and one-class methods</h3><p>Another group of methods used to handle imbalanced data involves algorithmic adjustments to different classification algorithms. These are known as algorithmic methods.</p><p>While all of these adjusted algorithms aim to reduce the negative impact of imbalanced data, they use different techniques to do so.</p><p>Among the most well-studied approaches in this area are support vector machines (SVMs) and their variants.</p><p>The <strong>one-class method</strong> is another technique used in the machine learning community to tackle class imbalance. The core idea is to focus on just one class at a time during training.</p><pre>from sklearn.svm import OneClassSVM<br><br>model = OneClassSVM(gamma=&#39;auto&#39;).fit(X_train_majority)<br>predictions = model.predict(X_test)</pre><p>In this approach, the training samples consist solely of a single class label (e.g. only positive or only negative samples) so that the model can learn the specific characteristics of that class.</p><pre>from sklearn.ensemble import IsolationForest<br><br>iso = IsolationForest(contamination=0.05)<br>iso.fit(X_train)</pre><p>Because it learns the characteristics of a single class rather than differentiating between multiple classes, this approach is called “<strong>recognition-based</strong>,” while traditional methods are “<strong>discrimination-based</strong>.”</p><p>The goal with one-class methods is to create a model that is finely tuned to the characteristics of one class and can identify anything else as not belonging to that class.</p><p>One-class methods typically rely on three types of strategies:</p><ol><li>density-based characterization</li><li>boundary determination, and</li><li>reconstruction or evolution-based modeling.</li></ol><p>These strategies are similar in concept to clustering techniques like k-means, k-medoids, k-centers, and self-organizing maps.</p><h3>#4. Ensemble methods</h3><p>Ensemble methods have recently emerged as a popular and effective way to handle imbalanced data.</p><p>Unlike single prediction models, ensemble methods combine the predictions of multiple models. 
These can be the same type of model (called homogeneous ensembles) or different types (called heterogeneous ensembles).</p><p>Variants of both bagging and boosting have been proposed to deal with class imbalance issues.</p><ul><li>In bagging, the data is sampled in a way that gives more attention to the minority class.</li></ul><pre>from imblearn.ensemble import BalancedBaggingClassifier<br>from sklearn.tree import DecisionTreeClassifier<br><br># note: recent imbalanced-learn versions use estimator=;<br># older releases used base_estimator=<br>model = BalancedBaggingClassifier(<br>    estimator=DecisionTreeClassifier(),<br>    sampling_strategy=&#39;auto&#39;,<br>    replacement=False<br>)<br>model.fit(X_train, y_train)</pre><ul><li>In boosting, the model increases the weight of minority class examples to help improve their prediction.</li></ul><pre>from xgboost import XGBClassifier<br><br># rule of thumb: scale_pos_weight = n_negative / n_positive<br>model = XGBClassifier(scale_pos_weight=10)</pre><blockquote>Another approach being explored as a potential solution to the class imbalance problem is active learning. Active learning is a methodology that learns iteratively in a piecewise manner. It focuses on the most useful data at each stage to better handle class imbalance.</blockquote><h3>Looking ahead: The ongoing search for better solutions</h3><p>Despite numerous efforts to overcome the class imbalance issue in the machine learning community, the current state-of-the-art approaches are limited to heuristic solutions and ad hoc methodologies.</p><p>There is still a lack of universally accepted theories, methodologies, and best practices. While many studies claim to have developed data balancing methods that improve prediction accuracy for the minority class, a significant number of them also conclude that these methodologies may degrade prediction accuracy for the majority class and overall classification accuracy.</p><p>Ongoing research seeks to address questions such as, “Can a universal methodology yield better prediction results?” or “Can there be an algorithm that prescribes the best data balancing technique for a specific machine learning approach and the data available?”</p><p><strong><em>Enjoyed this post?</em></strong> Check out my other articles for more insights on machine learning, data science, and Python programming.</p><ul><li><a href="https://medium.com/@pgshanding/building-trust-in-ai-aa063e144cbf">Building Trust in AI: Five Pillars of Ethical AI</a></li><li><a href="https://medium.com/@pgshanding/advanced-prompting-techniques-for-enhanced-ai-performance-3c8814ceecf7">Advanced Prompting Techniques for Enhanced AI Performance</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f0eab38eee3d" width="1" height="1" alt=""><hr><p><a href="https://medium.com/codex/4-techniques-to-handle-imbalanced-datasets-f0eab38eee3d">4 Techniques to Handle Imbalanced Datasets</a> was originally published in <a href="https://medium.com/codex">CodeX</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Supervised Fine-Tuning (SFT) vs. Retrieval-Augmented Generation (RAG)]]></title>
            <link>https://medium.com/@pgshanding/supervised-fine-tuning-sft-vs-retrieval-augmented-generation-rag-c8e67295ceba?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/c8e67295ceba</guid>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[supervised-fine-tuning]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Mon, 26 May 2025 16:45:46 GMT</pubDate>
            <atom:updated>2025-05-26T16:45:46.195Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="SFT vs RAG" src="https://cdn-images-1.medium.com/max/1024/1*a7Xt3urzNqMpXh4nW8X08A.png" /></figure><h3>Introduction</h3><p>Large Language Models (LLMs) have revolutionized natural language processing (NLP) by enabling machines to generate, understand, and interact with human language at unprecedented levels. However, to optimize their performance for specific tasks or domains, these models often require further enhancement. Two widely adopted strategies for this are <strong>Supervised Fine-Tuning (SFT)</strong> and <strong>Retrieval-Augmented Generation (RAG)</strong>. While both approaches enhance the capabilities of LLMs, they differ significantly in methodology, data needs, and use cases. This article explores both techniques in depth and offers guidance on when to apply each.</p><h3>What is Supervised Fine-Tuning (SFT)?</h3><blockquote>Supervised Fine-Tuning (SFT) refers to the process of adapting a pre-trained language model to a specific task using a labeled dataset. It involves continuing the training of a model on domain-specific examples with known input-output pairs.</blockquote><h3>How SFT Works</h3><ul><li><strong>Pretrained Base Model</strong>: Start with a general-purpose LLM (e.g., GPT, BERT).</li><li><strong>Labeled Dataset</strong>: Use a curated dataset containing input-output pairs.</li><li><strong>Training</strong>: Adjust the model’s parameters to minimize prediction errors on the training data.</li><li><strong>Deployment</strong>: Deploy the fine-tuned model for the specific downstream task.</li></ul><h3>Ideal Scenarios for SFT</h3><ul><li>When high-quality labeled data is available.</li><li>For tasks with clear objectives, such as sentiment analysis, summarization, or named entity recognition.</li><li>In closed-domain settings where the scope of information is well-defined.</li></ul><h3>What is Retrieval-Augmented Generation (RAG)?</h3><blockquote>Retrieval-Augmented Generation (RAG) is an architecture that enhances language models by incorporating an external retrieval mechanism. Instead of relying solely on pre-trained knowledge, RAG fetches relevant information from a large corpus in real time and incorporates it into its responses.</blockquote><h3>How RAG Works</h3><ul><li><strong>Retriever Module</strong>: Searches an external corpus (e.g., Wikipedia, private documents) to find contextually relevant content based on the user query.</li><li><strong>Reader (Generator) Module</strong>: A language model processes the query along with the retrieved documents to produce an informed response.</li><li><strong>Integrated Pipeline</strong>: The retriever and reader operate together in an end-to-end workflow.</li></ul><h3>Ideal Scenarios for RAG</h3><ul><li>In open-domain tasks that require up-to-date or expansive domain-specific knowledge.</li><li>When labeled data is limited, but large volumes of unstructured data are accessible.</li><li>For applications like question answering, dynamic customer support, and research assistance.</li></ul><h3>SFT vs. RAG: A Comparative Analysis</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*K9aO1a7lOhMEX_VY7sxDnw.png" /></figure><figure><img alt="SFT vs RAG: Comparative analysis" src="https://cdn-images-1.medium.com/max/953/1*Yy8oiYcPU9rfRK0SLQuRgQ.png" /><figcaption>Comparative analysis of SFT and RAG</figcaption></figure><h3>Complementary Use</h3><p>SFT and RAG can work in tandem. 
<h3>SFT vs. RAG: A Comparative Analysis</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*K9aO1a7lOhMEX_VY7sxDnw.png" /></figure><figure><img alt="SFT vs RAG: Comparative analysis" src="https://cdn-images-1.medium.com/max/953/1*Yy8oiYcPU9rfRK0SLQuRgQ.png" /><figcaption>Comparative analysis of SFT and RAG</figcaption></figure><h3>Complementary Use</h3><p>SFT and RAG can work in tandem. For example, a model can be fine-tuned via SFT to adopt a desired tone or structure, while RAG provides access to dynamic, up-to-date knowledge. This hybrid approach blends precision with flexibility.</p><h3>Pros and Cons</h3><h3>Supervised Fine-Tuning (SFT)</h3><p><strong>Pros:</strong></p><ul><li>Delivers high accuracy on well-defined tasks</li><li>Allows customization for tone and format</li><li>Produces consistent and predictable results</li></ul><p><strong>Cons:</strong></p><ul><li>Requires significant time and resources for training</li><li>Dependent on the availability of labeled data</li><li>Poor adaptability to unseen or evolving queries</li></ul><h3>Retrieval-Augmented Generation (RAG)</h3><p><strong>Pros:</strong></p><ul><li>Provides access to current and domain-specific knowledge</li><li>Requires minimal labeled data</li><li>Adapts easily across multiple tasks and domains</li></ul><p><strong>Cons:</strong></p><ul><li>Involves a more complex system architecture</li><li>Inference may be slower due to retrieval overhead</li><li>Response quality depends on the relevance of retrieved documents</li></ul><h3>Use Cases</h3><h3>When to Use SFT</h3><ul><li><strong>Customer Feedback Classification</strong>: Fine-tune on labeled feedback data for sentiment analysis.</li><li><strong>Legal Document Summarization</strong>: Train a model using summaries written by legal professionals.</li><li><strong>Healthcare Chatbots</strong>: Customize a model based on medical conversations reviewed by experts.</li></ul><h3>When to Use RAG</h3><ul><li><strong>Customer Support Chatbots</strong>: Retrieve the latest policy documents to handle varied queries.</li><li><strong>Academic Research Assistants</strong>: Retrieve and summarize relevant scholarly articles.</li><li><strong>Enterprise Knowledge Management</strong>: Enable staff to query internal documentation without model retraining.</li></ul><h3>Conclusion</h3><p>Supervised Fine-Tuning (SFT) and Retrieval-Augmented Generation (RAG) offer distinct advantages for enhancing language models. SFT excels in scenarios with ample labeled data and clearly defined tasks, delivering high accuracy and predictability. RAG, by contrast, thrives in open-ended, knowledge-intensive applications where flexibility and access to real-time information are essential.</p><p>Choosing between SFT and RAG depends on your goals, data availability, and operational context. In many situations, a combination of both — using SFT for structure and RAG for content — yields optimal performance. By understanding each approach’s strengths and trade-offs, practitioners can design robust, efficient, and intelligent NLP systems tailored to their needs.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c8e67295ceba" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>