<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Shanding P. G on Medium]]></title>
        <description><![CDATA[Stories by Shanding P. G on Medium]]></description>
        <link>https://medium.com/@pgshanding?source=rss-83d30594ec28------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*TKN6r8Rwf6Bm4nLwbOey-g.jpeg</url>
            <title>Stories by Shanding P. G on Medium</title>
            <link>https://medium.com/@pgshanding?source=rss-83d30594ec28------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 01:52:51 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@pgshanding/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[A Practical Framework for Explaining Visuals That Actually Drive Insight: TOPT]]></title>
            <link>https://medium.com/@pgshanding/a-practical-framework-for-explaining-visuals-that-actually-drive-insight-topt-a3e754614f6f?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/a3e754614f6f</guid>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[analytics]]></category>
            <category><![CDATA[topt]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Mon, 04 May 2026 14:01:01 GMT</pubDate>
            <atom:updated>2026-05-04T14:01:01.347Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="TOPT transform data into insight" src="https://cdn-images-1.medium.com/max/871/1*0gmOZYdjaDnqpwu2EYIhMQ.png" /><figcaption>TOPT: Transform data into insight.</figcaption></figure><p>In data analytics, most people focus heavily on <strong>building dashboards, charts, and models</strong>. But there’s a quieter, more critical skill that often gets overlooked:</p><blockquote><em>Explaining what the data actually means.</em></blockquote><p>You can build the most visually appealing dashboard in Power BI or Tableau, but if you cannot clearly communicate the insight behind it, the value of your analysis drops significantly. This is where <strong>TOPT</strong> comes in.</p><p>TOPT is a <strong>highly effective structure for interpreting and communicating data visualizations</strong>. It is widely used in training environments and by practitioners who care about clarity and decision-making.</p><h3>What is TOPT?</h3><p>TOPT is a simple framework that helps you explain any data visualization in a structured, logical way. It stands for:</p><ul><li><strong>T — Title (or Key Message)</strong></li><li><strong>O — Overview</strong></li><li><strong>P — Pattern</strong></li><li><strong>T — Takeaway</strong></li></ul><p>At its core, TOPT forces you to move from <strong>“what is this chart showing?”</strong> to <strong>“why does this matter?”</strong></p><figure><img alt="TOPT step by step overview" src="https://cdn-images-1.medium.com/max/1024/1*euaozVoheq13v11K3qr2JQ.png" /><figcaption>TOPT step by step overview</figcaption></figure><h3>Why Most Analysts Struggle with Explaining Visuals</h3><p>Before diving deeper into TOPT, it’s worth understanding the common failure modes in analytics communication:</p><h3>1. Description without insight</h3><blockquote>“This chart shows sales by month…”</blockquote><blockquote><strong>That’s not analysis. That’s narration.</strong></blockquote><h3>2. Unstructured observations</h3><blockquote>“Sales increased here, dropped there, and also something happened in March…”</blockquote><blockquote><strong>This creates cognitive overload and weakens your message.</strong></blockquote><h3>3. No clear takeaway</h3><p>The audience is left wondering:</p><blockquote>“So what should I do with this?”</blockquote><h3>4. Over-reliance on visuals</h3><p>Many assume the chart “speaks for itself.” It doesn’t.</p><blockquote><strong><em>Charts support thinking — they don’t replace explanation.</em></strong></blockquote><h3>The TOPT Framework Explained</h3><p>Let’s break down each component in a way that reflects real analytical thinking.</p><h3>1. T — Title (Key Message)</h3><p>This is the <strong>headline insight</strong>.</p><blockquote>Not the chart title like: <strong><em>“Monthly Sales Data”</em></strong></blockquote><blockquote>But the actual message: <strong><em>“Sales grew consistently after Q1 due to improved conversion rates.”</em></strong></blockquote><p>This does two things:</p><ul><li>Anchors your audience immediately</li><li>Forces you to clarify your own thinking</li></ul><blockquote><em>If you can’t write the title clearly, you probably don’t understand the insight yet.</em></blockquote><h3>2. O — Overview</h3><p>Now you provide context.</p><ul><li>What data is being shown?</li><li>What variables are involved?</li><li>What time frame or segmentation exists?</li></ul><p>Example:</p><blockquote>“This chart shows monthly revenue from January to June 2026 across all product categories.”</blockquote><p>Keep this concise. 
The goal is orientation, not analysis.</p><h3>3. P — Pattern</h3><p>This is where analytical thinking becomes visible.</p><p>You identify:</p><ul><li>Trends (upward/downward)</li><li>Variations (spikes, dips)</li><li>Comparisons (categories, segments)</li><li>Anomalies (outliers, unexpected behavior)</li></ul><p>Example:</p><blockquote>“Revenue declined slightly in March but increased steadily from April through June, with the steepest growth between May and June.”</blockquote><p>At this stage, you are answering: <strong><em>What is happening in the data?</em></strong></p><h3>4. T — Takeaway</h3><p>This is the most important step.</p><p>You interpret the pattern and connect it to meaning:</p><ul><li>Why is this happening?</li><li>What does it imply?</li><li>What decision or action does it inform?</li></ul><p>Example:</p><blockquote>“The post-March growth suggests the new marketing campaign improved customer acquisition and conversion rates.”</blockquote><p>This is where you transition from <strong>analysis → insight → value</strong>.</p><h3>A Full Example Using TOPT</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/636/1*9orUdaO0_DNaFsfYjDS6hw.png" /><figcaption>Sample Chart</figcaption></figure><p>Let’s combine everything into a single narrative:</p><blockquote><strong><em>Title:</em></strong><em> Sales growth accelerated after Q1 due to improved conversion strategies</em></blockquote><blockquote><strong><em>Overview:</em></strong><em> This chart shows monthly revenue from January to June 2026</em></blockquote><blockquote><strong><em>Pattern:</em></strong><em> Revenue dipped slightly in March but increased consistently from April onward, with the highest jump in June</em></blockquote><blockquote><strong><em>Takeaway:</em></strong><em> The upward trend suggests that the new campaign launched in April significantly improved conversion rates</em></blockquote><p>Notice how structured, clear, and decision-ready this is.</p><h3>Why TOPT Works</h3><h3>1. It enforces clarity</h3><p>You cannot hide behind vague explanations.</p><h3>2. It improves stakeholder communication</h3><p>Executives and non-technical audiences don’t want raw data — they want <strong>meaning</strong>.</p><h3>3. It builds analytical discipline</h3><p>You move through a logical chain:</p><blockquote><em>Context → Observation → Interpretation → Insight</em></blockquote><h3>4. It scales across tools</h3><p>Whether you’re using:</p><ul><li>Excel</li><li>Power BI</li><li>Tableau</li><li>Python (Matplotlib/Seaborn)</li></ul><p>TOPT remains applicable.</p><h3>When to Use TOPT</h3><p>TOPT is especially useful in:</p><ul><li>Dashboard walkthroughs</li><li>Business presentations</li><li>Stakeholder reports</li><li>Data storytelling sessions</li><li>Training and teaching analytics</li></ul><p>If you are presenting <strong>any chart</strong>, you should be thinking in TOPT.</p><h3>Common Mistakes When Using TOPT</h3><h3>1. Weak or generic titles</h3><blockquote><em>“Sales Data Overview” → not useful</em></blockquote><h3>2. Skipping the takeaway</h3><p>This turns your explanation into commentary instead of insight.</p><h3>3. Mixing overview and pattern</h3><p>Keep context separate from analysis.</p><h3>4. 
Overcomplicating the explanation</h3><p>TOPT is meant to simplify, not add jargon.</p><h3>Practical Tip: Use TOPT in Dashboards</h3><p>If you’re building dashboards:</p><ul><li>Use the <strong>Title</strong> as your insight headline</li><li>Add a short <strong>Overview</strong> in tooltips or subtitles</li><li>Highlight <strong>Patterns</strong> with annotations</li><li>Include <strong>Takeaways</strong> in summary cards or notes</li></ul><p>This turns dashboards from passive visuals into <strong>decision tools</strong>.</p>
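<p>The same idea can be wired directly into a chart. Here is a minimal Matplotlib sketch (the figures and the campaign detail are illustrative, echoing the example above) that puts the key message in the title, the overview in an axis label, the pattern in an annotation, and the takeaway in a caption:</p><pre>import matplotlib.pyplot as plt<br><br>months = [&quot;Jan&quot;, &quot;Feb&quot;, &quot;Mar&quot;, &quot;Apr&quot;, &quot;May&quot;, &quot;Jun&quot;]<br>revenue = [120, 125, 118, 130, 142, 160]  # illustrative figures<br><br>fig, ax = plt.subplots()<br>ax.plot(months, revenue, marker=&quot;o&quot;)<br><br># T: the title carries the key message, not a generic label<br>ax.set_title(&quot;Sales growth accelerated after Q1&quot;)<br><br># O: a short overview fits in an axis label or subtitle<br>ax.set_ylabel(&quot;Monthly revenue, Jan to Jun 2026&quot;)<br><br># P: annotate the pattern on the chart (month index 2 = March)<br>ax.annotate(&quot;Dip in March&quot;, xy=(2, 118), xytext=(1, 135),<br>            arrowprops={&quot;arrowstyle&quot;: &quot;-&gt;&quot;})<br><br># T: the takeaway sits under the figure as a caption<br>fig.text(0.5, 0.01, &quot;April campaign likely drove the post-Q1 lift&quot;, ha=&quot;center&quot;)<br><br>plt.show()</pre><p>Rendered this way, the chart walks a reader through TOPT even when no presenter is in the room.</p><h3>Final Thoughts</h3><p>TOPT is simple, but it addresses a fundamental gap in analytics:</p><blockquote><em>The gap between </em><strong><em>seeing data</em></strong><em> and </em><strong><em>understanding it</em></strong><em>.</em></blockquote><p>In a world filled with dashboards and metrics, the real differentiator is not who can build charts — it’s who can <strong>explain them clearly and extract meaning</strong>.</p><p>If you adopt TOPT consistently, you’ll notice a shift:</p><ul><li>Your explanations become sharper</li><li>Your insights become clearer</li><li>Your impact becomes more tangible</li></ul><p>Because at the end of the day, analytics is not about charts.</p><p>It’s about <strong>decisions informed by insight</strong>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a3e754614f6f" width="1" height="1" alt="">]]></content:encoded>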
        </item>
        <item>
            <title><![CDATA[Clean Code for Career Changers: How to Write Code Professionals Are Proud Of]]></title>
            <link>https://medium.com/@pgshanding/clean-code-for-career-changers-how-to-write-code-professionals-are-proud-of-8aea221d944b?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/8aea221d944b</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[clean-code]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Thu, 08 Jan 2026 15:57:28 GMT</pubDate>
            <atom:updated>2026-01-08T15:57:28.943Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*MrGbyJRRcEViXz01LDebNA.jpeg" /></figure><p>Entering the world of software development can feel overwhelming. You learn syntax, frameworks, tools, and then someone drops a term like “clean code” on you and suddenly you’re questioning everything you just wrote.</p><blockquote>At its core, clean code isn’t about perfection; it’s about clarity and maintainability. You want code that doesn’t just work, but that the next engineer (or future you) can easily read, understand, and improve.</blockquote><h3>What Is Clean Code and Why It Matters</h3><p>Clean code refers to code that is easy to read, understand, and modify. It’s foundational to scalable software development.</p><p>Clean code becomes crucial when:</p><ul><li>You’re collaborating with other developers.</li><li>You’re onboarding onto an existing project.</li><li>You need to maintain, extend, or debug software long after it was first written.</li></ul><h3>Practical Principles of Clean Code</h3><h4>1. Use Meaningful Names</h4><p>Using descriptive names helps anyone reading the code quickly understand what each variable or function represents, reducing confusion and errors.</p><p><strong>Bad:</strong></p><pre>x = 10<br>y = 20<br>z = x + y</pre><p><strong>Good:</strong></p><pre>apples_count = 10<br>oranges_count = 20<br>total_fruits = apples_count + oranges_count</pre><h4>2. Keep Functions Short and Focused</h4><p>Small, focused functions are easier to test, maintain, and reuse. Each function should perform a single, well-defined task.</p><p><strong>Bad:</strong></p><pre>def process_order(order):<br>    # validation, discounting, and invoicing all tangled in one function<br>    if not order.items:<br>        raise ValueError(&quot;Order has no items&quot;)<br>    total = sum(item.price for item in order.items)<br>    if order.customer.is_premium:<br>        total = total * 0.9<br>    send_email(order.customer.email, f&quot;Invoice total: {total}&quot;)</pre><p><strong>Good:</strong></p><pre>def validate_order(order):<br>    # validation logic<br>    pass<br><br>def calculate_discount(order):<br>    # discount logic<br>    pass<br><br>def apply_discount(order):<br>    # apply discount logic<br>    pass<br><br>def send_invoice(order):<br>    # send invoice logic<br>    pass<br><br>def process_order(order):<br>    # orchestrate the single-purpose steps<br>    validate_order(order)<br>    calculate_discount(order)<br>    apply_discount(order)<br>    send_invoice(order)</pre><h4>3. Avoid Hard-Coded Values</h4><p>Using named constants makes your code more readable and easier to maintain, and prevents errors when values change.</p><p><strong>Bad:</strong></p><pre>discount = subtotal * 0.1</pre><p><strong>Good:</strong></p><pre>DISCOUNT_RATE = 0.1<br>discount = subtotal * DISCOUNT_RATE</pre><h4>4. Write Comments That Add Value</h4><p>Comments should explain the reasoning behind the code, not repeat what the code already shows. This helps future readers understand the “why”.</p><p><strong>Bad:</strong></p><pre># increment x by 1<br>x += 1</pre><p><strong>Good:</strong></p><pre># increment score for correct answer<br>score += 1</pre><h4>5. Follow Established Style Guidelines</h4><p>Consistent formatting improves readability and collaboration across teams, making it easier to maintain a shared codebase.</p><p><strong>Bad:</strong></p><pre>function myFunc(){console.log(&quot;hello&quot;)}</pre><p><strong>Good:</strong></p><pre>function myFunc() {<br>    console.log(&quot;hello&quot;);<br>}</pre>
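<p>Style guides are easiest to follow when a tool enforces them for you. As a quick illustration (assuming a Python codebase; the filename here is hypothetical), an auto-formatter such as Black applies a consistent style automatically:</p><pre>pip install black<br>black order_processing.py  # rewrites the file with consistent formatting</pre><p>Formatters take style debates out of code review entirely, which is exactly what a shared codebase needs.</p><h4>6. 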
Refactor Continuously</h4><p>Regularly revisiting and improving code reduces complexity and prevents technical debt, keeping the codebase healthy over time.</p><p><strong>Bad:</strong></p><pre>if user_type == &#39;admin&#39;:<br>    permissions = [&#39;read&#39;,&#39;write&#39;,&#39;delete&#39;]<br>elif user_type == &#39;editor&#39;:<br>    permissions = [&#39;read&#39;,&#39;write&#39;]<br>elif user_type == &#39;viewer&#39;:<br>    permissions = [&#39;read&#39;]</pre><p><strong>Good:</strong></p><pre>USER_PERMISSIONS = {<br>    &#39;admin&#39;: [&#39;read&#39;,&#39;write&#39;,&#39;delete&#39;],<br>    &#39;editor&#39;: [&#39;read&#39;,&#39;write&#39;],<br>    &#39;viewer&#39;: [&#39;read&#39;]<br>}<br>permissions = USER_PERMISSIONS.get(user_type, [])</pre><h3>How AI Can Help You Write Cleaner Code</h3><p><strong>Code Suggestions and Autocomplete:</strong> AI tools like GitHub Copilot suggest idiomatic code as you type.</p><p><strong>AI-Assisted Code Review and Quality Checks:</strong> Platforms can automatically detect duplication, complexity, or style violations.</p><p><strong>Automated Test Generation:</strong> AI can generate unit tests that enforce good code structure.</p><p><strong>Real-Time Coding Feedback:</strong> AI-augmented IDEs provide instant suggestions to improve readability and maintainability.</p><p><strong>Balancing AI with Human Judgment:</strong> Always review AI-generated code to ensure correctness, security, and alignment with project goals.</p><p>Clean code is a hallmark of professionalism. For anyone upskilling or pivoting into tech, mastering these practices early positions you as a reliable, thoughtful developer. Pair these principles with AI tools strategically, and you’ll not only write code that works — you’ll write code that’s good.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8aea221d944b" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Dashboardification of the Data Analyst]]></title>
            <link>https://medium.com/@pgshanding/the-dashboardification-of-the-data-analyst-28e282db601d?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/28e282db601d</guid>
            <category><![CDATA[dashboard]]></category>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[data-visualization]]></category>
            <category><![CDATA[business-intelligence]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Fri, 05 Dec 2025 12:28:56 GMT</pubDate>
            <atom:updated>2025-12-05T18:20:49.845Z</atom:updated>
            <content:encoded><![CDATA[<p>A reflection on how analytics drifted from rigorous inquiry to dashboard assembly lines.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/680/0*QXR0CZkECe2GRs6J" /></figure><p>Over the past decade, data analysis has undergone an identity crisis — one shaped by bootcamps, social media hype, and the mass migration of people “breaking into tech.” Somewhere between the endless carousel of <em>“Top 10 Power BI Interview Questions”</em> and the explosion of two-week certificate courses, the data analyst morphed from an investigator, a thinker, a translator of ambiguity… into a dashboard factory.</p><p>It didn’t happen overnight. But today, the profession sits in a strange place where many analysts are celebrated not for the quality of their thinking, but for how visually pleasing their business intelligence tool of choice can make a bar chart look.</p><p>And so we arrive at the <strong>Dashboardification Era</strong>.</p><h3>When Everyone Became a Data Analyst</h3><p>The promise was seductive:</p><p><em>“Learn SQL, Excel, and Power BI. Earn in dollars. Work from anywhere.”</em></p><p>Influencers and training programs told a generation that data analysis was the new gold rush — low barrier to entry, high pay, and a straightforward skillset. And in some ways, they weren’t wrong. Tools did become easier. Visualization platforms became drag-and-drop. Cloud analytics made storage trivial. Even Python and SQL felt less intimidating as the ecosystem matured.</p><p>But with that wave came something else: <strong>mass saturation</strong>.</p><p>Suddenly, thousands of aspiring analysts flooded the market, each armed with the same dashboards, the same resume template, the same “projects” built from the same publicly available datasets. In many portfolios, one could barely find a problem statement or a question worth answering — just charts wrapped in vibrant color palettes.</p><p><strong><em>The field was no longer about thinking. It was about tool usage.</em></strong></p><h3>The Erosion of First Principles</h3><p>Before the boom, data analysis was fundamentally about <strong>reasoning</strong>:</p><ul><li>What question are we answering?</li><li>Why does the data look like this?</li><li>What assumptions underlie these numbers?</li><li>What does the business need to understand?</li><li>What is the story behind the anomaly?</li></ul><p>Today, many new analysts can build a dashboard but cannot explain whether an observed trend is statistically significant, or whether a spike represents signal or noise. Some can join tables but cannot articulate <em>why</em> the tables should be joined that way. Many can calculate KPIs but cannot describe whether those KPIs actually matter to the business.</p><p>It’s not their fault — not entirely.<br>The industry incentivized speed over depth, aesthetics over accuracy, and tool proficiency over conceptual mastery.</p><p>In hiring pipelines, recruiters ask:</p><blockquote>“Can they use Power BI?” <br>instead of <br>“Can they think?”</blockquote><p>The result is predictable:<br><strong><em>A generation of analysts optimized to build dashboards but not to build insight.</em></strong></p><h3>BI Tools as a Crutch, Not a Craft</h3><p>Power BI, Tableau, Looker, Metabase — these tools are incredible. They democratized analytics. They empowered teams. 
They reduced friction.</p><p>But somewhere along the way, the profession mistook the tool for the work.</p><p>We forgot:</p><ul><li>A dashboard is not an analysis.</li><li>A chart is not a conclusion.</li><li>A metric is not an insight.</li><li>A beautiful report is not a business impact.</li></ul><p>In the worst cases, teams become addicted to <strong>“dashboard theatre”</strong> — the production of visually stimulating but strategically empty artifacts that create the illusion of intelligence.</p><blockquote>Executives say, “Show me the dashboard,”<br>when they should be saying,<br>“Help me understand what matters.”</blockquote><h3>A Saturated Field Creates Shallow Incentives</h3><p>With so many people trying to enter the profession, the path of least resistance has become the default:</p><ul><li>Learn a few DAX functions</li><li>Build a portfolio with three dashboards</li><li>Upload to GitHub</li><li>Share on LinkedIn with the caption “My first Power BI project!”</li><li>Repeat</li></ul><p>But this approach strips the soul from the craft.<br>It replaces curiosity with templates.<br>It reduces analysis to decoration.</p><p>And worst of all, it creates the illusion that data work is easy — that it begins and ends with a dashboard.</p><h3>What We Lost Along the Way</h3><p>The dashboardification of analytics has quietly eroded:</p><h4>1. Problem-solving discipline</h4><p>Asking “why” five times. Challenging assumptions. Understanding causality. Testing hypotheses.</p><h4>2. Statistical thinking</h4><p>Sample bias, confidence intervals, distributions, outliers — too often ignored.</p><h4>3. Business acumen</h4><p>Knowing what decisions matter. Understanding incentives. Connecting insights to actions.</p><h4>4. Narrative clarity</h4><p>Data storytelling beyond color gradients — true communication, not decoration.</p><h4>5. Intellectual craftsmanship</h4><p>Treating analysis as an act of thinking, not an act of clicking.</p><h3>But It’s Not All Doom</h3><p>The field is not dead — it’s simply noisy.</p><p>True analysts still exist.<br>They are the ones who:</p><ul><li>Start with a question, not a dashboard.</li><li>Refuse to visualize until they understand.</li><li>Can defend every assumption in their SQL query.</li><li>See data not as pixels, but as decisions.</li><li>Know when NOT to build a dashboard.</li></ul><p>These analysts are not threatened by the saturation.<br>They rise above it.</p><h3>The Way Back: A Rebellious Proposal</h3><p>If the profession wants to reclaim its identity, we need a new ethos:</p><h4>1. Less tooling, more thinking</h4><blockquote>The tool should serve the mind, not replace it.</blockquote><h4>2. Prioritize problem statements, not “projects”</h4><p>We don’t need more dashboards; we need more questions.</p><h4>3. Make dashboards the end, not the beginning</h4><p>A dashboard is a <em>delivery mechanism,</em> not the analysis itself.</p><h4>4. Celebrate insight, not aesthetics</h4><p>The most beautiful chart is the one that changes a decision.</p><h3>The Real Call to Action</h3><p>This is not a critique of “new analysts.”<br>It is a critique of what the industry — intentionally or not — told them the profession should be.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=28e282db601d" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Compliance Wake-Up Call for Data-Driven Organizations]]></title>
            <link>https://medium.com/@pgshanding/the-compliance-wake-up-call-for-data-driven-organizations-a1865fa70691?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/a1865fa70691</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[gdpr]]></category>
            <category><![CDATA[data-protection]]></category>
            <category><![CDATA[data-governance]]></category>
            <category><![CDATA[compliance]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Mon, 20 Oct 2025 09:22:12 GMT</pubDate>
            <atom:updated>2025-10-20T09:22:12.588Z</atom:updated>
            <content:encoded><![CDATA[<p>Why Data Governance Must Become Part of Your Business DNA</p><blockquote>“In today’s world, data governance isn’t optional — it’s survival.”</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SVo6QZ9R2-dm-ArO.png" /></figure><p>For a long time, data governance sat quietly in the background — a technical chore for IT teams, filed somewhere between system backups and password policies. But the world has changed.</p><p>Today, every click, campaign, and customer conversation is powered by data. And when data drives everything, governance can’t live in a corner anymore.</p><p>Welcome to the new reality — where compliance is a boardroom conversation, and data responsibility is everyone’s job.</p><p>Regulations like GDPR (Europe), CCPA (California), POPIA (South Africa), and NDPR (Nigeria) have redrawn the lines of accountability. A single mistake — a misplaced file, an unprotected record, an unauthorized access — can lead to crippling fines, reputational damage, or loss of public trust.</p><p>The message is clear:</p><blockquote>“Data governance is no longer about where data lives — it’s about how it’s used, shared, and protected by everyone in the organization.”</blockquote><p>From HR managing employee information to marketing teams running personalized campaigns, every department now handles sensitive data. And that means every department shares liability.</p><p>The organizations that lead in this era are the ones that don’t wait for compliance deadlines — they build governance into their DNA, treating it as a strategic advantage, not a bureaucratic burden.</p><h3>Why Governance Frameworks Fail — And What to Do About It</h3><p><em>It’s easy to design a policy. It’s hard to make people live by it.</em></p><p>Most governance frameworks fail not because they’re poorly written, but because they’re poorly adopted. Employees see them as obstacles, not enablers — as red tape, not responsibility.</p><p>And when culture resists, compliance collapses.</p><blockquote>“A governance framework without cultural buy-in is like a security system everyone ignores.”</blockquote><p>Technology can help, but it can’t fix culture. Yes, tools like eDiscovery, data classification platforms, and cloud compliance monitors can automatically detect and secure sensitive information. But they can’t replace human accountability.</p><p>That’s why the most compliant organizations don’t start with software — they start with people.</p><p>They:<br>1. Train teams to understand the “why,” not just the “what.”<br>2. Communicate clearly from leadership down to every desk.<br>3. Reward responsible data behavior the same way they reward innovation.</p><p>When employees see that governance <strong>protects their work, reputation, and the company’s integrity</strong>, it stops being a box to tick — it becomes <strong>part of the culture</strong>.</p><h3>Making Governance Everyone’s Responsibility</h3><p>In too many companies, governance is treated like a legal problem or a technical checklist. But data doesn’t care about your org chart — it flows across systems, teams, and continents.</p><p>That means governance must, too.</p><p>The smartest organizations are tearing down silos by appointing <strong>data stewards — </strong>champions within departments who act as local custodians of good data practices. 
These stewards bridge the gap between <strong>policy</strong> and <strong>execution</strong>, making governance visible in day-to-day workflows.</p><p>Meanwhile, cross-functional collaboration is non-negotiable.</p><p>When <strong>marketing aligns with IT</strong>, <strong>HR collaborates with compliance</strong>, and <strong>leadership sets the tone</strong>, governance stops being an afterthought and becomes an <strong>operational standard</strong>.</p><blockquote><em>“You can’t govern what you can’t see. Visibility across teams is the foundation of accountability.”</em></blockquote><p>Unified dashboards, clear definitions, and transparent ownership turn governance from a concept into a living, breathing process.</p><h3>Governance, Trust, and the NDPR Advantage</h3><p>Let’s be clear: compliance isn’t just about avoiding fines — it’s about <strong>earning trust</strong>.</p><p>Customers and regulators are becoming more vigilant, and in Nigeria, the <strong>NDPR (Nigeria Data Protection Regulation)</strong> has raised the bar. The NDPR isn’t a local inconvenience — it’s a <strong>signal</strong> that Nigeria’s digital economy is maturing, demanding accountability from every data-driven organization.</p><p>Companies that proactively implement NDPR-aligned governance aren’t just protecting themselves — they’re positioning for <strong>global competitiveness</strong>.</p><blockquote><em>“Trust is the new currency. And data governance is how you mint it.”</em></blockquote><p>When you handle data ethically and transparently, you don’t just comply — you <strong>differentiate</strong>.</p><h3>From Policy to Power: The Leadership Imperative</h3><p>Data governance has evolved from a technical necessity into a <strong>strategic weapon</strong>.</p><p>Boards are realizing that <strong>poor data management isn’t an IT risk — it’s a business risk</strong>. Data breaches destroy trust. Misuse of personal information drives customers away. And regulatory fines can cripple growth.</p><p>Strong governance, on the other hand, builds <strong>resilience</strong>, <strong>efficiency</strong>, and <strong>credibility</strong>.</p><blockquote><em>“Good governance doesn’t slow innovation — it accelerates it by removing uncertainty.”</em></blockquote><p>Nigeria’s business landscape is digital, connected, and accelerating fast. Leaders who ignore governance are gambling with their organization’s future. Those who embrace it will lead industries — not react to them.</p><h3>Final Thought</h3><p>Compliance isn’t paperwork.<br><strong>It’s protection.<br>It’s reputation.<br>It’s leadership.</strong></p><p>Whether you’re a <strong>CTO building systems</strong>, a <strong>data analyst managing insights</strong>, or a <strong>CEO shaping strategy</strong>, the message is the same:<br><strong>Governance is everyone’s job.</strong></p><p>When organizations embed trust, transparency, and accountability into how they handle data, they don’t just meet regulations — they <strong>set the standard</strong>.</p><blockquote><em>“In the data economy, trust is your greatest asset — and governance is how you earn it.”</em></blockquote><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a1865fa70691" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building Trust in Your Data: A Practical Guide to Data Quality Tests]]></title>
            <link>https://python.plainenglish.io/building-trust-in-your-data-a-practical-guide-to-data-quality-tests-a326eef3bb95?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/a326eef3bb95</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[data-analysis]]></category>
            <category><![CDATA[great-expectations]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-quality]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Mon, 06 Oct 2025 15:15:05 GMT</pubDate>
            <atom:updated>2025-10-15T14:55:01.803Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*R5rh2hMljPLTpCUI.png" /></figure><p>In today’s data-driven world, businesses are only as good as the data they rely on. But here’s the hard truth: not all data can be trusted. Bad data sneaks in quietly, leading to poor insights, flawed decisions, and expensive mistakes.</p><p>That’s why data quality testing is no longer optional — it’s a necessity. If you’re just getting started, this guide walks you through the most important types of data quality tests, why they matter, and — most importantly — how to implement them in Python using Great Expectations.</p><h3>Why Data Quality Testing Matters</h3><p>Imagine building a house on sand. That’s what it’s like to make strategic decisions without checking the reliability of your data.</p><p>A good data quality process ensures:</p><ul><li><strong>Confidence</strong> — You know your data is accurate, complete, and usable.</li><li><strong>Efficiency</strong> — Your team spends less time firefighting data issues.</li><li><strong>Better decisions</strong> — Analytics, AI models, and dashboards give real insight, not noise.</li></ul><h3>The Pitfalls of Anomaly Detection</h3><blockquote>Anomaly detection is often the first step teams take. It’s appealing because it’s automated, standardized, and easy to apply across lots of data. But here’s the catch: anomaly detection is low information<strong>.</strong></blockquote><p><strong>What anomaly detection can tell you</strong></p><ul><li>When and where something unusual happened</li><li>The characteristics of the anomaly</li></ul><p><strong>What it <em>can’t</em> tell you</strong></p><ul><li>The business significance of the anomaly</li><li>Whether it’s urgent</li><li>Which stakeholders are affected</li></ul><p>This often leads to alert fatigue — hundreds of signals but little actionable insight.</p><p><strong>Use anomaly detection sparingly. It’s best for</strong></p><ul><li>Monitoring <strong>new or unfamiliar data sources</strong></li><li>Establishing a <strong>baseline</strong> when you lack stakeholder knowledge</li></ul><p>Over time, replace anomaly detection with <strong>specific, high-information tests</strong>.</p><h3>The Basics: Four Practical Dimensions of Data Quality</h3><p>When starting out, focus on these <strong>four core test dimensions</strong>. They cover the most common data issues and give you strong confidence in your pipelines.</p><h4>1. 
Missingness</h4><p><strong>Goal:</strong> Ensure data isn’t unexpectedly missing — or appearing when it shouldn’t.</p><p><strong>Example:</strong> In a customer dataset, email should never be null.</p><pre>import great_expectations as gx<br>import pandas as pd<br><br># Sample dataset<br>data = pd.DataFrame({<br>    &quot;customer_id&quot;: [1, 2, 3, 4],<br>    &quot;email&quot;: [&quot;a@test.com&quot;, &quot;b@test.com&quot;, None, &quot;d@test.com&quot;]<br>})<br><br># Create a GE context from dataframe<br># (method names vary across GX releases; this follows the 0.18 fluent API)<br>context = gx.get_context()<br>datasource = context.sources.add_pandas(&quot;my_datasource&quot;)<br>data_asset = datasource.add_dataframe_asset(&quot;customers&quot;)<br>batch_request = data_asset.build_batch_request(dataframe=data)<br><br># Create Expectation Suite<br>context.add_expectation_suite(&quot;missingness_suite&quot;)<br><br># Null check: email should not be null<br>validator = context.get_validator(<br>    batch_request=batch_request,<br>    expectation_suite_name=&quot;missingness_suite&quot;<br>)<br>validator.expect_column_values_to_not_be_null(&quot;email&quot;)<br><br># Run and view results<br>results = validator.validate()<br>print(results)</pre><p>In the example above, the test will flag the missing email value in row 3.</p><h4>2. Schema</h4><p><strong>Goal:</strong> Verify that expected columns are present, in the right order, and of the right type.</p><p><strong>Example:</strong> You expect customer_id (int) and email (string) to exist.</p><pre># Expect specific schema<br>validator.expect_table_columns_to_match_set(<br>    column_set=[&quot;customer_id&quot;, &quot;email&quot;]<br>)<br><br># Expect correct data types<br>validator.expect_column_values_to_be_of_type(&quot;customer_id&quot;, &quot;int64&quot;)<br>validator.expect_column_values_to_be_of_type(&quot;email&quot;, &quot;object&quot;)</pre><h4>3. Volume</h4><p><strong>Goal:</strong> Check that the number of rows is within expected bounds.</p><p><strong>Example:</strong> Yesterday’s transactions table usually has 10k–15k rows.</p><pre># Expect row count to be within range<br>validator.expect_table_row_count_to_be_between(min_value=10000, max_value=15000)</pre><h4>4. Ranges</h4><p><strong>Goal:</strong> Ensure numbers and dates fall within acceptable ranges.</p><p><strong>Example:</strong> order_amount should always be &gt; 0</p><pre># Expect order_amount column values to be strictly positive<br>validator.expect_column_values_to_be_between(&quot;order_amount&quot;, min_value=0, strict_min=True)</pre><h3>Taking It to the Next Level</h3><p>Once the basics are in place, you can implement <strong>advanced tests</strong>. These go beyond simple presence and ranges to ensure your data is <strong>valid, unique, and relationally consistent</strong>.</p><h4>Validity Tests</h4><p><strong>Goal:</strong> Check that values are plausible, not just present. 
Validity ensures that the data falls within logical or business-defined sets, ranges, or patterns.</p><p><strong>Example:</strong> status must always be one of {Pending, Active, Closed}.</p><pre># Expect status column values to belong to a fixed set<br>validator.expect_column_values_to_be_in_set(<br>    &quot;status&quot;, [&quot;Pending&quot;, &quot;Active&quot;, &quot;Closed&quot;]<br>)</pre><p>You can also apply <strong>numeric validity</strong> checks (e.g., salaries, dates) or <strong>pattern validity</strong> (e.g., emails match regex).</p><h4>Uniqueness Tests</h4><p><strong>Goal:</strong> Ensure that values that should be unique actually are unique (e.g., primary keys), and that values expected to be diverse aren’t unexpectedly duplicated.</p><p><strong>Example:</strong> Each customer_id must be unique.</p><pre># Expect customer_id column values to be unique<br>validator.expect_column_values_to_be_unique(&quot;customer_id&quot;)</pre><p>This prevents duplicate records from creeping into your dataset.</p><h4>Referential Integrity Tests</h4><p><strong>Goal:</strong> Validate relationships between columns or tables. Referential integrity ensures that related data is properly connected.</p><p><strong>Example 1:</strong> Cross-column integrity → A day=31 should never pair with month=February.</p><pre># Check cross-column integrity: only valid (month, day) pairs pass<br># (illustrative; extend the set with the valid pairs for the other months)<br>validator.expect_column_pair_values_to_be_in_set(<br>    &quot;month&quot;, &quot;day&quot;,<br>    value_pairs_set=[(&quot;February&quot;, day) for day in range(1, 29)]<br>)</pre><p><strong>Example 2:</strong> Cross-table integrity → Every customer_id in the orders table must exist in the customers table.</p><pre># Check referential integrity between orders and customers<br># (GX core has no direct foreign-key expectation for pandas, so<br># validate membership against the parent table&#39;s keys; customers_df<br># here stands for the customers table loaded as a DataFrame)<br>validator.expect_column_values_to_be_in_set(<br>    &quot;customer_id&quot;, customers_df[&quot;customer_id&quot;].tolist()<br>)</pre><p>This ensures that your <strong>foreign keys</strong> and relationships remain consistent.</p><h3>Final Thoughts</h3><blockquote>Data quality isn’t about perfection — it’s about trust.</blockquote><p>Start small with the basics: <strong>missingness, schema, volume, and ranges.</strong> Then iterate with more advanced tests as your pipelines mature.</p><p>By replacing vague anomaly alerts with <strong>explicit, expressive tests</strong>, you’ll create a system that not only detects problems but also tells you exactly what action to take.</p><p>The result?</p><ul><li>Fewer surprises</li><li>Faster fixes</li><li>And ultimately, <strong>data you can rely on</strong></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/210/0*FGJ_kDIR_4zNv3cZ" /><figcaption>Great Expectations</figcaption></figure><p><strong>Want to dive deeper?</strong> Check out <a href="https://greatexpectations.io">Great Expectations</a>, the open-source framework powering these examples.</p>
<img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a326eef3bb95" width="1" height="1" alt=""><hr><p><a href="https://python.plainenglish.io/building-trust-in-your-data-a-practical-guide-to-data-quality-tests-a326eef3bb95">Building Trust in Your Data: A Practical Guide to Data Quality Tests</a> was originally published in <a href="https://python.plainenglish.io">Python in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a Free & Open-Source Starter Toolkit for Data Scientists]]></title>
            <link>https://medium.com/@pgshanding/building-a-free-open-source-starter-toolkit-for-data-scientists-b317985e8ee0?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/b317985e8ee0</guid>
            <category><![CDATA[api]]></category>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Mon, 06 Oct 2025 15:01:56 GMT</pubDate>
            <atom:updated>2025-10-06T15:01:56.260Z</atom:updated>
            <content:encoded><![CDATA[<h3>The Practical Data Scientist’s Free and Open-Source Toolkit</h3><p>If you’re a data scientist or an aspiring one, you know the challenge: juggling datasets, cleaning pipelines, running experiments, sharing results, and monitoring models in production. Paid tools can help, but what if you could build a complete workflow with free and open-source software?</p><p>In this post, we’ll walk through a practical starter stack that covers the entire lifecycle — from data management to monitoring your APIs. The best part? All of these tools are free, open source, and production-ready.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Oblnwqc8KTo5isJCRJdbig.png" /><figcaption>Data Science workflow with tools</figcaption></figure><blockquote>You don’t need enterprise subscriptions to achieve professional‑grade workflows. The combination of api‑analytics, MLflow, Streamlit, and the others forms a powerful, modular stack you can customize for your own projects.</blockquote><h4><strong>High-Level Workflow Diagram</strong></h4><p>This flow shows how the tools complement each other: versioning, validation, experiments, sharing, and monitoring.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SPCgglNPW1bc0-mpd7r42w.png" /></figure><h4>1. Monitoring API Usage with api-analytics</h4><p><a href="https://www.apianalytics.dev/">api-analytics</a> is a lightweight open-source tool for monitoring your APIs. It tracks request counts, response times, and error rates.</p><p>Setting up api-analytics is quick and straightforward. Below is a sample code snippet that shows how to set up api-analytics for a Flask API. It also supports Django, FastAPI, and Tornado, and client libraries are available for other languages such as JavaScript and Go.</p><pre>from flask import Flask, jsonify<br>from api_analytics.flask import add_middleware<br><br>app = Flask(__name__)<br>add_middleware(app, &lt;API-KEY&gt;)  # Add middleware with your API key<br><br>@app.get(&#39;/&#39;)<br>def root():<br>    return jsonify(<br>        {&#39;message&#39;: &#39;Hello, World!&#39;}<br>    )<br><br>@app.get(&quot;/predict&quot;)<br>def predict():<br>    return jsonify(<br>        {&quot;prediction&quot;: 42}<br>    )<br><br>if __name__ == &quot;__main__&quot;:<br>    app.run(debug=True)</pre><h4>2. Tracking Experiments with MLflow + DVC</h4><blockquote>Managing experiments can get messy. MLflow keeps track of runs, parameters, and results, while DVC handles dataset and model versioning.</blockquote><p><a href="https://mlflow.org/"><strong>MLflow</strong></a></p><p>MLflow is an open-source platform, purpose-built to assist machine learning practitioners and teams in handling the complexities of the machine learning process. 
MLflow focuses on the full lifecycle for machine learning projects, ensuring that each phase is manageable, traceable, and reproducible.</p><pre>import mlflow<br>import mlflow.sklearn<br>from sklearn.linear_model import LogisticRegression<br>from sklearn.datasets import load_iris<br>from sklearn.model_selection import train_test_split<br><br>X, y = load_iris(return_X_y=True)<br>X_train, X_test, y_train, y_test = train_test_split(X, y)<br><br># Track params, metrics, and the model inside a run<br>with mlflow.start_run():<br>    model = LogisticRegression(max_iter=200)<br>    model.fit(X_train, y_train)<br><br>    acc = model.score(X_test, y_test)<br>    mlflow.log_param(&quot;max_iter&quot;, 200)<br>    mlflow.log_metric(&quot;accuracy&quot;, acc)<br>    mlflow.sklearn.log_model(model, &quot;logreg_model&quot;)</pre><p><a href="https://dvc.org/"><strong>DVC</strong></a></p><p>Data Version Control (DVC) lets you capture the versions of your data and models in Git commits, while storing them on-premises or in cloud storage. It also provides a mechanism to switch between these different data contents. The result is a single history for data, code, and ML models that you can traverse — a proper journal of your work!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*hx8wjo_H6XLDO3ay.png" /><figcaption><em>DVC matches the right versions of data, code, and models for you 💘</em></figcaption></figure><pre># Initialize DVC in your project<br>dvc init<br><br># Track a dataset<br>dvc add data/raw_dataset.csv<br>git add data/raw_dataset.csv.dvc .gitignore<br>git commit -m &quot;Track raw dataset with DVC&quot;</pre><p>This way, you can reproduce any experiment with exact data + parameters.</p><h4>3. Ensuring Data Quality with Great Expectations</h4><p>Great Expectations (GE) is an open-source Python library designed for data validation, profiling, and quality control. It allows you to define, test, and document your data expectations in a structured and automated way. It integrates seamlessly with popular data tools like Pandas, SQL databases, and Spark.</p><pre># Legacy PandasDataset API; newer GX releases use a context/validator workflow<br>from great_expectations.dataset import PandasDataset<br>import pandas as pd<br><br># Load data<br>df = pd.read_csv(&quot;data/raw_dataset.csv&quot;)<br>validated_df = PandasDataset(df)<br><br># Add expectations<br>validated_df.expect_column_values_to_not_be_null(&quot;age&quot;)<br>validated_df.expect_column_values_to_be_between(&quot;age&quot;, 18, 90)<br>results = validated_df.validate()<br>print(results)</pre><h4>4. Sharing Results with Streamlit</h4><p><a href="https://streamlit.io/"><strong>Streamlit </strong></a>is an open-source framework designed to create and share beautiful web applications for data science and machine learning projects. It is specifically built for Python, making it easy for data scientists and machine learning engineers to deploy their models and visualize data without needing extensive knowledge of web development. Streamlit makes it easy to turn scripts into apps.</p><pre>import streamlit as st<br>import pandas as pd<br>import joblib<br><br># Load the trained model (saved locally; see the note below)<br>model = joblib.load(&quot;logreg_model.pkl&quot;)<br><br>st.title(&quot;Iris Classifier&quot;)<br>sepal_length = st.slider(&quot;Sepal Length&quot;, 4.0, 8.0)<br>sepal_width = st.slider(&quot;Sepal Width&quot;, 2.0, 5.0)<br><br># Petal values are fixed for this demo; the model expects four features<br>X_new = pd.DataFrame([[sepal_length, sepal_width, 3.5, 1.4]])<br>prediction = model.predict(X_new)<br><br>st.write(f&quot;Prediction: {prediction[0]}&quot;)<br></pre><pre>streamlit run app.py</pre>
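<p>One gap worth closing: the Streamlit app loads logreg_model.pkl from disk, but the MLflow snippet above only logged the model to the tracking store. A minimal bridge, assuming you append it to the end of the training script, is to also save the fitted model locally:</p><pre>import joblib<br><br># Save the fitted model next to app.py so Streamlit can load it<br>joblib.dump(model, &quot;logreg_model.pkl&quot;)</pre><h4>5. 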
BI Dashboards with Metabase or Superset</h4><ul><li><a href="https://www.metabase.com/"><strong>Metabase </strong></a>→ Metabase is an open-source business intelligence (BI) tool that allows users to create interactive dashboards, visualize data, and analyze insights without requiring advanced technical skills. Dashboards in Metabase aggregate multiple data visualizations, metrics, and interactive elements into a single interface, enabling users to monitor and analyze key performance indicators (KPIs) and trends effectively.</li><li><a href="https://superset.apache.org/"><strong>Superset </strong></a>→ Apache Superset is an open-source data exploration and visualization platform designed for creating interactive dashboards and analyzing data efficiently. It is a modern alternative to proprietary business intelligence tools, offering a wide range of features for users of all skill levels.</li></ul><p>Both connect to databases and let you build dashboards without coding.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6b8VQ25QhTbaDNu3QC_WjQ.png" /></figure><h4>6. Synthetic Data &amp; Annotation with Faker + Label Studio</h4><p>Sometimes you need test data or labeled datasets.</p><p><a href="https://faker.readthedocs.io/en/master/"><strong>Faker</strong></a></p><p>Faker is a Python package that generates fake data for you. Whether you need to bootstrap your database, create good-looking XML documents, fill-in your persistence to stress test it, or anonymize data taken from a production service, Faker is for you.</p><pre>from faker import Faker<br><br>fake = Faker()<br>for _ in range(5):<br>    print(<br>        {<br>            &quot;name&quot;: fake.name(),<br>            &quot;email&quot;: fake.email(),<br>            &quot;transaction&quot;: fake.random_number(digits=6)<br>        }<br>    )</pre><p><a href="https://labelstud.io/"><strong>Label Studio</strong></a></p><p>Label Studio is an open source data labeling tool. It lets you label data types like audio, text, images, videos, and time series with a simple and straightforward UI and export to various model formats.</p><pre>pip install label-studio<br>label-studio start</pre><p>This launches a UI for annotating images, text, or tabular data.</p><h4>Putting It All Together</h4><p>With this toolkit:</p><ul><li><strong>api-analytics</strong> → Monitor your API usage</li><li><strong>MLflow + DVC</strong> → Track &amp; reproduce experiments</li><li><strong>Great Expectations</strong> → Validate data quality</li><li><strong>Streamlit</strong> → Build interactive apps</li><li><strong>Metabase / Superset</strong> → Share dashboards</li><li><strong>Faker + Label Studio</strong> → Generate &amp; annotate datasets</li></ul><h4>End-to-End Workflow Visualization</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Bt_XlsQ70l-kKEDtzOFfAw.png" /></figure><p>We have explored the entire data science journey — from sourcing and validating data to experimenting, visualizing, and monitoring real-world deployments. Every stage can be handled with free, open‑source tools that scale with your growth and creativity.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b317985e8ee0" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A Practical Guide to Building a Winning Data Strategy]]></title>
            <link>https://medium.com/@pgshanding/a-practical-guide-to-building-a-winning-data-strategy-c02180021573?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/c02180021573</guid>
            <category><![CDATA[data]]></category>
            <category><![CDATA[data-governance]]></category>
            <category><![CDATA[data-analytics]]></category>
            <category><![CDATA[data-strategy]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Sat, 06 Sep 2025 11:01:32 GMT</pubDate>
            <atom:updated>2025-09-06T11:01:32.560Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rBy9Wh3ll4xtVcwNBXN_vQ.png" /></figure><p>In today’s data-driven world, organizations are collecting more data than ever before. But having data is not the same as using it effectively. To turn raw information into a strategic asset, businesses need a well-crafted data strategy — a roadmap that guides how data is collected, stored, managed, governed, and leveraged to achieve business goals.</p><h3>What is a Data Strategy?</h3><blockquote>A data strategy is a comprehensive roadmap that defines how an organization will collect, store, manage, govern, and leverage its data assets to achieve specific business goals.</blockquote><p>It is not just about technology — it’s about aligning people, processes, and tools to unlock the value of data. Simply put, a good data strategy ensures that every data initiative, whether it’s analytics, reporting, AI, or compliance, directly contributes to the organization’s success.</p><p>A strong data strategy typically:</p><ul><li><strong>Aligns with business goals</strong> to ensure data is used for measurable impact.</li><li><strong>Outlines the technical architecture</strong> needed to store and process data.</li><li><strong>Defines governance frameworks</strong> for quality, privacy, and security.</li><li><strong>Establishes clear roles and responsibilities</strong> for data management.</li><li><strong>Prioritizes investments in tools and talent</strong> to maximize value.</li><li><strong>Provides measurement mechanisms</strong> to track progress and drive continuous improvement.</li></ul><h4>Defining Vision and Goals</h4><p>Every strategy begins with a vision, and data strategy is no different. The vision should be <strong><em>clear</em></strong>, <strong><em>aspirational</em></strong>, and <strong><em>aligned with broader organizational goals</em></strong>.</p><blockquote>For instance, an e-commerce company might <em>aim to increase customer retention by 15% through personalized recommendations based on user behavior data.</em></blockquote><p>But vision must be paired with measurable goals. Defining specific data objectives — such as improving reporting accuracy, reducing time to insights, or enhancing compliance monitoring — helps ensure that the strategy moves beyond theory into action. Equally important is communicating this vision across the organization to foster a shared data-driven culture.</p><h3>The Three Phases of Data Strategy</h3><p>A practical framework for data strategy involves three interconnected phases: <strong>Plan</strong>, <strong>Build</strong>, and <strong>Operate</strong>.</p><h4>1. Plan: Laying the Foundation</h4><p>Planning sets the groundwork by defining governance, architecture, and skill requirements.</p><ul><li><strong>Data Governance</strong>: Establish clear roles (data owners, stewards, users), enforce policies around data quality, access, and security, and create governance mechanisms to ensure compliance and resolve conflicts.</li><li><strong>Data Architecture</strong>: Design a flexible, scalable architecture that addresses current and future needs. Choose appropriate storage and processing systems (e.g., data lakes, warehouses, cloud) and implement ETL pipelines for smooth data flow.</li><li><strong>Talents &amp; Skills</strong>: Identify skill gaps, attract and retain data talent, and invest in upskilling existing employees in analytics and literacy to foster organization-wide competence.</li></ul><h4>2. 
Build: Creating Capabilities</h4><p>Once the foundation is set, the focus shifts to building capabilities that transform raw data into meaningful insights.</p><ul><li><strong>Data Quality</strong>: Define key quality dimensions such as accuracy, timeliness, and completeness. Implement data profiling and cleansing processes to fix errors and set up monitoring to ensure ongoing reliability.</li><li><strong>Data Analytics</strong>: Invest in a robust analytics stack that supports diagnostic, predictive, and prescriptive analytics. Develop semantic models, algorithms, and visualization tools accessible to both technical experts and business users.</li><li><strong>Data Security &amp; Privacy</strong>: Protect data through security controls, encryption, and monitoring. Ensure compliance with regulations like GDPR or CCPA and provide regular training to employees to strengthen data security culture.</li></ul><h4>3. Operate: Driving Continuous Improvement</h4><p>A data strategy is not static — it evolves with business priorities, regulatory requirements, and technological advancements.</p><ul><li><strong>Change Management</strong>: Treat data strategy as an ongoing journey. Continuously communicate its benefits, engage stakeholders, and foster adaptability in response to evolving data landscapes.</li><li><strong>Technology &amp; Infrastructure</strong>: Regularly evaluate and update tools to meet organizational needs. Maintain reliable infrastructure while experimenting with emerging technologies like machine learning, generative AI, and real-time analytics.</li><li><strong>Metrics &amp; Measurements</strong>: Define KPIs to measure progress (e.g., revenue uplift, customer satisfaction, cost reduction). Build dashboards to track performance, and use these insights to refine and optimize the strategy.</li></ul><h4>Why Data Strategy Matters</h4><p>Organizations that fail to manage data strategically often face challenges such as:</p><ul><li>Siloed and inconsistent data.</li><li>Poor decision-making due to lack of insights.</li><li>Higher risks of security breaches and compliance violations.</li><li>Wasted resources on duplicate or inefficient data projects.</li></ul><p>By contrast, organizations with a strong data strategy benefit from:</p><ul><li><strong>Operational efficiency</strong>: streamlined processes, fewer delays, and better productivity.</li><li><strong>Smarter decision-making</strong>: high-quality insights powering executive decisions.</li><li><strong>Customer satisfaction</strong>: personalized, data-driven experiences.</li><li><strong>Regulatory compliance</strong>: stronger protection against legal and reputational risks.</li><li><strong>Higher ROI</strong>: reduced oversight needs and better use of human and technical resources.</li></ul><p>A well-crafted data strategy is not a one-off project — it’s a living framework that evolves as business needs and technologies change. The ultimate goal is to make data a trusted, secure, and strategic asset that drives innovation, growth, and resilience.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c02180021573" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Agentic AI Workflows: Design Patterns, Examples, and What to Watch in 2025]]></title>
            <link>https://medium.com/codex/agentic-ai-workflows-design-patterns-examples-and-what-to-watch-in-2025-a3602b19b7e8?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/a3602b19b7e8</guid>
            <category><![CDATA[design-patterns]]></category>
            <category><![CDATA[ai-agent]]></category>
            <category><![CDATA[automation]]></category>
            <category><![CDATA[ai-workflow]]></category>
            <category><![CDATA[agentic-workflow]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Sat, 30 Aug 2025 11:01:36 GMT</pubDate>
            <atom:updated>2025-09-02T13:41:38.312Z</atom:updated>
            <content:encoded><![CDATA[<p>As Large Language Models (LLMs) evolve into autonomous agents, understanding agentic workflow design patterns has become essential for building robust agentic AI systems.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/0*VLB--vHlgXmLGJrx.png" /></figure><p>As AI systems mature, agentic workflows are emerging as the backbone of how LLMs and intelligent agents function in practice. These workflows define how agents perceive, plan, act, collaborate, and improve over time. In this article, we explore how design patterns and enterprise applications intersect to form the foundation of agentic AI.</p><p>With major enterprises already investing in these systems to scale faster and cut manual effort, the question isn’t <em>if</em> your organization will adopt agentic workflows, but <em>how soon</em>.</p><p>But what are agentic workflows, anyway?</p><p><strong>What is an agentic AI workflow?</strong></p><p>Agentic workflows are automated systems that combine artificial intelligence (AI) and machine learning (ML) to manage and execute tasks. These workflows are designed to take over repetitive and routine activities, allowing human workers to focus on high-value tasks that require creativity, strategic thinking, and decision-making.</p><blockquote>Unlike generative AI, which produces content when prompted, agentic AI proactively manages tasks, coordinates steps, and executes workflows toward goals. This makes it more suitable for complex, real-world environments where context, adaptability, and human collaboration are essential.</blockquote><p><strong>What makes a workflow “agentic?”</strong></p><p>To qualify as an agentic workflow, the AI system must operate autonomously, adapt its behavior based on outcomes, and work across multiple apps or environments while aligning with defined goals.</p><p>Characteristics of an agentic workflow:</p><ul><li><strong>Autonomous decision-making: </strong>the agent analyzes inputs, weighs options, and makes the next move on its own.</li><li><strong>Contextual awareness: </strong>the agent pulls from real-time data and historical inputs to guide decisions.</li><li><strong>Goal orientation: </strong>the agent is designed to work toward specific goals, such as meeting a delivery deadline or reducing resource strain.</li><li><strong>Real-time adaptation: </strong>the agent continuously recalculates based on changes and reconfigures its actions to stay aligned.</li></ul><p><strong>Components of agentic systems</strong></p><ol><li><strong>AI agents:</strong> These are the autonomous workhorses of agentic systems. AI agents perform tasks, make real-time decisions, and adapt based on data, goals, and feedback. They’re digital teammates that can triage tickets, reschedule tasks, or reroute priorities without waiting on a human handoff. AI agents learn from feedback and stored memory so they can continuously improve their performance over time.</li><li><strong>Large language models (LLMs): </strong>This is the reasoning layer that gives agents their brains. LLMs like GPT-5 or Claude allow agents to interpret goals, follow instructions, and communicate in natural language. This ability to understand context, plan next steps, and troubleshoot issues is what separates agentic AI from generative AI.</li><li><strong>Tools: </strong>Agents don’t live in isolation. Agentic workflows are often integrated with multiple data sources, such as project management software, CRM tools, and communication platforms. This integration ensures that agents have access to relevant, up-to-date information to perform tasks efficiently. Standards like the Model Context Protocol (MCP) come into play here, giving AI agents a plug-and-play interface to connect with real-world apps and services.</li><li><strong>Prompt engineering: </strong>This is how we tell agents <em>what</em> to do and <em>how</em> to do it. Effective prompt engineering gives AI agents the guidance they need to interpret complex tasks and execute them accurately. The better the prompt, the better the output, especially in multistep workflows where nuance matters.</li><li><strong>Multi-agent collaboration: </strong>One agent is helpful. Many agents working in sync? That’s powerful. Multi-agent collaboration allows several specialized AI agents, like a scheduling agent, a data analysis agent, and a compliance agent, to work together on larger goals.</li></ol>
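<p>To make these components concrete, here is a minimal, illustrative sketch of an agent loop in Python. Everything in it is a stand-in: llm() fakes the reasoning layer, lookup_order() fakes an external tool, and a real agent would parse tool arguments from the model’s output rather than hardcoding them. The shape of the loop (decide, act, observe, repeat) is the part that matters.</p><pre>def llm(prompt):<br>    # placeholder &#39;reasoning layer&#39;: a real agent would call a hosted LLM here<br>    return &#39;FINISH&#39; if &#39;ships&#39; in prompt else &#39;ACTION: lookup_order&#39;<br><br>def lookup_order(order_id):<br>    # placeholder tool: in practice this would query a CRM or order API<br>    return &#39;Order &#39; + order_id + &#39; ships tomorrow.&#39;<br><br>TOOLS = {&#39;lookup_order&#39;: lookup_order}<br><br>def run_agent(goal, max_steps=5):<br>    context = goal<br>    for _ in range(max_steps):<br>        decision = llm(context)                   # decide<br>        if decision.startswith(&#39;FINISH&#39;):<br>            return context<br>        tool = TOOLS[decision.split(&#39;: &#39;)[1]]     # act<br>        # argument hardcoded for brevity; a real loop would parse it from the LLM output<br>        context = context + &#39; &#39; + tool(&#39;A-1042&#39;)  # observe, then loop<br>    return context<br><br>print(run_agent(&#39;When does order A-1042 ship?&#39;))</pre>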
<h3>Agentic design patterns</h3><p>Agentic workflows aren’t built from scratch every time. They rely on strategic, repeatable execution modes called design patterns. These patterns serve as architectural blueprints for how AI agents behave in complex workflows. Whether it’s triaging customer tickets or managing cross-functional launches, each pattern represents a reliable way to deploy agentic workflows that scale.</p><p>Let’s break down some of the most effective agentic design patterns.</p><h3>Policy-Only Workflows</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qUmfNHGx1MCiYVadAnVZsQ.png" /></figure><h4>How it works</h4><p>The LLM acts as a direct policy model, generating actions or plans without search or feedback loops.</p><blockquote>ReAct gives AI agents the ability to talk themselves through problems, then do something about them. Instead of frontloading a plan or waiting for a complete reflection, this pattern has the AI agent think step by step and take action as needed, toggling between analysis and execution in real time.</blockquote><h4>When to use</h4><p>Best for well-defined tasks with clear action sequences.</p><h4>When not to use</h4><p>Avoid it for tasks with static, predefined answers, where step-by-step reasoning only adds latency, and for tasks that depend heavily on external tools or feedback, which this pattern does not use.</p><h4>Example</h4><p>Typical examples are ReAct for question answering and Plan-and-Solve for math problems.</p><h3>Feedback-Learning Workflows</h3><p>Iteratively improves responses through feedback from self-reflection, tools, the environment, or humans. Incorporates learning loops where agents analyze their performance and refine future attempts.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QSGmGbq5UmN0HzrptyN4wg.png" /></figure><h4>When to use</h4><p>Ideal for tasks requiring continuous improvement and error correction capabilities, like generating code or complex problem solving. It’s also useful when there’s a high cost of mistakes or risk of compounding errors.</p><h4>When not to use</h4><p>Avoid this pattern for simple, single-pass tasks: the extra rounds of critique and revision add cost and latency without improving the result.</p>
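<p>The feedback loop itself is simple to sketch. In the toy Python below, generate() and critique() are placeholders for an LLM call and a critic (self-reflection, a test suite, or a human reviewer); the point is the iterate-until-the-critic-is-satisfied structure, not the stub logic.</p><pre>def generate(task, feedback):<br>    # placeholder generator: a real agent would prompt an LLM with the task plus feedback<br>    if feedback:<br>        return &#39;draft v2 (fixed: &#39; + feedback[-1] + &#39;)&#39;<br>    return &#39;draft v1&#39;<br><br>def critique(draft):<br>    # placeholder critic: could be self-reflection, a test suite, a tool, or a human<br>    return None if &#39;fixed&#39; in draft else &#39;missing error handling&#39;<br><br>def refine(task, max_rounds=3):<br>    feedback = []<br>    draft = generate(task, feedback)<br>    for _ in range(max_rounds):<br>        issue = critique(draft)<br>        if issue is None:                 # nothing left to fix<br>            break<br>        feedback.append(issue)            # learn from the mistake<br>        draft = generate(task, feedback)  # and try again<br>    return draft<br><br>print(refine(&#39;write a CSV parser&#39;))</pre>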
<h3>Workflow Orchestration/Agentic Process Automation</h3><p>Automates complex business processes by orchestrating APIs and tools through LLM-driven workflow generation. This pattern is often enabled via protocols like MCP, which give agents a standardized way to understand and interact with tools in their environment.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*knE6fskiZck-dgv14RnV4w.png" /></figure><h4>When to use</h4><p>Use it to shift from manual Robotic Process Automation to intelligent Agentic Process Automation. It is ideal for enterprise automation, API integration, and dynamic workflow adaptation.</p><h4>When not to use</h4><p>This pattern isn’t necessary when the agent’s internal capabilities are enough to complete a task. Gathering external data may introduce an unnecessary layer of complexity.</p><h4>Example</h4><p>Travel assistants might use this design pattern to call external tools like flight booking APIs when they cannot access live flight information.</p><h3>Multi-Agent Workflow</h3><p>Multi-agent workflows involve multiple specialized agents collaborating with defined roles and coordination mechanisms such as voting, debate, or role-based responsibilities. These workflows can be organized in hierarchical structures or peer-to-peer networks, enabling complex task decomposition, parallel processing, and the integration of diverse agent capabilities for comprehensive problem-solving.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Q9NMOUUDFY76JuKOXrXLeQ.png" /></figure><h4>When to use</h4><p>A key application of this paradigm is Agentic Retrieval-Augmented Generation (RAG), where multiple retrieval agents operate in parallel, each optimized for specific data sources, while a central LLM synthesizes the retrieved information. This approach enhances accuracy, mitigates hallucinations, and effectively manages diverse knowledge bases.</p><h4>When not to use</h4><p>Multi-agent systems are very computationally intensive. In resource-constrained environments, a single-agent architecture may be more pragmatic.</p><h4>Example</h4><p>AI agents in project management can collaborate to accomplish various tasks. One agent could handle task assignments based on team availability, another could monitor timelines and flag risks in real time, and a third agent could generate daily progress summaries.</p>
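<p>That project-management example can be sketched in a few lines of Python. The three agents below are trivial placeholders standing in for LLM-backed specialists; what the sketch shows is the coordination structure, with specialized roles passing shared state through a simple sequential coordinator.</p><pre>def scheduler_agent(state):<br>    # placeholder specialist: assigns tasks based on team availability<br>    state[&#39;assignments&#39;] = &#39;tasks assigned by availability&#39;<br>    return state<br><br>def risk_agent(state):<br>    # placeholder specialist: monitors timelines and flags risks<br>    state[&#39;risks&#39;] = &#39;two deadlines flagged as at risk&#39;<br>    return state<br><br>def reporter_agent(state):<br>    # placeholder specialist: summarizes what the other agents produced<br>    state[&#39;summary&#39;] = state[&#39;assignments&#39;] + &#39;; &#39; + state[&#39;risks&#39;]<br>    return state<br><br># a deliberately simple sequential coordinator; frameworks such as CrewAI<br># or AutoGen layer messaging, voting, and role negotiation on top of this idea<br>CREW = [scheduler_agent, risk_agent, reporter_agent]<br><br>def run_crew(state):<br>    for agent in CREW:<br>        state = agent(state)  # each specialist reads and updates shared state<br>    return state<br><br>print(run_crew({})[&#39;summary&#39;])</pre>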
<h3>Hierarchical Agent Workflows</h3><p>Organizes agents in hierarchical structures with planner agents coordinating specialized executor agents. Enables complex task decomposition, error isolation, and specialized agent optimization while maintaining overall coordination and control.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LOs_kMikzwK-3xf46QRcGQ.png" /></figure><h4>When to use</h4><p>It’s especially useful for agentic workflows that require sequencing, coordination, and adaptability across long-running tasks.</p><h4>When not to use</h4><p>Using it for simple tasks that don’t require detailed planning would be overkill, and because results can vary from run to run, it is not advised for tasks that demand predictable outputs.</p><h4>Example</h4><p>For software development teams, an AI agent might break up a product launch into subtasks like design, development, testing, deployment, and monitoring.</p><h3>Benefits of agentic workflows</h3><p>Agentic workflows come with measurable benefits for teams under pressure to move fast without dropping the ball.</p><p>Benefits include:</p><ul><li>Improved task automation</li><li>Scalability across complex workflows</li><li>Faster decision-making</li><li>Less manual oversight across repetitive tasks</li><li>Better performance tracking via feedback loops</li></ul><h3>Top agentic frameworks to watch in 2025</h3><p><a href="https://www.microsoft.com/en-us/research/project/autogen/"><strong>Microsoft AutoGen</strong></a><strong>: </strong>A framework that makes it easy to design, manage, and observe collaborative agents working in conversation.</p><p><strong>Best for: </strong>Orchestrating multi-agent systems with structured dialogues</p><p><a href="https://www.langchain.com/"><strong>LangChain</strong></a><strong>: </strong>LangChain remains the go-to for building composable, agentic workflows using LLMs and external tools. Its ecosystem supports fast prototyping and real-world deployment.</p><p><strong>Best for:</strong> Developer-friendly, modular AI workflow construction</p><p><a href="https://www.langchain.com/langgraph"><strong>LangGraph</strong></a>: Built on LangChain, it is designed for agents that need persistent context and multistep planning. It is great for use cases like document processing, customer support, or multiturn project workflows that require checkpoints and fallbacks.</p><p><strong>Best for:</strong> Stateful, branching workflows with long-term memory</p><p><a href="https://www.crewai.com/"><strong>CrewAI</strong></a><strong>: </strong>CrewAI lets you assign “roles” to different agents and coordinate their work like a team.</p><p><strong>Best for:</strong> Managing multi-agent collaboration with minimal overhead</p><h3>Final thoughts</h3><p>In conclusion, agentic workflows play a transformative role by automating routine tasks such as routing customer inquiries, updating timelines, and managing administrative processes. This not only reduces the burden of repetitive work but also improves overall efficiency, minimizes delays, and enhances personal productivity. 
As a result, organizations adopting agentic workflows experience greater customer satisfaction and retention, while also benefiting from higher returns on investment due to reduced reliance on constant human oversight.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a3602b19b7e8" width="1" height="1" alt=""><hr><p><a href="https://medium.com/codex/agentic-ai-workflows-design-patterns-examples-and-what-to-watch-in-2025-a3602b19b7e8">Agentic AI Workflows: Design Patterns, Examples, and What to Watch in 2025</a> was originally published in <a href="https://medium.com/codex">CodeX</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[4 Techniques to Handle Imbalanced Datasets]]></title>
            <link>https://medium.com/codex/4-techniques-to-handle-imbalanced-datasets-f0eab38eee3d?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/f0eab38eee3d</guid>
            <category><![CDATA[imbalanced-data]]></category>
            <category><![CDATA[data-processing]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Fri, 08 Aug 2025 11:49:47 GMT</pubDate>
            <atom:updated>2025-09-04T12:00:50.942Z</atom:updated>
            <content:encoded><![CDATA[<p>Plus, how to recognize fool’s gold when you see it</p><figure><img alt="Illustration of Python methods to handle imbalanced datasets." src="https://cdn-images-1.medium.com/max/1024/1*LaWP9kyffgJ-K4Xa-Rc6FA.png" /><figcaption>Techniques to fix imbalanced data using Python.</figcaption></figure><p>Identifying fraud, diagnosing a rare disease, and predicting customer churn with machine learning are all difficult because you need to identify the outlier: the fraudulent credit card purchase, the rare disease, the customer who churns. These cases are challenging to predict correctly due to the lack of examples: Most credit card transactions are normal. Only a few are fraudulent. This uneven mix is known as class imbalance.</p><p>In most real-world datasets, one class appears much more often than the other. Class imbalance is a common challenge in machine learning and predictive modeling: it can lead to misleading model performance and poor predictions — especially for the less frequent class.</p><p>This article discusses the methods you can use to overcome the challenges of imbalanced data and shows how to apply each of them in Python.</p><figure><img alt="A graphical depiction of the class imbalance problem." src="https://cdn-images-1.medium.com/max/1024/0*aTTlqvZHtL_iIvkV.png" /><figcaption>A graphical depiction of the class imbalance problem.</figcaption></figure><h3>Observations after assessing the model</h3><p>As the right side of the graphic shows, the model trained on the imbalanced dataset was highly accurate at predicting the majority class but performed poorly for the minority class. In contrast, when we rebalanced the data, performance for both classes improved significantly.</p><p>Although the overall accuracy on the imbalanced dataset appears significantly higher, this is misleading — a phenomenon often described as “<strong><em>fool’s gold</em></strong>” in the data mining literature.</p><p>In conclusion, we can say that:</p><ul><li>Balanced datasets enable machine learning techniques to yield reasonably high prediction accuracy for both minority and majority classes.</li><li>Imbalanced datasets can cause machine learning models to make poor predictions.</li></ul>
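<p>You can reproduce this “fool’s gold” effect in a few lines of scikit-learn. On a synthetic dataset with a roughly 95:5 class split, the overall accuracy looks impressive while the per-class report exposes the weak minority-class recall:</p><pre>from sklearn.datasets import make_classification<br>from sklearn.linear_model import LogisticRegression<br>from sklearn.metrics import accuracy_score, classification_report<br>from sklearn.model_selection import train_test_split<br><br># synthetic dataset: roughly 95% majority class, 5% minority class<br>X, y = make_classification(n_samples=5000, weights=[0.95], random_state=42)<br>X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)<br><br>model = LogisticRegression(max_iter=1000).fit(X_train, y_train)<br>pred = model.predict(X_test)<br><br>print(accuracy_score(y_test, pred))         # high overall accuracy...<br>print(classification_report(y_test, pred))  # ...but check the minority-class recall</pre>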
<h3>Why do imbalanced datasets cause ML models to make poor predictions?</h3><p>This happens because machine learning algorithms learn from training data in a way that’s similar to how people learn from experience.</p><p>In humans (and likely in many animals), memory is influenced by repetition — experiences we see often create more permanent, vivid memories. That makes them easier to remember and recognize later. Rare experiences, on the other hand, may be overlooked or ignored.</p><p>In the same way, the learning process in machine learning can produce biased predictive models because it focuses primarily on patterns from the majority class while neglecting the specifics of the minority class.</p><h3>Which methods help overcome the imbalanced data problem?</h3><p>There are several ways to deal with imbalanced data and help predictive models perform better. Here we want to look at four.</p><figure><img alt="Taxonomy of methods used to handle the imbalanced data problem" src="https://cdn-images-1.medium.com/max/1024/0*HYvWQcLN2BOlVRxf.png" /><figcaption>Taxonomy of methods used to handle the imbalanced data problem</figcaption></figure><p>While some approaches are more complex than others, they all aim to ensure that the prediction algorithm pays equal attention to the patterns presented by all classes — majority, minority, and everything in between.</p><p>Some methods work by changing the distribution of classes in the training data, while others adjust the learning process by modifying the algorithm or changing the importance (or cost) of mistakes to prioritize accurate prediction of the minority class. (See cost-sensitive methods below.)</p><blockquote>Note: Misclassification cost refers to the consequences assigned to different types of prediction errors. In an imbalanced dataset, misclassifying a rare event (like disease or fraud) can be more serious than misclassifying a common event.</blockquote><p>Let’s have a look at these methods in more detail.</p><h3>#1. Data sampling methods</h3><p>Data sampling methods are among the most widely used techniques in data science and machine learning because they are simple to understand, formulate, and implement.</p><p>These methods can be categorized into two main classes: oversampling and undersampling.</p><h4>Oversampling methods</h4><p>Oversampling methods increase the number of minority class examples in two main ways:</p><ol><li>By replicating existing examples until the number is equal to the majority class. This can be done by simply copying the data or through bootstrapping techniques.</li><li>By synthetically generating new examples that are similar but not identical to the existing minority class samples.</li></ol><p>One well-known technique is SMOTE (Synthetic Minority Oversampling Technique), which uses the k-nearest neighbor algorithm to generate new examples.</p><pre>from imblearn.over_sampling import SMOTE<br><br>smote = SMOTE()<br>X_resampled, y_resampled = smote.fit_resample(X_train, y_train)</pre><p>SMOTE works well with datasets that consist primarily of numerical features, but it performs less well when the dataset has mostly categorical or nominal variables. Various variants of the SMOTE algorithm have been developed to address the shortcomings of the original method.</p><h4>Undersampling methods</h4><p>Undersampling methods keep all the minority class examples and randomly select an equal number of examples from the majority class. This means some of the majority class examples are removed from the training data.</p><p>The random selection can be done with replacement (where the same example might be picked more than once, known as bootstrapping) or without replacement (where each example is only picked once).</p><pre>from imblearn.under_sampling import RandomUnderSampler<br><br>rus = RandomUnderSampler()<br>X_resampled, y_resampled = rus.fit_resample(X_train, y_train)</pre><h3>#2. Cost-sensitive methods</h3><p>One way to handle class imbalance is by focusing on the cost of making mistakes — the cost of misclassifications.</p><p>While data sampling methods try to balance the dataset before the training process, cost-sensitive methods change how the model treats different types of classification errors by adjusting their costs.</p><p>Cost-sensitive methods assign different misclassification costs to various classes based on the degree of imbalance. 
For example, they assign higher costs to misclassification errors involving the minority class, since those errors are often more important.</p><pre>from sklearn.ensemble import RandomForestClassifier<br><br>clf = RandomForestClassifier(class_weight=&#39;balanced&#39;)<br>clf.fit(X_train, y_train)<br><br># or manually define custom weights and pass them in<br>weights = {0: 1, 1: 5}  # higher weight for the minority class<br>clf = RandomForestClassifier(class_weight=weights).fit(X_train, y_train)</pre><p>The goal is to either adjust the classification threshold or assign disproportionate costs to enhance the model’s focus on the minority class.</p><h3>#3. Algorithmic and one-class methods</h3><p>Another group of methods used to handle imbalanced data involves algorithmic adjustments to different classification algorithms. These are known as algorithmic methods.</p><p>While all of these adjusted algorithms aim to reduce the negative impact of imbalanced data, they use different techniques to do so.</p><p>Among the most well-studied approaches in this area are support vector machines (SVMs) and their variants.</p><p>The <strong>one-class method</strong> is another technique used in the machine learning community to tackle class imbalance. The core idea is to focus on just one class at a time during training.</p><pre>from sklearn.svm import OneClassSVM<br><br>model = OneClassSVM(gamma=&#39;auto&#39;).fit(X_train_majority)<br>predictions = model.predict(X_test)</pre><p>In this approach, the training samples consist solely of a single class label (e.g. only positive or only negative samples) so that the model can learn the specific characteristics of that class.</p><pre>from sklearn.ensemble import IsolationForest<br><br>iso = IsolationForest(contamination=0.05)<br>iso.fit(X_train)</pre><p>Because it learns the characteristics of a single class rather than differentiating between multiple classes, this approach is called “<strong>recognition-based</strong>,” while traditional methods are “<strong>discrimination-based</strong>.”</p><p>The goal with one-class methods is to create a model that is finely tuned to the characteristics of one class and can identify anything else as not belonging to that class.</p><p>One-class methods typically rely on three types of strategies:</p><ol><li>density-based characterization</li><li>boundary determination, and</li><li>reconstruction or evolution-based modeling.</li></ol><p>These strategies are similar in concept to clustering techniques like k-means, k-medoids, k-centers, and self-organizing maps.</p><h3>#4. Ensemble methods</h3><p>Ensemble methods have recently emerged as a popular and effective way to handle imbalanced data.</p><p>Unlike single prediction models, ensemble methods combine the predictions of multiple models. 
These can be the same type of model (called homogeneous ensembles) or different types (called heterogeneous ensembles).</p><p>Variants of both bagging and boosting have been proposed to deal with class imbalance issues.</p><ul><li>In bagging, the data is sampled in a way that gives more attention to the minority class.</li></ul><pre>from imblearn.ensemble import BalancedBaggingClassifier<br>from sklearn.tree import DecisionTreeClassifier<br><br># note: recent imbalanced-learn versions use estimator=;<br># older releases used base_estimator=<br>model = BalancedBaggingClassifier(<br>    estimator=DecisionTreeClassifier(),<br>    sampling_strategy=&#39;auto&#39;,<br>    replacement=False<br>)<br>model.fit(X_train, y_train)</pre><ul><li>In boosting, the model increases the weight of minority class examples to help improve their prediction.</li></ul><pre>from xgboost import XGBClassifier<br><br># rule of thumb: scale_pos_weight = n_negative / n_positive<br>model = XGBClassifier(scale_pos_weight=10)</pre><blockquote>Another approach being explored as a potential solution to the class imbalance problem is active learning. Active learning is a methodology that learns iteratively in a piecewise manner. It focuses on the most useful data at each stage to better handle class imbalance.</blockquote><h3>Looking ahead: The ongoing search for better solutions</h3><p>Despite numerous efforts to overcome the class imbalance issue in the machine learning community, the current state-of-the-art approaches are limited to heuristic solutions and ad hoc methodologies.</p><p>There is still a lack of universally accepted theories, methodologies, and best practices. While many studies claim to have developed data balancing methods that improve prediction accuracy for the minority class, a significant number of them also conclude that these methodologies may degrade prediction accuracy for the majority class and overall classification accuracy.</p><p>Ongoing research seeks to address questions such as, “Can a universal methodology yield better prediction results?” or “Can there be an algorithm that prescribes the best data balancing technique for a specific machine learning approach and the data available?”</p><p><strong><em>Enjoyed this post?</em></strong> Check out my other articles for more insights on machine learning, data science, and Python programming.</p><ul><li><a href="https://medium.com/@pgshanding/building-trust-in-ai-aa063e144cbf">Building Trust in AI: Five Pillars of Ethical AI</a></li><li><a href="https://medium.com/@pgshanding/advanced-prompting-techniques-for-enhanced-ai-performance-3c8814ceecf7">Advanced Prompting Techniques for Enhanced AI Performance</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f0eab38eee3d" width="1" height="1" alt=""><hr><p><a href="https://medium.com/codex/4-techniques-to-handle-imbalanced-datasets-f0eab38eee3d">4 Techniques to Handle Imbalanced Datasets</a> was originally published in <a href="https://medium.com/codex">CodeX</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Supervised Fine-Tuning (SFT) vs. Retrieval-Augmented Generation (RAG)]]></title>
            <link>https://medium.com/@pgshanding/supervised-fine-tuning-sft-vs-retrieval-augmented-generation-rag-c8e67295ceba?source=rss-83d30594ec28------2</link>
            <guid isPermaLink="false">https://medium.com/p/c8e67295ceba</guid>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[supervised-fine-tuning]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Shanding P. G]]></dc:creator>
            <pubDate>Mon, 26 May 2025 16:45:46 GMT</pubDate>
            <atom:updated>2025-05-26T16:45:46.195Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="SFT vs RAG" src="https://cdn-images-1.medium.com/max/1024/1*a7Xt3urzNqMpXh4nW8X08A.png" /></figure><h3>Introduction</h3><p>Large Language Models (LLMs) have revolutionized natural language processing (NLP) by enabling machines to generate, understand, and interact with human language at unprecedented levels. However, to optimize their performance for specific tasks or domains, these models often require further enhancement. Two widely adopted strategies for this are <strong>Supervised Fine-Tuning (SFT)</strong> and <strong>Retrieval-Augmented Generation (RAG)</strong>. While both approaches enhance the capabilities of LLMs, they differ significantly in methodology, data needs, and use cases. This article explores both techniques in depth and offers guidance on when to apply each.</p><h3>What is Supervised Fine-Tuning (SFT)?</h3><blockquote>Supervised Fine-Tuning (SFT) refers to the process of adapting a pre-trained language model to a specific task using a labeled dataset. It involves continuing the training of a model on domain-specific examples with known input-output pairs.</blockquote><h3>How SFT Works</h3><ul><li><strong>Pretrained Base Model</strong>: Start with a general-purpose LLM (e.g., GPT, BERT).</li><li><strong>Labeled Dataset</strong>: Use a curated dataset containing input-output pairs.</li><li><strong>Training</strong>: Adjust the model’s parameters to minimize prediction errors on the training data.</li><li><strong>Deployment</strong>: Deploy the fine-tuned model for the specific downstream task.</li></ul><h3>Ideal Scenarios for SFT</h3><ul><li>When high-quality labeled data is available.</li><li>For tasks with clear objectives, such as sentiment analysis, summarization, or named entity recognition.</li><li>In closed-domain settings where the scope of information is well-defined.</li></ul><h3>What is Retrieval-Augmented Generation (RAG)?</h3><blockquote>Retrieval-Augmented Generation (RAG) is an architecture that enhances language models by incorporating an external retrieval mechanism. Instead of relying solely on pre-trained knowledge, RAG fetches relevant information from a large corpus in real time and incorporates it into its responses.</blockquote><h3>How RAG Works</h3><ul><li><strong>Retriever Module</strong>: Searches an external corpus (e.g., Wikipedia, private documents) to find contextually relevant content based on the user query.</li><li><strong>Reader (Generator) Module</strong>: A language model processes the query along with the retrieved documents to produce an informed response.</li><li><strong>Integrated Pipeline</strong>: The retriever and reader operate together in an end-to-end workflow.</li></ul><h3>Ideal Scenarios for RAG</h3><ul><li>In open-domain tasks that require up-to-date or expansive domain-specific knowledge.</li><li>When labeled data is limited, but large volumes of unstructured data are accessible.</li><li>For applications like question answering, dynamic customer support, and research assistance.</li></ul><h3>SFT vs. RAG: A Comparative Analysis</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*K9aO1a7lOhMEX_VY7sxDnw.png" /></figure><figure><img alt="SFT vs RAG: Comparative analysis" src="https://cdn-images-1.medium.com/max/953/1*Yy8oiYcPU9rfRK0SLQuRgQ.png" /><figcaption>Comparative analysis of SFT and RAG</figcaption></figure><h3>Complementary Use</h3><p>SFT and RAG can work in tandem. 
<h3>SFT vs. RAG: A Comparative Analysis</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*K9aO1a7lOhMEX_VY7sxDnw.png" /></figure><figure><img alt="SFT vs RAG: Comparative analysis" src="https://cdn-images-1.medium.com/max/953/1*Yy8oiYcPU9rfRK0SLQuRgQ.png" /><figcaption>Comparative analysis of SFT and RAG</figcaption></figure><h3>Complementary Use</h3><p>SFT and RAG can work in tandem. For example, a model can be fine-tuned via SFT to adopt a desired tone or structure, while RAG provides access to dynamic, up-to-date knowledge. This hybrid approach blends precision with flexibility.</p><h3>Pros and Cons</h3><h3>Supervised Fine-Tuning (SFT)</h3><p><strong>Pros:</strong></p><ul><li>Delivers high accuracy on well-defined tasks</li><li>Allows customization for tone and format</li><li>Produces consistent and predictable results</li></ul><p><strong>Cons:</strong></p><ul><li>Requires significant time and resources for training</li><li>Dependent on the availability of labeled data</li><li>Poor adaptability to unseen or evolving queries</li></ul><h3>Retrieval-Augmented Generation (RAG)</h3><p><strong>Pros:</strong></p><ul><li>Provides access to current and domain-specific knowledge</li><li>Requires minimal labeled data</li><li>Adapts easily across multiple tasks and domains</li></ul><p><strong>Cons:</strong></p><ul><li>Involves a more complex system architecture</li><li>Inference may be slower due to retrieval overhead</li><li>Response quality depends on the relevance of retrieved documents</li></ul><h3>Use Cases</h3><h3>When to Use SFT</h3><ul><li><strong>Customer Feedback Classification</strong>: Fine-tune on labeled feedback data for sentiment analysis.</li><li><strong>Legal Document Summarization</strong>: Train a model using summaries written by legal professionals.</li><li><strong>Healthcare Chatbots</strong>: Customize a model based on medical conversations reviewed by experts.</li></ul><h3>When to Use RAG</h3><ul><li><strong>Customer Support Chatbots</strong>: Retrieve the latest policy documents to handle varied queries.</li><li><strong>Academic Research Assistants</strong>: Retrieve and summarize relevant scholarly articles.</li><li><strong>Enterprise Knowledge Management</strong>: Enable staff to query internal documentation without model retraining.</li></ul><h3>Conclusion</h3><p>Supervised Fine-Tuning (SFT) and Retrieval-Augmented Generation (RAG) offer distinct advantages for enhancing language models. SFT excels in scenarios with ample labeled data and clearly defined tasks, delivering high accuracy and predictability. RAG, by contrast, thrives in open-ended, knowledge-intensive applications where flexibility and access to real-time information are essential.</p><p>Choosing between SFT and RAG depends on your goals, data availability, and operational context. In many situations, a combination of both — using SFT for structure and RAG for content — yields optimal performance. By understanding each approach’s strengths and trade-offs, practitioners can design robust, efficient, and intelligent NLP systems tailored to their needs.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c8e67295ceba" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>