DZone
DZone Spotlight

Thursday, February 5
Building SRE Error Budgets for AI/ML Workloads: A Practical Framework

By Varun Kumar Reddy Gajjala
Here's a problem I've seen happen far too often: your recommendation system is functioning, spitting out results in milliseconds, and meeting all its infrastructure SLAs. Everything looks rosy on the dashboards. Yet engagement has plummeted by 40% because your model has been useless for several weeks. According to your traditional error budget? You're golden. According to your product team? The system is broken.

ML systems fail in ways that classical SRE practices never accounted for. A model does not "go down"; it gradually deteriorates. Data pipelines can be "working" while feeding garbage to the model. And you won't even realize it until users start to complain or, worse, quietly depart. The past few years spent breaking and fixing ML systems have taught me that we need a paradigm shift in our error budgets. Here's how it works.

Understanding the Limitations of Conventional Error Budgets

The challenge is that "reliability" in ML does not live on a one-dimensional spectrum. Your API could be functioning correctly even if your model is not working. Your model could be working correctly even if your data pipeline is feeding it stale features. Your aggregate numbers could look great even while you're treating some users unfairly. What I've found is that you need to break reliability down into four different error budgets.

Mapping These to Actual Error Budgets

Before delving into each dimension, let me clarify how these map onto conventional SRE error budgets — these are real budgets, not merely health checks. For each dimension, you need:

• SLI (service level indicator): What you're measuring
• SLO (service level objective): Your target over time
• Error budget: How much you can miss the SLO before you take action

Here's what model quality means with concrete examples:

SLI: Accuracy of the model compared with the baseline, measured hourly
SLO: Accuracy ≥ 92% of baseline over a rolling 7 days
Error budget: 8% allowable degradation over 7 days
Burn rate: Monitored hourly; warn when burning above 10% of the budget per day

The main difference versus a traditional error budget is that you're measuring degradation relative to a known-good state rather than just measuring success or failure. The math is exactly the same in both cases — a time budget that gets spent whenever you don't meet your SLO.
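To make that math concrete, here is a minimal sketch of how you might track burn rate against a rolling window. The 92% threshold and hourly cadence mirror the example above; the function and variable names are my own illustration, not part of any specific tooling.

Python
from collections import deque

WINDOW_HOURS = 7 * 24                       # rolling 7-day window, measured hourly
SLO_RATIO = 0.92                            # accuracy must stay >= 92% of baseline
ERROR_BUDGET_HOURS = WINDOW_HOURS * 0.08    # 8% of the window may violate the SLO

measurements = deque(maxlen=WINDOW_HOURS)

def record_hour(current_accuracy: float, baseline_accuracy: float) -> dict:
    """Record one hourly measurement and report budget consumption."""
    meets_slo = current_accuracy >= SLO_RATIO * baseline_accuracy
    measurements.append(meets_slo)

    bad_hours = sum(1 for ok in measurements if not ok)
    budget_spent = bad_hours / ERROR_BUDGET_HOURS                      # 1.0 means fully spent
    # Fraction of the budget consumed per day at the window-average failure rate
    daily_burn = (bad_hours / len(measurements)) * 24 / ERROR_BUDGET_HOURS

    return {
        "meets_slo": meets_slo,
        "budget_spent": budget_spent,
        "alert": daily_burn > 0.10,   # warn above 10% of budget per day
    }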
Now, let's consider each dimension one by one.

1. Infrastructure Error Budget

These are your standard SRE metrics: uptime, latency, and request success rate. It's old news, but you should have this as your baseline. What I monitor: 99.95% availability, sub-150ms latency at p95, 99.9% request success rate.

2. Model Quality Error Budget

This is where it gets interesting. You must specify at what point model degradation should start making noise.

What I track:
• Model accuracy vs. baseline accuracy (typically up to 8% loss allowed)
• Percentage of low-confidence predictions
• Feature drift detected via statistical tests

Here's how I determine degradation:

Python
# Compare current performance with your own benchmark
accuracy_degradation = (baseline_accuracy - current_accuracy) / baseline_accuracy
budget_burn_rate = accuracy_degradation / acceptable_degradation

Real example: accuracy decreased from 95% to 93%; my threshold is 8%.

For drift detection, I use the Kolmogorov-Smirnov test:

Python
# Check whether the distribution of a feature has changed
from scipy.stats import ks_2samp
statistic, p_value = ks_2samp(baseline_features, current_features)
drift_alert = p_value < 0.05

One thing that bit me: tie your model accuracy metrics to business metrics. Rather than accuracy percentages, track something your PM cares about — for example, "click-through rate stays within 95% of baseline."

3. Data Quality Error Budget

Garbage in, garbage out. For ML systems, however, "garbage" needs a different definition.

What matters:
• Feature completeness score (my target is 99%+)
• Feature freshness (how many features are stale?)
• Schema violations

Simple quality check:

Python
def simple_quality_check(features):
    # missing_features, stale_features, and total_features are counts computed upstream
    missing_rate = missing_features / total_features
    stale_rate = stale_features / total_features
    data_quality_score = min(1 - missing_rate, 1 - stale_rate)
    meets_sli = data_quality_score > 0.99

Traditional data pipelines only cared about having a correct schema. With machine learning, you also need to ensure that your features are fresh enough and that your distributions look reasonably regular. I've been burned by pipelines that "worked" but passed day-old data, making our model irrelevant.

4. Fairness Error Budget

Depending on your domain, fairness may be desirable or mandatory. Either way, it should be tracked.

What I monitor:
• Differences in accuracy across demographic groups (I keep this under 5%)
• False positive rate parity across segments

To calculate disparate impact:

Python
# Determine disparate impact
group_A_rate = predictions[group == 'A'].mean()
group_B_rate = predictions[group == 'B'].mean()
disparity = abs(group_A_rate - group_B_rate)
violation = disparity > 0.05  # flag if over 5%

There is no such dimension in traditional SRE because a traditional system is not involved in decisions about people. But as soon as your machine learning system starts approving loans or ranking job candidates, you need to know whether it is treating people fairly.

Critical Caveats

Fairness metrics are extremely domain-specific and legally complex. The metrics presented here are only examples, and demographic parity is not necessarily the right goal for every problem you want to solve. Before using fairness budgets:

• Discuss with legal counsel how fairness is defined in your regulatory environment
• Coordinate with the product and policy teams on the acceptable tradeoffs
• Confirm that you have the right to retain, process, or use sensitive attributes for monitoring purposes
• Do not use simplistic parity checks as the sole indicators of fairness

In regulated industries such as finance, healthcare, or hiring, you need expertise that goes beyond the capabilities of any framework.

How to Actually Implement This

Step 1: Determine How Reliability Applies in Your Business

Don't begin with metrics in mind. Begin with conversations instead.
"What is a broken model in the eyes of my PM?" "What will make my users grumble?" For an ML-driven search functionality, you can choose: Infrastructure: Less than 200 ms (p95)Model quality: Relevance scores greater than 0.85 relative to human assessorsData quality: Less than 1% of queries missing critical featuresFairness: Search diversity preserved when considering different user categories Step 2: Establish Your Baseline Run your system in a stable state for 30 days. Observe what "good" looks like. Python # Calculate your baseline during a stable period baseline = { 'accuracy': np.percentile(stable_metrics['accuracy'], 50), 'p95_latency': np.percentile(stable_metrics['latency'], 95), 'drift_threshold': calculate_drift_threshold(stable_features) } This becomes your north star. All else shall be measured from that. Step 3: Define Ownership This is crucial. Each dimension must have a "clear owner" to make decisions and take actions: Infrastructure budget → SRE owns: • Right to suspend deployments • Authority to reverse modifications • Infrastructure scaling authority Model quality budget → ML engineering owns: • Authority for triggering retraining • Authority to roll back to previous model version • Power to increase monitoring frequency Data quality budget → data engineering owns: • Power to halt data pipelines • Authority to enable fallback data sources • Right to disregard upstream data Fairness budget → ML + product + legal own together: • Needs a multi-stakeholder decision for any actions • Product evaluates business impact • Legal specifies compliance requirements • ML applies technical solutions If the budget constraints are conflicting, such that model quality is satisfactory, but fairness is violated, then the more constraining budget prevails. If you have depleted your fairness budget, you cannot just rely on your predictions for satisfactory accuracy. Step 4: Monitor Everything Establish dashboards to measure all four key dimensions. Here's how I calculate the composite health factors: Python # Current health across dimensions dimensions = { 'infrastructure': 0.95, # meeting 95% of SLO 'model_quality': 0.88, # at 88% of baseline 'data_quality': 0.98, 'fairness': 0.96 } # Weight them according to what is important to your business weights = { 'infrastructure': 0.3, 'model_quality': 0.35, 'data_quality': 0.2, 'fairness': 0.15 } composite_score = sum(dimensions[d] * weights[d] for d in dimensions) Critical note: The composite score is solely for executive visibility. Hard enforcement always happens on a per-dimension basis. Having a 90% composite score does not supersede a violation in any dimension. You are in violation if you blow your fairness budget. Step 5: Know What to Do When Budgets Blow Up This list should be recorded prior to having a situation on your hands: Infrastructure budget spent out: Stop deployments, undo changes made, see if scale is requiredModel quality budget used up: Kick off the retraining process, think about reverting to the former model version, and look at what changed in your datasetData Quality budget exhausted: Check your upstream data sources, validate your ETL pipeline, turn on feature fallbacks if you have themFairness budget used up: If it's bad, then simply stop making predictions for those subgroups. Don't release it to society until you figure out where you introduced unfair bias and retrain. A Real Example: Fraud Detection Let me illustrate what I mean with a system for preventing fraud that I built for a fintech company. 
A Real Example: Fraud Detection

Let me illustrate what I mean with a fraud prevention system I built for a fintech company.

Our error budgets:
• Infrastructure: 99.99% uptime, under 100 ms at p95
• Model quality: Precision above 95%, recall above 90%, false positive rate below 2%
• Data quality: 99.5%+ feature completeness, <1% stale features
• Fairness: FPR differences across merchant types <3%

Here's what our monitoring code looked like:

Python
# Validate the health of each batch of predictions
def check_fraud_detection_health(predictions, features, ground_truth):
    # Did model quality degrade?
    current_precision = precision_score(ground_truth, predictions)
    precision_violation = (baseline - current_precision) / baseline > 0.02

    # Are features getting stale?
    stale_rate = features[features['age_hours'] > 24].shape[0] / len(features)
    data_violation = stale_rate > 0.01

    # Fairness issues across merchant categories?
    fprs = calculate_fpr_by_category(predictions, ground_truth)
    fairness_violation = max(fprs.values()) - min(fprs.values()) > 0.03

    return any([precision_violation, data_violation, fairness_violation])

The interesting part: all of these dimensions are checked on every prediction batch. That helps you detect issues early, because data quality problems often become evident before they affect model performance.

A Few Things I've Learned

Use Rolling Windows for Time-Based Budgets

Monthly budgets don't work well for ML. You may have a bad week while you're retraining your model, but that shouldn't burn the rest of the month's budget. I use 7-day rolling windows instead — still time budgets, but with a sliding window.

Python
from collections import deque

# Measurements deque with maxlen of 7 days * 24 hours
measurements = deque(maxlen=168)
measurements.append({'timestamp': now, 'accuracy': current_accuracy})
avg_accuracy = sum(m['accuracy'] for m in measurements) / len(measurements)
budget_ok = avg_accuracy >= target_accuracy

This gives you some buffer to recover from transient problems without declaring bankruptcy for the month. You're still measuring reliability over time (the point of error budgets), but the window slides smoothly rather than resetting each month.

Budget According to What Is Happening

During a large product rollout, I'll tighten model quality budgets (we can't have the model embarrassing us during peak traffic) while relaxing latency requirements slightly. It's fine to adjust budgets based on context — just record the reasoning behind each adjustment as it happens.

Be Alert for Cascading Failures

"Garbage in, garbage out" applies here, too: bad data input leads to bad model output, which in turn results in more retries and fallbacks, and therefore more load on the infrastructure. This is where per-dimension budgets come in handy, because they let you zero in on where the problem actually started.

Wrapping Up

Conventional error budgets account for infrastructure failures: servers becoming unavailable, requests timing out. They fail to account for the ways ML fails: model drift, pipelines serving stale features, and predictions biased against particular user segments. This framework catches those failures early. By monitoring model quality degradation over time, you address the issue before it affects users. By monitoring data freshness, you catch pipeline failures before they reach your predictions. By monitoring fairness, you identify bias before it turns into a compliance issue.
The actual reliability gains come from three sources:

• Earlier detection: You spot degradation trends before outages
• Root-cause clarity: When quality goes down, you know whether it's the infrastructure or the data
• Clear accountability: Every dimension has a clear owner with clear authority to act

Start with the infrastructure and model quality budgets. Get comfortable tracking a baseline and calculating burn rates. Once you're comfortable with that, add data quality tracking. Leave fairness tracking for last — it's the most complex dimension and the most domain-dependent.

Your specific metrics will differ from mine. A recommendation system can tolerate more variation in accuracy than a fraud detection system. But the model of four dimensions, time-based budgets, and clearly stated ownership has held up across the ML systems I've worked on.

The aim is not to prevent all model deterioration. It is to detect it, understand why it happens, and have the authority to correct it before it shatters user trust.
AI-Powered Spring Boot Concurrency: Virtual Threads in Practice

By Lavi Kumar
Modern microservices face a common challenge: managing multiple tasks simultaneously without putting too much pressure on downstream systems. Tuning traditional thread pools often involves a lot of guesswork, which usually doesn't hold up in real-world situations. However, with the arrival of virtual threads in Java 21 and the growth of AI-powered engineering tools, we can create smart concurrency adapters that scale in a safe and intelligent way. This article provides a step-by-step guide to a practical proof of concept using Spring Boot that employs AI (OpenAI/Gemini) to assist in runtime concurrency decisions. It also integrates virtual threads and bulkheads to balance throughput against the safety of downstream systems.

Why Concurrency Decisions Need Intelligence, Not Just Thread Pools

Spring Boot microservices often execute parallel fan-out, meaning they make several downstream calls for each incoming HTTP request. In the past, developers adjusted the following based on gut feeling:

• Thread pools
• Executor settings
• Bulkheads and timeouts

This approach is fragile when traffic, latency, or downstream variability changes. Even with virtual threads removing strict limits on thread counts, services still need protections to avoid:

• Overloaded databases
• Thread scheduling conflicts
• Retry storms
• Poor tail latency

This is where AI can assist by offering contextual suggestions instead of fixed configurations.

Solution Summary

Our proof of concept includes three key elements:

• Spring Boot with virtual threads enabled. This uses Java 21's lightweight threads to prevent blocking I/O from overwhelming the server.
• An AI-driven concurrency advisor. This is a modular component that calls either OpenAI-compatible endpoints or Google's Gemini to suggest a maximum concurrency limit (maximum concurrent requests).
• A bulkhead pattern implemented with semaphores. This guarantees that only the recommended number of tasks operate at the same time.

The objective: allow AI to assist in identifying the concurrency level that a specific workload can handle.

Architecture

Here's how a request flows:

1. The client calls /api/aggregate?fanout=20&forceAi=true.
2. The controller sends the fan-out information to the AI concurrency advisor.
3. The advisor uses either the AI provider or a heuristic fallback.
4. It returns a JSON object containing maxConcurrency.
5. A semaphore bulkhead is established.
6. Tasks are processed on virtual threads.
7. Responses are gathered and sent back.

The advisor does not run threads — it merely suggests limits.

Implementation Details

Enabling Virtual Threads

The application.yml configuration in Spring Boot enables virtual threads:

YAML
spring:
  threads:
    virtual:
      enabled: true

This ensures the framework handles request processing and asynchronous tasks on virtual threads by default.

AI Concurrency Advisor

We define an AiConcurrencyAdvisor interface. Concrete implementations include:

• OpenAI client
• Gemini client
• Heuristic fallback

Sample JSON prompt used in the OpenAI client:

JSON
{
  "model": "gpt-4.1-mini",
  "temperature": 0.1,
  "messages": [
    {"role": "system", "content": "You are a senior JVM performance engineer…"},
    {"role": "user", "content": "Operation: aggregate\nFanoutRequested: 50…"}
  ]
}

The service parses the JSON returned by the model and extracts a safe maxConcurrency value.
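The heuristic fallback itself isn't shown in the article. Here is a minimal sketch of what one might look like, assuming the AdvisorInput record shown later and an AdvisorDecision record with a maxConcurrency component plus a rationale string; treat the sizing formula as an illustration rather than the project's actual logic.

Java
// Hypothetical heuristic fallback: sizes concurrency from CPU cores and expected latency.
public class HeuristicConcurrencyAdvisor implements AiConcurrencyAdvisor {

    @Override
    public AdvisorDecision recommend(AdvisorInput input, boolean forceAi) {
        // Allow more headroom when downstream latency is high (I/O-bound work),
        // but never exceed the requested fan-out and never drop below 1.
        long latencyMs = Math.max(input.expectedDownstreamLatencyMs(), 1);
        int ioHeadroom = (int) Math.min(4L * input.cpuCores(), latencyMs / 10 + input.cpuCores());
        int maxConcurrency = Math.max(1, Math.min(ioHeadroom, input.fanoutRequested()));
        return new AdvisorDecision(maxConcurrency, "heuristic fallback");
    }
}

Because it implements the same interface, the application can drop down to this advisor whenever no API key is configured or the AI call fails.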
Bulkhead With Semaphore

After a recommendation is received:

Java
Semaphore semaphore = new Semaphore(maxConcurrency);

Each downstream task acquires a permit before executing. This guarantees that only the recommended number of tasks operate at the same time — even with an unlimited number of virtual threads.

Key Code Snippets

AI Advisor Interface

This abstraction makes AI optional, interchangeable, and safe.

Java
public interface AiConcurrencyAdvisor {
    AdvisorDecision recommend(AdvisorInput input, boolean forceAi);
}

• Separates AI logic from business logic
• Enables switching between Gemini, OpenAI, or a heuristic fallback
• Keeps concurrency decisions testable and auditable

Advisor Input Model

The quality of AI decisions depends on the context you give.

Java
public record AdvisorInput(
    String operation,
    int fanoutRequested,
    long expectedDownstreamLatencyMs,
    int cpuCores,
    Map<String, Object> hints
) {}

Rather than guessing concurrency limits, we supply:

• Fan-out size
• Latency expectations
• CPU capacity
• Workload hints

This mirrors how a senior engineer would reason about concurrency.

AI Decision Sanitization

Even AI recommendations must be constrained.

Java
int maxConcurrency = Math.max(1, Math.min(decision.maxConcurrency(), fanout));

• Stops uncontrolled concurrency
• Safeguards downstream systems
• Guarantees AI output adheres to system rules

AI provides advice — the system makes the decision.

Service fan-out logic:

Java
try (ExecutorService vtExecutor = Executors.newVirtualThreadPerTaskExecutor()) {
    List<CompletableFuture<DownstreamResponse>> futures = new ArrayList<>(fanout);
    AtomicInteger idx = new AtomicInteger(0);
    for (int i = 0; i < fanout; i++) {
        futures.add(CompletableFuture.supplyAsync(() -> {
            boolean acquired = false;
            try {
                semaphore.acquire();
                acquired = true;
                int n = idx.incrementAndGet();
                return downstream.call("ds-" + n, id, latencyMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return new DownstreamResponse("interrupted", "INTERRUPTED", 0);
            } finally {
                if (acquired) semaphore.release();
            }
        }, vtExecutor).orTimeout(3, TimeUnit.SECONDS));
    }
    // The futures are then joined and the responses collected (omitted here).
}

This approach combines virtual threads with a bulkhead, allowing blocking calls to scale safely.

Running the Project

Set optional environment variables for the AI provider, then execute:

Shell
./gradlew bootRun

Test the endpoint:

Shell
curl "http://localhost:8080/api/aggregate?id=123&fanout=20"

Add &forceAi=true to force AI usage even if no key is set.

When to Use AI-Driven Concurrency

This approach is particularly beneficial when:

• Downstream behavior is variable
• Latency patterns are uncertain
• Manual tuning is expensive
• You need explicit backpressure decisions

AI suggestions must always be bounded and checked against heuristics so the system remains safe even when LLM responses are surprising.

Conclusion

This proof of concept shows how AI (Gemini/OpenAI) can help with Spring Boot concurrency design. It does not replace human judgment but provides contextual recommendations based on workload characteristics. When paired with Java 21 virtual threads, this method allows for scalable, safe, and observable microservices.

Trend Report

Database Systems

Every organization is now in the business of data, but they must keep up as database capabilities and the purposes they serve continue to evolve. Systems once defined by rows and tables now span regions and clouds, requiring a balance between transactional speed and analytical depth, as well as integration of relational, document, and vector models into a single, multi-model design. At the same time, AI has become both a consumer and a partner that embeds meaning into queries while optimizing the very systems that execute them. These transformations blur the lines between transactional and analytical, centralized and distributed, human driven and machine assisted. Amidst all this change, databases must still meet what are now considered baseline expectations: scalability, flexibility, security and compliance, observability, and automation. With the stakes higher than ever, it is clear that for organizations to adapt and grow successfully, databases must be hardened for resilience, performance, and intelligence. In the 2025 Database Systems Trend Report, DZone takes a pulse check on database adoption and innovation, ecosystem trends, tool usage, strategies, and more — all with the goal for practitioners and leaders alike to reorient our collective understanding of how old models and new paradigms are converging to define what’s next for data management and storage.

Refcard #397

Secrets Management Core Practices

By Apostolos Giannakidis
Refcard #375

Cloud-Native Application Security Patterns and Anti-Patterns

By Samir Behara
More Articles

Building Resilient Industrial AI: A Developer’s Guide to Multi-ERP RAG

The Integration Reality

When someone says "AI agent for supply chain," it's tempting to think first about prompts and context windows. But in real enterprises, the hard part isn't generating text — it's surviving the integration reality. Engineers in manufacturing inherit many systems with multiple issues: ERP sprawl across regions, unstructured truth hidden in emails, text files, spreadsheets, and notes, and complex data lineage where SKUs vary by region. Even when leadership wants "one version of the truth," we inherit system boundaries that were never designed to reconcile cleanly. This playbook focuses on how to architect resilient industrial AI agents in that environment — hybrid by design, grounded in evidence via retrieval, and locked down with guardrails.

The Architecture: Hybrid and Local RAG

In consumer demos, an agent often means a single model calling APIs. In industrial settings, an agent is closer to a controlled distributed system. We cannot just upload all ERP data to a vector database; latency, data sovereignty, and sheer volume make that impossible. Instead, we use a hybrid RAG pattern:

• Cloud control plane: Handles orchestration, user intent, and tool routing.
• On-prem regional data plane: Keeps the heavy and sensitive data local, exposing only specific retrieval endpoints and connectors to the cloud agent.

Step 1: Define a Canonical Model

Indexing raw ERP fields from multiple sources will fail due to semantic drift. A field called "date" in one ERP system might mean "ship date," while another system stores the same concept in a field called "Ship_Dt." So, before indexing anything, define a canonical entity model. This isn't just documentation; it's a data contract that acts as a reference layer between the ERPs and the logic of the LLM. Here is a Python example using dataclasses to enforce a contract that normalizes disparate ERP data:

Python
from dataclasses import dataclass
from datetime import datetime
import pandas as pd

@dataclass(frozen=True)
class InventoryPosition:
    """The canonical model: the single source of truth for the agent."""
    sku: str
    site: str
    as_of_utc: datetime
    on_hand_qty: float
    source_system: str
    lineage_event_id: str

def normalize_sap_inventory(sap_payload: dict) -> InventoryPosition:
    """
    Adapter: converts raw SAP output into our canonical model.
    Prevents SAP-specific jargon (MATNR, WERKS) from leaking into the LLM context.
    """
    return InventoryPosition(
        sku=sap_payload.get("MATNR"),    # Material Number
        site=sap_payload.get("WERKS"),   # Plant Code
        # Crucial: force UTC conversion to prevent 'time travel' bugs across timezones
        as_of_utc=pd.to_datetime(sap_payload["TIMESTAMP"]).tz_convert("UTC"),
        on_hand_qty=float(sap_payload.get("LABST", 0.0)),
        source_system="SAP_EU_NORTH",
        lineage_event_id=sap_payload.get("TRACE_ID")
    )

This ensures that, regardless of whether the data came from SAP or Oracle, the agent always reasons about an InventoryPosition.
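To show the multi-ERP point, here is what a second adapter might look like next to the SAP one above. The Oracle column names below are purely illustrative placeholders, not the article's actual schema.

Python
def normalize_oracle_inventory(oracle_row: dict) -> InventoryPosition:
    """Hypothetical adapter: maps illustrative Oracle column names onto the same canonical model."""
    return InventoryPosition(
        sku=oracle_row.get("ITEM_CODE"),       # illustrative column name
        site=oracle_row.get("ORG_ID"),         # illustrative column name
        as_of_utc=pd.to_datetime(oracle_row["LAST_UPDATE"], utc=True),
        on_hand_qty=float(oracle_row.get("QTY_ON_HAND", 0.0)),
        source_system="ORACLE_NA_EAST",
        lineage_event_id=oracle_row.get("AUDIT_ID")
    )

Downstream retrieval and reasoning code never needs to know which system produced the record.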
Step 2: Build Safety Nets, Not Just Scripts

Connecting to an old ERP system isn't just about plugging in a wire; it's about managing chaos. These systems get slow, they crash, and they get confused easily. Don't just write a quick script and hope for the best. You need to build safeguards. Here is what that looks like in practice:

• The "click once" rule (idempotency): Agents often retry requests. Enforce unique IDs so that a failed API call doesn't result in duplicate orders.
• Surge protection (circuit breakers): Agents can trigger dozens of parallel calls instantly. Use circuit breakers to pause during spikes, preventing the agent from unintentionally causing DDoS issues within legacy servers.
• The "fix it later" pile (dead-letter queues): Don't let data syncs fail silently. Route logic errors to human review to reconcile the gap between the agent's intent and the ERP's reality.

Python
class ResilientERPClient:
    def execute_safe_transaction(self, tx_id: str, payload: dict):
        """
        Wraps legacy ERP calls with modern distributed-system safeguards.
        """
        # 1. IDEMPOTENCY (the "click once" rule)
        # Check if we have already processed this specific transaction ID.
        # If yes, return the cached result to prevent duplicate orders.
        if self.cache.exists(tx_id):
            return self.cache.get(tx_id)

        try:
            # 2. CIRCUIT BREAKER (surge protection)
            # If the ERP is failing or slow, this context manager raises
            # a CircuitOpenError immediately, preventing a DDoS.
            with self.circuit_breaker.guard():
                result = self.erp_api.post(payload)
                # On success, cache the result for future idempotency checks
                self.cache.set(tx_id, result)
                return result
        except CircuitOpenError:
            # Fail fast so the agent knows to back off and wait
            return {"status": "SKIPPED", "reason": "ERP_OVERLOAD_PROTECTION"}
        except DataValidationError as e:
            # 3. DEAD-LETTER QUEUE (the "fix it later" pile)
            # Don't silently fail. Log the logic error for human review
            # to reconcile the agent's intent with the ERP's constraints.
            self.dlq.send(
                tx_id=tx_id,
                error=str(e),
                payload=payload
            )
            return {"status": "FLAGGED_FOR_HUMAN_REVIEW"}

Step 3: Regional Policy Packs

A single global index sounds efficient until you hit regional constraints. A "stock" rule in Europe might be legally different from the one in the US. Instead of hardcoding rules into prompts, use configuration files that inject region-specific constraints into the RAG context at runtime.

YAML
# policy_pack_na.yaml
policy_pack:
  name: "NA-shortage-triage"
  region: "NA"
  retrieval:
    allowed_indexes: ["na-sop", "na-incidents", "na-supplier-contracts"]
    metadata_filters:
      classification: ["internal"]
    max_doc_age_days: 365
  autonomy:
    mode: "recommend_only"  # Options: recommend_only | draft_actions | execute
    approval_required_for:
      - "expedite_spend"
      - "promise_date_change"

This approach lets you run "local RAG" (regionally scoped indexes) while keeping policy control centralized.

Step 4: The Security Checkpoint

The difference between a helpful AI and a security nightmare is access. Security experts list "excessive control" (giving the AI too much freedom) as a top risk. Always force the agent through a gateway that checks every request. Here are the two rules our gateway enforces before running any action (a minimal sketch follows the list):

• Role-based access: Just because the AI knows how to change a delivery date doesn't mean the user is allowed to do it. If a junior analyst asks the AI to delay a shipment, the gateway should check their job title and say, "Sorry, you don't have permission for that."
• The human-in-the-loop: For high-risk actions (like changing a confirmed purchase order), the AI should never act alone. It should draft the change, pause, and ping a human manager (via Slack or Teams). The action only executes once a human clicks "Approve."
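Here is a minimal sketch of such a gateway check. The role table, action names, and the request_approval helper are illustrative stand-ins for whatever RBAC and chat-ops tooling you already run, not part of the article's system.

Python
# Hypothetical permissions: which roles may trigger which ERP actions at all.
ROLE_PERMISSIONS = {
    "planner": {"adjust_forecast"},
    "supply_manager": {"adjust_forecast", "change_delivery_date", "change_confirmed_po"},
}
HIGH_RISK_ACTIONS = {"change_confirmed_po", "change_delivery_date"}

def request_approval(action: str, draft: dict) -> None:
    # Placeholder: send the drafted change to a human approver (Slack/Teams in practice).
    print(f"Approval requested for {action}: {draft}")

def gateway_check(user_role: str, action: str, draft: dict) -> dict:
    """Enforce role-based access and human-in-the-loop before any agent action runs."""
    if action not in ROLE_PERMISSIONS.get(user_role, set()):
        return {"status": "DENIED", "reason": f"{user_role} cannot perform {action}"}
    if action in HIGH_RISK_ACTIONS:
        # Draft only; a human must approve before execution.
        request_approval(action, draft)
        return {"status": "PENDING_APPROVAL", "draft": draft}
    return {"status": "ALLOWED"}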
Step 5: Managing the Human Boundary

Avoid over-reliance by designing for human engagement: force the agent to display evidence alongside its recommendations and require user feedback to close the loop. The gathered feedback can be appended to the agent's workflow instructions to produce better recommendations in the future.

Conclusion: From Clarity to Resilience

An agent is not a magic wand that automatically repairs data across fragmented ERPs. It is a distributed system that requires the same rigor as any other critical infrastructure. By using the above methodology, we move beyond "chatting with data" to building systems that are evidence-based, failure-resistant, and trusted to keep the supply chain running.

By Rahul Kumar Thatikonda
Selenium Test Automation Challenges: Common Pain Points and How to Solve Them

You have written your first Selenium test suite, watched it pass locally, and felt the satisfaction of automation success. Then you pushed it to CI. The next morning, half your tests failed for reasons that made no sense. Welcome to the real world of Selenium test automation.

Selenium remains one of the most widely adopted web automation frameworks for good reason. It offers unmatched flexibility, supports multiple programming languages, and benefits from a massive community that has been refining best practices for nearly two decades. But adopting Selenium is just the beginning. The real challenge starts when you scale beyond a handful of test cases and discover that writing tests is the easy part. Keeping them running reliably is where teams struggle.

Let's walk through the most common pain points that testing teams encounter with Selenium and provide actionable strategies to address each one. Some of these challenges are inherent to browser automation itself, while others stem from implementation decisions that seem reasonable at first but create problems at scale. Understanding the difference helps teams focus their efforts where it matters most.

The Flaky Test Problem

Flaky tests are the silent productivity killer in test automation. A test that passes 8 out of 10 times erodes trust in the entire suite. Teams start ignoring failures, re-running pipelines "just to see if it passes this time," or worse, disabling tests altogether. Once this pattern takes hold, the automation effort loses its value as a quality gate.

The root causes of flakiness almost always trace back to timing and synchronization:

• Your test executes faster than the application responds
• A button exists in the DOM but is not yet clickable
• An AJAX call has not returned before your assertion runs
• Network latency varies between local and CI environments
• Shared test environments create unpredictable data states

The problem compounds in CI environments. Locally, your machine is fast and consistent. In CI, you are competing for resources, network latency varies, and the application might behave slightly differently under load. A test that never fails on your laptop fails every third run in the pipeline.

Many teams attempt to solve this with implicit waits. They set a global timeout and assume they are covered. This approach creates more problems than it solves. Implicit waits apply to every element lookup, which slows down tests that would otherwise pass quickly. Worse, when you mix implicit waits with explicit waits, the behavior becomes unpredictable. Selenium does not handle this combination gracefully.

The solution is explicit waits with ExpectedConditions for specific element states. Rather than waiting a fixed duration and hoping the element is ready, you wait for exactly the condition you need:

• Wait for the element to be visible
• Wait for it to be clickable
• Wait for specific text to appear
• Wait for an element to disappear (like a loading spinner)

This approach is both faster and more reliable because you proceed as soon as the condition is met rather than waiting an arbitrary duration. For complex scenarios, fluent waits with custom polling intervals give you fine-grained control over how frequently to check the condition and which exceptions to ignore during the waiting period.

Consider this transformation. Instead of writing Thread.sleep(3000) and hoping three seconds is enough, you write an explicit wait that checks every 500 milliseconds for the element to become clickable and proceeds immediately when it does.
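As a rough illustration of that transformation (the locator and timeout values here are placeholders, using classes from org.openqa.selenium.support.ui):

Java
// Instead of Thread.sleep(3000), poll every 500 ms until the element is clickable
// and click it the moment it is ready. By.id("checkout") is an illustrative locator.
WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10), Duration.ofMillis(500));
wait.until(ExpectedConditions.elementToBeClickable(By.id("checkout"))).click();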
The test becomes both faster and more stable. Flaky tests are symptoms, not root causes. Investing time in proper synchronization strategies pays dividends in pipeline stability and team confidence.

Test Maintenance Becomes a Full-Time Job

The application under test evolves constantly. A designer tweaks a button class. A developer restructures a form layout. A new feature adds elements that shift the position of existing ones. Suddenly, 50 tests fail, and none of the failures represent actual bugs. Teams find themselves spending more time fixing tests than writing new ones, and the backlog of automation work grows while manual testing fills the gap.

This maintenance burden usually stems from brittle locators. When tests rely on absolute XPath expressions or auto-generated IDs that change with every build, any UI modification cascades through the test suite. The problem is compounded when tests are developed through copy-paste, creating dozens of files that all reference the same element in slightly different ways.

Locator selection follows a reliability spectrum, from most stable to least stable:

• Static IDs: Most reliable, but many applications generate IDs dynamically
• Data attributes (data-testid, data-qa): Nearly as reliable, requires developer collaboration
• CSS selectors: Good balance of stability and readability
• Relative XPath: Works when other options are unavailable
• Absolute XPath: Breaks constantly and should be avoided entirely

The Page Object Model transforms maintenance from a nightmare into a manageable task. When you encapsulate page interactions in dedicated classes, locators live in one place. A UI change requires updating a single file rather than hunting through dozens of test scripts. The test code itself reads like a description of user behavior rather than a collection of element lookups.

Beyond the structural benefits, POM encourages thoughtful locator design. When you consciously create a page object, you think about which elements matter and how to identify them reliably. You start conversations with developers about adding test attributes. You build helper methods that handle common interaction patterns, including the waiting logic discussed earlier.
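A bare-bones page object might look like the following; the LoginPage name and data-testid locators are invented for illustration.

Java
import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

// Hypothetical page object: locators and waiting logic live here, not in the tests.
public class LoginPage {
    private final WebDriver driver;
    private final WebDriverWait wait;
    private final By username = By.cssSelector("[data-testid='username']");
    private final By password = By.cssSelector("[data-testid='password']");
    private final By submit = By.cssSelector("[data-testid='login-submit']");

    public LoginPage(WebDriver driver) {
        this.driver = driver;
        this.wait = new WebDriverWait(driver, Duration.ofSeconds(10));
    }

    public void logInAs(String user, String pass) {
        wait.until(ExpectedConditions.visibilityOfElementLocated(username)).sendKeys(user);
        driver.findElement(password).sendKeys(pass);
        wait.until(ExpectedConditions.elementToBeClickable(submit)).click();
    }
}

A test then reads like user behavior — new LoginPage(driver).logInAs("analyst", "secret") — with no locators in sight.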
Collaboration with development teams makes a significant difference. When developers add data-testid attributes as a standard practice, the testing team gains stable anchors that survive CSS refactoring and layout changes. This small investment during development dramatically reduces automation maintenance downstream.

Regular locator audits should be part of sprint hygiene. Before a release, review which locators have broken most frequently and refactor them. Identify patterns that cause problems and establish conventions that prevent them. Treat your test code with the same care you would treat production code, because in many ways, it is production code.

Cross-Browser Testing Inconsistencies

Your test suite runs perfectly in Chrome. Then QA reports a bug in Safari. You run the same tests in Safari and discover three failures that have nothing to do with actual application bugs. The tests fail because browsers implement WebDriver commands differently, and what works seamlessly in one browser behaves unexpectedly in another.

These inconsistencies appear in surprising places:

• JavaScript execution timing varies across browser engines.
• Some CSS selectors that work in Chrome fail in Firefox.
• File upload handling differs between browsers.
• Alert and pop-up behavior is not consistent.
• Scroll calculations and viewport dimensions use different reference points.

Each browser has its quirks, and WebDriver cannot fully abstract them away. Selenium Grid solves the parallel execution problem but introduces infrastructure complexity. Maintaining browser versions across nodes, handling node failures gracefully, and managing resource allocation becomes a DevOps responsibility. For teams without dedicated infrastructure support, this overhead can be substantial.

A pragmatic approach to cross-browser testing includes:

• Starting with a realistic browser matrix based on actual user analytics rather than theoretical coverage
• Setting browser-specific capabilities and options intentionally
• Implementing conditional logic sparingly for known browser quirks
• Isolating workarounds to helper methods with clear documentation
• Considering cloud-based Selenium Grid services to offload infrastructure management

If 85% of your users are on Chrome and 10% are on Firefox, prioritizing those browsers makes sense. Testing on Internet Explorer because some internal stakeholders might use it wastes resources that could go toward higher-value coverage.

Cloud-based Selenium Grid services offer an alternative to self-hosted infrastructure. Platforms like Sauce Labs, BrowserStack, and LambdaTest maintain browser versions and handle scaling automatically. The trade-off is cost versus control, and the right choice depends on your team's resources and requirements.

Cross-browser testing reveals both application bugs and automation fragility. Before filing a bug report, verify that the failure represents an actual user-facing issue rather than an artifact of how WebDriver interacts with that particular browser.

The Hidden Cost of Free and Flexible

Selenium is free to download, but not free to implement. Teams consistently underestimate the investment required to build a production-ready automation framework. The flexibility that makes Selenium powerful also means it provides no opinions about how to structure your project, generate reports, or integrate with CI/CD pipelines.

Building a complete automation solution requires assembling multiple components:

• A test runner like TestNG, JUnit, or pytest to organize and execute tests
• Reporting mechanisms beyond console output for stakeholder visibility
• Configuration management for running tests across environments
• Data-driven testing infrastructure to manage test inputs and expected outputs
• Logging and screenshot capture on failure for debugging
• CI/CD pipeline integration
• Parallel execution infrastructure

Each of these capabilities requires code, libraries, and maintenance. A team starting from scratch might spend months building framework infrastructure before writing tests that validate application behavior.

The skill gap compounds the problem. Not every tester has a development background. Organizations often have domain experts who understand the application deeply and can design excellent test cases, but lack the programming fluency that pure Selenium requires. Creating a bottleneck where only senior developers can write and maintain automation tests limits the team's capacity and creates single points of failure.
Several approaches can bridge this gap:

• Structured training programs that build programming fundamentals alongside automation concepts
• Internal frameworks that abstract Selenium complexity behind simpler interfaces
• Establishing coding standards and code review processes to catch problems early
• Hybrid approaches where complex scenarios use code while simpler ones use keyword-driven methods
• Tools built on top of Selenium (such as Katalon Studio, Robot Framework, or similar platforms) that reduce the framework-building burden

The trade-off with pre-built platforms is less flexibility for reduced overhead, and the right balance depends on your team's composition and priorities. Framework maturity directly impacts team velocity, so the decision to build versus buy is a legitimate strategic question that deserves careful analysis.

Limited Built-in Reporting and Debugging

A test fails in CI. The log says "Element not found." Which element? At what step? What did the page look like at that moment? Was it a locator problem, or did the page fail to load entirely? Out of the box, Selenium provides minimal context for understanding failures.

This debugging struggle wastes significant time. Stack traces point to code lines but reveal nothing about the application state. Developers receiving bug reports from automation need to reproduce failures manually because the automated results do not provide enough information to diagnose the problem. Tests that fail intermittently are nearly impossible to investigate without additional tooling.

Effective reporting infrastructure requires deliberate investment:

• Screenshot capture on every failure (automatic, not optional)
• Step-by-step execution logs with timestamps
• Integration with reporting platforms like Allure, ExtentReports, or TestRail
• Video recording for complex failure scenarios
• Network request logging to identify application versus test issues
• Environment and browser metadata captured with every test run

When a test fails, you should see exactly what the browser displayed at that moment. Knowing that a failure occurred on Chrome 120 in a Linux container helps narrow down browser-specific issues. Without this context, debugging becomes guesswork.

Most teams deprioritize reporting infrastructure until failures become unmanageable. By then, they have accumulated technical debt that makes the problem harder to solve. Building reporting capabilities early, even if simple, pays dividends as the test suite grows. Automatic screenshots on failure, sketched below, are a good first step.
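As an example of the first item on that list, here is roughly what an automatic screenshot-on-failure hook could look like with TestNG. The DriverFactory helper and the target directory are assumptions about your own framework, not Selenium or TestNG built-ins.

Java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;
import org.testng.ITestListener;
import org.testng.ITestResult;

// Hypothetical TestNG listener: saves a screenshot whenever a test fails.
public class ScreenshotOnFailureListener implements ITestListener {

    @Override
    public void onTestFailure(ITestResult result) {
        WebDriver driver = DriverFactory.getCurrentDriver();  // assumed framework helper
        File shot = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
        try {
            Path target = Path.of("build/screenshots", result.getName() + ".png");
            Files.createDirectories(target.getParent());
            Files.copy(shot.toPath(), target);
        } catch (Exception e) {
            // Never let reporting break the test run itself.
            System.err.println("Could not save screenshot: " + e.getMessage());
        }
    }
}

Register the listener in testng.xml or with @Listeners, and every failed test leaves behind exactly what the browser was showing.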
Scaling Beyond a Single Machine

The test suite started small. You wrote tests as features were developed, and each one added value. Now the suite has grown to 500 tests. Running them sequentially takes four hours. Feedback loops that once took minutes now stretch to half a day. Developers stop waiting for results and merge changes without knowing whether tests passed.

This scaling problem has both infrastructure and design dimensions. On the infrastructure side, you have several options with different trade-offs:

• Local parallel execution: Helps but quickly hits hardware limits (typically 4 to 8 browser instances)
• Self-hosted Selenium Grid: Distributes execution but requires DevOps investment
• Container-based approaches: Docker simplifies maintenance but has its own learning curve
• Cloud-based grids: Eliminates infrastructure management at the cost of per-minute pricing

The design dimension is equally important. Tests must be truly independent to run in parallel. Shared state between tests creates race conditions that are even harder to debug than single-threaded flakiness. Designing for parallelism requires:

• Database fixtures with isolation strategies so tests do not step on each other's data
• Thread-safe login and session handling
• Test data generation that avoids conflicts when multiple tests run concurrently
• No assumptions about execution order

Retrofitting parallelism onto a test suite designed for sequential execution is painful. Tests that worked reliably when running alone fail randomly when running alongside others. The fix often requires significant refactoring to eliminate shared state and ensure true independence. Building for parallelism from the beginning is far easier than adding it later. Even if you initially run tests sequentially, designing them as independent units positions you for scaling when the need arises.

Choosing the Right Approach

Selenium remains the foundation of web automation for good reason. Its flexibility supports complex scenarios that simpler tools cannot handle. Browser vendor support ensures compatibility with new releases. The ecosystem of libraries and integrations is unmatched. But flexibility comes with responsibility, and not every team is positioned to build everything from scratch.

Several questions help clarify the right path forward:

• What is your team's technical skill distribution?
• How much infrastructure can you realistically maintain?
• Where is your time actually going: writing tests or maintaining the framework?
• What feedback loop duration is acceptable for your development process?

The spectrum of solutions ranges from pure Selenium for teams with strong engineering capability and unique requirements, through Selenium-based platforms that reduce overhead while preserving flexibility, to alternative frameworks like Playwright or Cypress that make different trade-offs entirely. Each position on this spectrum represents a valid choice for different contexts. The goal is not to find the theoretically best tool but to find the approach that delivers reliable feedback quickly enough to improve your software development process. That answer differs for every team.

Moving Forward

The challenges covered here are real, but they are also solvable. Every mature testing organization has faced them. The difference between teams that succeed with automation and teams that struggle is not talent or budget. It is the willingness to treat test infrastructure as a first-class engineering concern that deserves thoughtful investment.

No team solves all these problems simultaneously. Prioritize based on current pain:

• If flaky tests dominate your pipeline, focus on synchronization strategies.
• If maintenance burden consumes your capacity, invest in page objects and a locator strategy.
• If feedback loops are too slow, tackle parallelization.
• If debugging takes too long, build a reporting infrastructure.

Each improvement builds on the previous one. The goal is not perfect automation. Perfection is unattainable, and pursuing it wastes resources. The goal is automation that provides reliable feedback fast enough to be useful for the developers and testers who depend on it. That is achievable with intentional, sustained investment in both the tests themselves and the infrastructure that supports them.

By Oliver Howard
Mastering Fluent Bit: Developer Guide to Routing to Prometheus (Part 13)

This series is a general-purpose getting-started guide for those of us wanting to learn about the Cloud Native Computing Foundation (CNCF) project Fluent Bit. Each article in this series addresses a single topic by providing insights into what the topic is, why we are interested in exploring that topic, where to get started with the topic, and how to get hands-on with learning about the topic as it relates to the Fluent Bit project. The idea is that each article can stand on its own, but that they also lead down a path that slowly increases our abilities to implement solutions with Fluent Bit telemetry pipelines. Let's take a look at the topic of this article, integrating Fluent Bit with Prometheus.

In case you missed the previous article, check out the developer guide to telemetry pipeline routing, where you explore how to direct telemetry data to different destinations based on tags, patterns, and conditions. This article will be a hands-on exploration of Prometheus integration that helps you, as a developer, leverage Fluent Bit's powerful metrics capabilities. We'll look at the first of three essential patterns for integrating Fluent Bit with Prometheus in your observability infrastructure. All examples in this article have been done on OSX and assume the reader is able to convert the actions shown here to their own local machines.

What Is Prometheus Integration?

Before diving into the hands-on examples, let's understand why Prometheus integration matters for Fluent Bit users. Prometheus is the de facto standard for metrics collection and monitoring in cloud native environments. It's another CNCF graduated project that provides a time-series database optimized for operational monitoring. The combination of Fluent Bit's lightweight, high-throughput telemetry pipeline with Prometheus's battle-tested metrics storage creates a powerful observability solution.

Fluent Bit provides several ways to integrate with Prometheus. You can expose metrics endpoints that Prometheus can scrape (pull model), push metrics directly to Prometheus using the remote write protocol, or even scrape existing Prometheus endpoints and route those metrics through your telemetry pipeline. This flexibility allows Fluent Bit to act as a metrics aggregator, forwarder, or even a replacement for dedicated metrics agents in resource-constrained environments.

Why Integrate With Prometheus?

There are several compelling reasons to integrate Fluent Bit with Prometheus in your infrastructure. First, Fluent Bit can collect system-level metrics using its built-in Node Exporter Metrics plugin, eliminating the need to deploy a separate Prometheus Node Exporter. This reduces resource usage and simplifies your deployment.

Second, Fluent Bit can monitor itself and expose internal pipeline metrics, giving you visibility into the health and performance of your telemetry infrastructure. Understanding how your telemetry pipeline is performing is critical for maintaining reliable observability. This will be covered in a future article.

Third, Fluent Bit can act as a metrics proxy, scraping metrics from various sources and forwarding them to Prometheus. This is particularly useful when you need to aggregate metrics from multiple sources or transform them before they reach Prometheus. This will be explored in a future article.

Let's dive into the first pattern, collecting system-level metrics using the built-in Node Exporter Metrics plugin.
Where to Get Started

You should have explored the previous articles in this series to install and get started with Fluent Bit on your developer's local machine, either using the source code or container images. Links at the end of this article will point you to a free hands-on workshop that lets you explore more of Fluent Bit in detail. You can verify that you have a functioning installation by testing your Fluent Bit, using either a source installation or a container installation, as shown below:

Plain Text
# For source installation.
$ fluent-bit -i dummy -o stdout

# For container installation.
$ podman run -ti ghcr.io/fluent/fluent-bit:4.2.2 -i dummy -o stdout
...
[0] dummy.0: [[1753105021.031338000, {}], {"message"=>"dummy"}]
[0] dummy.0: [[1753105022.033205000, {}], {"message"=>"dummy"}]
[0] dummy.0: [[1753105023.032600000, {}], {"message"=>"dummy"}]
[0] dummy.0: [[1753105024.033517000, {}], {"message"=>"dummy"}]
...

Let's explore the three Prometheus integration patterns that will help you with your observability infrastructure.

How to Integrate With Prometheus

See this article for details about the service section of the configurations used in the rest of this article; for now, we focus on our Fluent Bit pipeline and specifically the Prometheus integration capabilities that can be of great help in managing metrics in your observability stack. Picture the phases of a telemetry pipeline: metrics collected by input plugins flow through the pipeline and can be routed to Prometheus-compatible outputs. Understanding how metrics flow through Fluent Bit's pipeline is essential for effective Prometheus integration. Input plugins collect metrics, which then pass through filters for transformation, before being routed to output plugins that deliver metrics to Prometheus.

Now, let's look at the first of the three most useful Prometheus integration patterns that developers will want to know about.

1. Routing Metrics Through Fluent Bit to Prometheus

The first integration pattern involves collecting host-level metrics using Fluent Bit's built-in Node Exporter Metrics plugin and exposing them to Prometheus for scraping. This pattern is incredibly valuable because it allows you to collect system metrics without deploying a separate Prometheus Node Exporter agent.

The Node Exporter Metrics input plugin implements a subset of the collectors available in the original Prometheus Node Exporter. It collects CPU statistics, memory usage, disk I/O, network interface statistics, filesystem information, and more. The beauty of this approach is that all these metrics flow through Fluent Bit's pipeline, where you can transform, filter, and route them as needed.

To demonstrate this pattern, let's create a configuration file called fluent-bit.yaml that collects host metrics and exposes them through a Prometheus endpoint:

YAML
service:
  flush: 1
  log_level: info
  http_server: on
  http_listen: 0.0.0.0
  http_port: 2020
  hot_reload: on

pipeline:
  inputs:
    - name: node_exporter_metrics
      tag: node_metrics
      scrape_interval: 2

  outputs:
    - name: prometheus_exporter
      match: node_metrics
      host: 0.0.0.0
      port: 2021
      add_label:
        - app fluent-bit
        - environment development

    # testing output to console
    - name: stdout
      match: node_metrics
      format: json_lines

Our configuration uses the node_exporter_metrics input plugin to collect system metrics every two seconds. The prometheus_exporter output plugin then exposes these metrics on port 2021 in a format that Prometheus can scrape.
We've also added the custom labels app and environment that will be attached to all metrics, making it easier to filter and query them in Prometheus. Let's run this configuration as follows:

Plain Text
# For source installation.
$ fluent-bit --config fluent-bit.yaml

# For container installation after building a new image with your
# configuration using a Buildfile as follows:
#
# FROM ghcr.io/fluent/fluent-bit:4.2.2
# COPY ./fluent-bit.yaml /fluent-bit/etc/fluent-bit.yaml
# CMD [ "fluent-bit", "-c", "/fluent-bit/etc/fluent-bit.yaml" ]
#
$ podman build -t fb -f Buildfile

# Note: For container deployments collecting linux host metrics, you need
# to mount the host's /proc and /sys filesystems:
# $ podman run --rm -v /proc:/host/proc:ro -v /sys:/host/sys:ro -p 2021:2021 fb

$ podman run --rm fb
...
[2026/01/19 15:25:47.115361000] [ warn] [input:node_exporter_metrics:node_exporter_metrics.0] calling IORegistryEntryGetChildEntry is failed
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="0",mode="user"} = 25039.200000000001
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="0",mode="system"} = 9067.2999999999993
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="0",mode="nice"} = 0
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="0",mode="idle"} = 48662.790000000001
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="1",mode="user"} = 23096.860000000001
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="1",mode="system"} = 7764.7299999999996
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="1",mode="nice"} = 0
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="1",mode="idle"} = 52016.459999999999
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="2",mode="user"} = 20056.130000000001
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="2",mode="system"} = 6364.9700000000003
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="2",mode="nice"} = 0
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="2",mode="idle"} = 56597.839999999997
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="3",mode="user"} = 17696.98
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="3",mode="system"} = 5385.8999999999996
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="3",mode="nice"} = 0
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="3",mode="idle"} = 60055.519999999997
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="4",mode="user"} = 412.75
2026-01-19T14:25:47.114513143Z node_cpu_seconds_total{cpu="4",mode="system"} = 116.18000000000001
...

Note that the warning entry is an OSX-specific issue with the node_exporter_metrics input plugin. The plugin tries to collect system metrics similar to the Prometheus Node Exporter; on OSX, it uses Apple's IOKit framework to access hardware information through the IORegistry (a hierarchical database of hardware devices). Our console output for testing shows all the available metrics about this machine, collected on the configured two-second scrape interval. This gives us something to work with and query once it's sent to a Prometheus backend.

Now, verify that the metrics are being tagged with the custom labels by opening a browser window to http://localhost:2021/metrics, where we should see the following:

Plain Text
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="0",mode="user"} 25095.049999999999 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="0",mode="system"} 9092.3299999999999 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="0",mode="nice"} 0 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="0",mode="idle"} 48874.290000000001 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="1",mode="user"} 23145.150000000001 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="1",mode="system"} 7784.9700000000003 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="1",mode="nice"} 0 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="1",mode="idle"} 52240.839999999997 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="2",mode="user"} 20091.23 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="2",mode="system"} 6379.5500000000002 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="2",mode="nice"} 0 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="2",mode="idle"} 56841.639999999999 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="3",mode="user"} 17723.43 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="3",mode="system"} 5396.7299999999996 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="3",mode="nice"} 0 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="3",mode="idle"} 60312.18 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="4",mode="user"} 412.89999999999998 node_cpu_seconds_total{app="fluent-bit",environment="development",cpu="4",mode="system"} 116.25 ... Notice how the metrics include our custom labels app="fluent-bit" and environment="development". These labels are automatically added to every metric by the Prometheus exporter output plugin, making it easy to identify and filter metrics in your Prometheus queries. To integrate this with Prometheus, add a scrape configuration to your Prometheus configuration file prometheus.yml as follows: YAML scrape_configs: - job_name: 'fluent-bit-node-metrics' static_configs: - targets: ['localhost:2021'] scrape_interval: 10s This configuration tells Prometheus to scrape the Fluent Bit metrics endpoint every 10 seconds. The metrics will then be available for querying in Prometheus and can be visualized in the Prometheus console or using the Perses project for dashboards. The Node Exporter Metrics plugin supports numerous collectors, including CPU, disk I/O, filesystem, load average, memory, network interface, and more. You can selectively enable or disable collectors based on your monitoring needs, and set individual scrape intervals for each collector type. It's left to the reader to run their own Prometheus instance with this configuration and to explore the collected metrics telemetry data. A primer to do this, if you need help, can be found in this hands-on free online Prometheus workshop. More in the Series In this article, you explored the first of three powerful patterns for integrating Fluent Bit with Prometheus: collecting and exposing host metrics. In the following article, we will continue to look at monitoring Fluent Bit's internal pipeline health and using Fluent Bit as a metrics proxy with remote write capabilities. This article is based on this online free workshop. 
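As a quick scripted alternative to the browser check above, the short Python snippet below confirms that the custom labels survive all the way to the scrape endpoint. It is a minimal sketch, assuming the fluent-bit.yaml configuration shown earlier is running locally and that port 2021 is reachable; adjust the URL if you published a different port from the container.

Python
import urllib.request

# Quick sanity check of the prometheus_exporter endpoint configured above.
METRICS_URL = "http://localhost:2021/metrics"  # assumption: local run, port 2021

with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
    body = resp.read().decode("utf-8")

# Keep only the CPU series that carry the custom labels from the config.
labelled = [line for line in body.splitlines()
            if line.startswith("node_cpu_seconds_total")
            and 'app="fluent-bit"' in line
            and 'environment="development"' in line]

print(f"{len(labelled)} node_cpu_seconds_total series carry the custom labels")
for line in labelled[:4]:
    print(line)

If the count comes back as zero, the add_label entries in the output plugin are the first place to look.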
There will be more in this series as you continue to learn how to configure, run, manage, and master the use of Fluent Bit in the wild. Next up, we'll explore monitoring Fluent Bit's internal pipeline health.

By Eric D. Schabell DZone Core CORE
Modern Vulnerability Detection: Using GNNs to Find Subtle Bugs
Modern Vulnerability Detection: Using GNNs to Find Subtle Bugs

For over 20 years, static application security testing (SAST) has been the foundation of secure coding. However, beneath the surface, many legacy SAST tools still operate using basic techniques such as regular expressions and lexical pattern matching; essentially, sophisticated versions of the Unix command grep. As a result, most SAST tools suffer from what I call “false positive fatigue.” These tools report every occurrence of a strcpy() (or similar) regardless of whether the buffer is mathematically proven to be safe. This article explores an innovative method for detecting vulnerabilities using graph neural networks (GNNs). In contrast to viewing source code as a linear string of characters, GNNs represent code as a structured graph of logical and data-flow structures. As such, we can now develop models that understand how a user’s input at line 10 in the code ultimately relates to a database query at line 50, even when variable names are changed three times between those two points in the code. Why Traditional Tools Fail Traditional SAST tools fail to identify vulnerabilities due to their flat representation of code. For example, consider this Python code snippet: Python def get_user_data(user_input): sanitized = clean_input(user_input) # ... 50 lines of complex logic ... cursor.execute("SELECT * FROM users WHERE name = " + sanitized) A Regex-based tool identifies the SQL injection threat due to the string concatenation in a SQL query. The tool does not recognize the clean_input() function because it cannot understand the data flow across the function boundaries. However, a GNN can model the path that the variable follows. Therefore, if the path includes a sanitization node, the GNN can learn to classify it as “safe.” Transforming Code to Graphs: Understanding the CPG To utilize GNNs, we first convert source code into a math-friendly format known as a Code Property Graph (CPG). The CPG merges three classic graph types: Abstract Syntax Tree (AST) – the syntax tree of the source code (loops, if statements)Control Flow Graph (CFG) – the order in which the source code executes (Path A vs. Path B)Program Dependence Graph (PDG) – the dependencies between variables (data flow) Therefore, the CPG is a rich graph where the nodes represent code elements (variable declarations, operator assignments), and the edges represent relationships between those code elements (“calls,” “defines,” “depends on”). Tools: Generating Graphs With Joern Fortunately, you do not need to write a parser from scratch to create a CPG. Joern is an open-source tool that parses C/C++, Java, and Python and generates CPGs. Shell # Install Joern (Linux/Mac) ./joern-install.sh # Convert a source file into a CPG joern-parse --output cpg.bin ./my_vulnerable_app/ You can then export the generated graph to a format your neural network can read (e.g., a CSV file of nodes and edges). The GNN Model: Learning “Shapes” of Bugs After generating a graph, you can feed it into a graph neural network (GNN). Similar to a standard neural network that expects a static-sized image or text vector, a GNN utilizes a technique called “message passing.” Node embedding: Every node (for example, x = 5) receives a vector to represent its node type (assignment) and content.Message passing: Every node communicates with its neighboring nodes. The x variable node sends a message to the if (x > 0) node.Aggregation: After 3-4 iterations of passing messages, every node “knows” about its immediate neighborhood. 
The SQL Execute node "knows" it was connected to a User Input node four hops away.

Implementation Example (PyTorch Geometric)

Below is a simplified version of a GNN model to classify vulnerabilities using the popular PyTorch Geometric library.

Python
import torch
from torch_geometric.nn import GCNConv, global_mean_pool
import torch.nn.functional as F

class VulnDetectorGNN(torch.nn.Module):
    def __init__(self, num_node_features, num_classes):
        super(VulnDetectorGNN, self).__init__()
        # Graph Convolutional Layers
        self.conv1 = GCNConv(num_node_features, 64)
        self.conv2 = GCNConv(64, 64)
        self.conv3 = GCNConv(64, 64)
        # Final classifier
        self.linear = torch.nn.Linear(64, num_classes)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch

        # 1. Message Passing Layers (with ReLU activation)
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.conv2(x, edge_index)
        x = F.relu(x)
        x = self.conv3(x, edge_index)

        # 2. Global Pooling (Aggregate all nodes into 1 graph vector)
        x = global_mean_pool(x, batch)

        # 3. Classifier (Safe vs. Vulnerable)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.linear(x)

        return F.log_softmax(x, dim=1)

# Create Model
model = VulnDetectorGNN(num_node_features=100, num_classes=2)
print(model)

Dataset: CodeXGLUE

You cannot train a model without data. The CodeXGLUE dataset provided by Microsoft is widely accepted within the industry as the standard dataset for developing models that can detect vulnerabilities. CodeXGLUE contains thousands of C/C++ functions labeled as either "Vulnerable" or "Safe" based upon actual CVEs found in well-known open-source projects such as FFmpeg and QEMU.

Training tip: Because real-world vulnerabilities are rare (possibly 1 out of every 1,000 functions), you must balance your dataset (over-sample the vulnerable functions), or your model will simply guess "Safe" 99.9% of the time and claim a high degree of accuracy. A sketch of one way to do this appears after the conclusion below.

Application of GNNs in DevSecOps

While engineers should not discard their legacy SAST tools, such as Fortify or Checkmarx, GNNs are best utilized as a second opinion.

Triage assistant: Run standard SAST on your application. Take the top 500 "High" findings and pass them through a GNN model trained specifically on your codebase's history of "False Positive" vulnerabilities.
Filter: If the GNN indicates that the finding appears to resemble one of the 500 "False Positive" vulnerabilities that were identified previously, then mark it as low-priority.
Custom rules: Utilize GNNs to discover vulnerabilities that cannot be identified by regex-based rules, including complex logic vulnerabilities or missing authorization checks spanning multiple files.

Conclusion

The future of vulnerability detection will be driven by semantics (the meaning behind the code) rather than syntax (how the code is written). By modeling source code as a graph, we can better capture the author's intent. Although GNNs consume more computational resources than a simple regex-based script, their reduced false positive rates make them a valuable addition to the current array of application security tools.
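Following up on the training tip above, the snippet below shows one minimal way to over-sample the vulnerable class when training the model shown earlier. It is a sketch under stated assumptions, not part of the original example: it assumes dataset is a Python list of labeled torch_geometric.data.Data graphs (0 = safe, 1 = vulnerable) and combines PyTorch's WeightedRandomSampler with PyTorch Geometric's DataLoader.

Python
import torch
from torch.utils.data import WeightedRandomSampler
from torch_geometric.loader import DataLoader

def make_balanced_loader(dataset, batch_size=32):
    """Build a DataLoader that over-samples the rare 'Vulnerable' class.

    `dataset` is assumed to be a list of torch_geometric.data.Data graphs,
    each with a scalar label in data.y (0 = safe, 1 = vulnerable).
    """
    labels = torch.tensor([int(data.y) for data in dataset])
    class_counts = torch.bincount(labels, minlength=2).float()
    # Inverse-frequency weights: the rare class gets proportionally more draws.
    sample_weights = 1.0 / class_counts[labels]
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

def train_epoch(model, loader, optimizer):
    """One epoch over the balanced loader, using the log-probabilities from VulnDetectorGNN."""
    model.train()
    total_loss = 0.0
    for batch in loader:
        optimizer.zero_grad()
        out = model(batch)
        loss = torch.nn.functional.nll_loss(out, batch.y.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * batch.num_graphs
    return total_loss / len(loader.dataset)

Evaluate on an untouched, naturally imbalanced validation split so the reported accuracy reflects the real class distribution.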

By Rahul Karne
A Generic MCP Database Server for Text-to-SQL
A Generic MCP Database Server for Text-to-SQL

Text-to-SQL is quickly becoming one of the most practical applications of large language models (LLMs). The idea is appealing: write a question in plain English, and the system generates the correct SQL query. But in practice, the results are mixed. Without structured schema information, models often: Hallucinate tables or columns that don’t exist.Struggle with ambiguous names.Overload on too much schema context at once. This is where a generic MCP (Model Context Protocol) database server comes in. Using YAML-based schema contracts, the server provides clear, consistent schema definitions that guide the model. Combined with a two-step prompting approach and strong security guardrails, this framework makes Text-to-SQL both accurate and safe. High-Level Architecture and Query Flow At a high level, the architecture looks like this: Client – A UI, portal, or application where a user asks a question in natural language.MCP Client – Acts as a broker, sending the NLP query to the LLM and coordinating with the MCP Server.LLM – Converts natural language into SQL, guided by schema metadata from YAML.MCP Server – Validates queries, manages schema context, and executes SQL safely.Database – Runs the actual query (using a read-only connection).Results – Flow back through the MCP Client to the user. Architecture View Server: Client: The SQL query is never executed directly by the LLM. Instead, it is validated and run by the MCP Server, ensuring guardrails are in place. Core Components of the MCP Server The MCP Server is built from modular, pluggable components: Shared Core Platform AuthN and AuthZ – Authentication and role-based authorization.Row and column security – Restrict sensitive data.Observability and telemetry – Metrics, logs, and tracing.Cache – Faster schema lookups.Guardrails – Query validation, throttling, and rate limiting.Governance – Policy tags, PII masking, and compliance rules.Executor – Timeouts, retries, and resource limits.Audit – Full logging for compliance. YAML Schema Contract Tables, columns, and semantic roles (dimension/measure/grain).Metadata: descriptions, synonyms, and value examples.Dictionaries: value sources with normalization rules.Optional policy metadata like masking tags. Pluggable Adapters Works across Redshift, Snowflake, PostgreSQL, MySQL, BigQuery, DynamoDB, and more. Component breakdown: What the YAML Does The YAML schema file is the beating heart of the system. Each schema gets one YAML file, and that file includes: Tables and columns – With types and semantic rolesDescriptions – Human-readable explanations for columnsSynonyms – Alternate names for natural language matchingValue examples – Sample data to improve groundingValue sources – Queries to hydrate canonical dictionariesNormalization rules – Enforce casing, trimming, and formatting Sample Column Definition YAML - name: country_name type: varchar description: Standardized country name nullable: false role: dimension synonyms: - nation - region - geography examples: - United States - India - Germany source: db: "{{ env.schema }.dim_country" query: > select distinct trim(country_name) as country_name from {{ env.schema }.dim_country where country_name is not null order by country_name refresh: schedule: "0 2 * * *" ttl_minutes: 1440 enrichment: normalize: case: upper This tells the system everything it needs to know about the country_name column: its type, meaning, synonyms, examples, and valid values. 
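To show how a contract like this can be consumed in practice, here is a small Python sketch that reads a column definition and applies its normalization rule. The file name schema_contract.yaml, the flat-list layout, and the helper names are illustrative assumptions rather than the MCP server's actual code; real contracts will typically nest columns under their tables.

Python
import yaml  # PyYAML

def load_column_contract(path: str, column: str) -> dict:
    """Load one column definition from a schema contract YAML file.

    Assumes the file's top level is a list of column entries like the
    country_name sample above; adapt the lookup to your real layout.
    """
    with open(path, "r", encoding="utf-8") as f:
        contract = yaml.safe_load(f)
    for entry in contract:
        if entry.get("name") == column:
            return entry
    raise KeyError(f"Column '{column}' not found in {path}")

def normalize_value(raw: str, column_def: dict) -> str:
    """Apply the contract's enrichment/normalization rules to a raw value."""
    value = raw.strip()
    case_rule = column_def.get("enrichment", {}).get("normalize", {}).get("case")
    if case_rule == "upper":
        value = value.upper()
    elif case_rule == "lower":
        value = value.lower()
    return value

if __name__ == "__main__":
    col = load_column_contract("schema_contract.yaml", "country_name")  # hypothetical file
    print(col["synonyms"])                   # ['nation', 'region', 'geography']
    print(normalize_value("  india ", col))  # 'INDIA'

The same parsed structure can then feed the table-selection and column-loading prompts described in the next section.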
Two-Step Prompting to Reduce Hallucination

Dumping entire schemas into an LLM prompt is a recipe for hallucination. Instead, the server uses a two-step strategy:

Step 1: Narrow Down Tables

Prompt: "Here is a list of tables with descriptions. Based on the user's question, which tables are relevant?"

Example:

Plain Text
users: Information about registered application users
sessions: Tracks user login sessions and activity durations
logs: System event logs with timestamps and error details
features: Metadata about application features and modules

User question: "How many login sessions ended with an error in the past week?"

LLM response: Relevant tables: sessions, logs

Step 2: Load Relevant Columns Only

Prompt: "Here are the detailed column definitions for the relevant tables. Write a SQL query to answer the question."

Example:

Plain Text
sessions:
  session_id: bigint – unique session identifier
  user_id: bigint – foreign key to users.user_id
  start_time: timestamp – when the session began
  end_time: timestamp – when the session ended
logs:
  log_id: bigint – unique log entry identifier
  session_id: bigint – links to sessions.session_id
  event_type: varchar – type of event (login, error, logout)
  event_time: timestamp – time when the event occurred

LLM output:

SQL
SELECT COUNT(DISTINCT s.session_id)
FROM sessions s
JOIN logs l ON s.session_id = l.session_id
WHERE l.event_type = 'error'
  AND s.end_time >= CURRENT_DATE - INTERVAL '7 days';

This reduces context size, improves accuracy, and avoids hallucinations.

Guardrails for Safety

Letting an LLM generate queries requires strict controls. The MCP Server enforces:

Security Guardrails
Read-only database user – No write permissions
SELECT-only validation – Rejects DELETE, UPDATE, INSERT, and DROP (a minimal example of such a check follows after the conclusion)
Timeouts and limits – Prevent runaway queries
Row and column security – Filters sensitive fields
Audit logging – Every query is recorded

This ensures Text-to-SQL can be deployed in production without risking data corruption.

Deployment Model

The deployment is lightweight: you only need two inputs per tenant:
YAML schema – Tables, columns, semantics, and dictionaries.
Database connection – Engine, credentials/role, and search path.

The rest is shared and immutable across tenants:
Same container image
Same tooling and APIs
Uniform runtime environment

Key Benefits

Fast onboarding – Add a new database in minutes.
Lower ops cost – No per-tenant rebuilds.
Uniform security – Same hardened runtime across all tenants.
Zero code rebuilds – Add tenants by config, not engineering effort.
Observability – Query performance and errors are fully tracked.
Consistency – LLMs always see the same schema definitions.

End-to-End Example Flow

A user asks: "How many login sessions ended with an error in the past week?"
MCP Client sends the query to the LLM.
LLM first sees only table names + descriptions.
LLM selects sessions, logs.
MCP Client then loads column metadata for those tables.
LLM generates a SQL query.
MCP Server validates (SELECT-only, read-only).
Database executes the query.
Results are returned to the MCP Client.
User sees the final result in the UI.

The process is transparent, auditable, and safe.

Conclusion

Text-to-SQL promises a more natural way to interact with data, but it needs structure and safeguards to work in production.
A generic MCP database server, powered by YAML schema contracts and a two-step prompting strategy, delivers exactly that: Structured YAML → Tables, columns, semantics, and dictionaries.Two-step prompts → Reduce hallucination by filtering tables first.MCP server guardrails → Enforce SELECT-only, read-only execution.Pluggable adapters → Support for PostgreSQL, MySQL, Redshift, Snowflake, and more.Fast deployment → Add new schemas with YAML + connection only. Instead of brittle integrations and scattered documentation, you get a scalable, extensible framework that brings Text-to-SQL into real-world enterprise use cases. With YAML-driven contracts, clear prompts, and built-in governance, the generic MCP database server makes natural language access to databases safe, reliable, and future-proof.
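To close, here is one way the SELECT-only guardrail mentioned above could look in practice. This is a sketch, not the MCP Server's actual implementation: it uses the sqlparse library and a conservative keyword deny-list, and it assumes the read-only database user remains the primary line of defense.

Python
import re
import sqlparse  # pip install sqlparse

FORBIDDEN = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "TRUNCATE", "CREATE", "GRANT"}

def validate_select_only(sql: str) -> str:
    """Reject anything that is not a single read-only SELECT statement.

    Returns the trimmed SQL if it passes, raises ValueError otherwise.
    Defense in depth only; the database role should still be read-only.
    """
    statements = [s for s in sqlparse.parse(sql) if str(s).strip()]
    if len(statements) != 1:
        raise ValueError("Exactly one statement is allowed")
    stmt = statements[0]
    if stmt.get_type() != "SELECT":
        raise ValueError(f"Only SELECT is allowed, got {stmt.get_type()}")
    upper = str(stmt).upper()
    if any(re.search(rf"\b{kw}\b", upper) for kw in FORBIDDEN):
        raise ValueError("Statement contains a forbidden keyword")
    return str(stmt).strip()

# The query from the two-step prompting example passes this check,
# while "DELETE FROM logs" raises ValueError before ever reaching the database.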

By Baskar Sikkayan
From Test Automation to Autonomous Quality: Designing AI Agents for Data Validation at Scale
From Test Automation to Autonomous Quality: Designing AI Agents for Data Validation at Scale

For a long time, quality engineering has been about building better nets to catch bugs after they fall out of the system. We wrote more tests, added more rules, and built bigger dashboards. And for a while, that worked. Then data systems grew teeth. Modern platforms now consume hundreds of data sources, handle millions of events per minute, support machine learning models, personalize user experiences, and inform business decisions in real time. At this scale, quality issues aren’t just bugs — they become systemic problems. Small changes to a schema, a missing field, or a slight variation in a data pattern can cascade through analytics systems and even impact revenue. In this world, traditional test automation starts to look less like a safety net and more like a static photograph of a moving object. This is where we believe the next shift is happening: from test automation to autonomous quality. Not more tests. Not more rules. But systems that actively observe, reason, adapt, and respond to the behavior of data itself. Why Test Automation Stops Scaling Classic test automation is built on a simple idea: if I can define the expected behavior, I can assert it. This works well for deterministic systems: APIs with fixed contractsWorkflows with known pathsInputs and outputs that change slowly But data platforms violate all these assumptions: Schemas evolve constantly.New sources appear and old ones disappear.Behavior changes gradually, not in discrete releases.Failures are often statistical, not binary. The hardest data issues aren’t “wrong values.” They’re shifts: A metric drifting slowly upwardA distribution becoming skewedA field becoming increasingly sparseA pattern that used to be normal becoming rare These shifts don’t trigger traditional tests — they trigger consequences. By the time someone notices, the damage is already done. That is why we need systems that don’t just validate but observe. What I Mean by an “AI Quality Agent” When I say AI agent, I don’t mean a magical black box that replaces engineers. I mean a system with four core capabilities: Observation: Continuously watches data flows, not just samples them.Understanding: Learns what “normal” looks like for each dataset.Reasoning: Detects when behavior meaningfully deviates.Action: Responds in ways that prevent or reduce harm. Think of it less like testing and more like an immune system. A quality agent doesn’t check a single record and ask, “Is this valid?” It watches the system and asks, “Is this behaving like itself?” This shift — from validating facts to monitoring behavior — is the core change. A Simple Reference Model The mental model I use for autonomous quality systems is: Data Flow → Observation → Pattern Learning → Anomaly Detection → Decision → Feedback Let’s unpack it. 1. Observation The agent passively watches: Event streamsDatabase changesSchema evolutionVolume, latency, null rates, cardinality, distributions This isn’t logging — it’s sensing. 2. Pattern Learning Over time, the agent learns: Which fields normally existHow often values appearWhat ranges are typicalWhich combinations occur together This becomes a living baseline, not a static specification. 3. Anomaly Detection The agent can now spot: Sudden drops or spikesGradual driftsNew, unseen patternsDisappearing signals Not every anomaly is a problem, but every problem is an anomaly. 4. Decision When something changes, the agent asks: Is this expected?Is this risky?Is this harmful?Is this likely a bug or a business change? 
Decisions may involve: Comparing with release eventsReviewing upstream changesCorrelating with other signals 5. Feedback Finally, the system responds: Alerts humansCreates ticketsBlocks pipelinesAuto-corrects where safeUpdates its own baselines The system doesn’t just detect issues — it learns from them. Example in Practice On one platform we worked on, a large event-driven data pipeline fed analytics, personalization, and reporting systems. It ingested data from dozens of upstream services, each evolving at its own pace. Traditional validation covered schemas and basic rules, yet issues still slipped through: fields slowly became sparse, events arrived in unexpected combinations, and values drifted enough to skew downstream metrics without triggering alarms. We introduced a simple agent-like layer that passively observed production data, tracking distributions, null rates, cardinality, and field relationships over time. It built a baseline of what “normal” looked like for each dataset. A few weeks later, the system flagged a subtle change: one event type, normally appearing in 18–20% of sessions, dropped to under 10%, even though no deployment had occurred. The data was technically valid — no schema break, no missing fields — but behavior was off. It turned out an upstream service had quietly changed a filtering condition, removing the event for a large user segment. Without behavioral monitoring, this would have gone unnoticed for weeks. Instead, the team caught it early, fixed it quickly, and avoided a silent distortion of analytics and personalization. The key lesson wasn't that the agent “found a bug.” It noticed a behavioral change before humans knew what to look for. That’s the difference between testing for known failures and watching for unknown ones. Why This Matters More Than Ever As organizations lean into AI, personalization, automation, and real-time decision-making, data is no longer just input — it’s a dependency. Bad data doesn’t just cause bugs; it causes: Biased modelsBroken personalizationMisleading metricsRegulatory exposureLoss of trust And trust, once lost, is expensive to regain. Autonomous quality protects trust at machine speed. This Is Not About Replacing Engineers AI agents do not replace human judgment — they amplify it. Agents handle: ScaleSpeedMonitoringNoise Humans handle: MeaningContextIntentTrade-offs The best systems are partnerships. Agents surface signals. Humans decide what they mean. This isn’t automation replacing people — it’s automation restoring people to the work that actually requires thought. Where This Leads We are gradually watching “quality” evolve from a process into a property of the system: not something you run, but something you build. Just as we expect systems to be observable, resilient, and secure by design, we will soon expect them to be self-aware of their own data health. Autonomous quality isn’t a product — it’s a capability. Like all useful capabilities, it won’t arrive fully formed. It will emerge piece by piece, from teams that stop asking, “How do we test this?” and start asking, “How do we let the system watch itself?” That’s the shift — and it’s already underway.
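To make the earlier example more tangible, here is a minimal sketch of the kind of baseline check that caught the event-rate drop. It is illustrative only: the class name, window size, and z-score threshold are assumptions, and a production agent would track many more signals (null rates, cardinality, distributions) per dataset.

Python
from collections import deque
from statistics import mean, stdev

class RateBaselineMonitor:
    """Tracks the share of sessions containing a given event and flags drift.

    Feed it one observation per window (e.g., hourly): how many sessions were
    seen and how many contained the event. After a warm-up period, each new
    rate is compared against the learned baseline.
    """
    def __init__(self, history_windows: int = 168, z_threshold: float = 4.0, min_history: int = 24):
        self.history = deque(maxlen=history_windows)
        self.z_threshold = z_threshold
        self.min_history = min_history

    def observe(self, sessions: int, sessions_with_event: int) -> dict:
        rate = sessions_with_event / max(sessions, 1)
        result = {"rate": rate, "anomaly": False}
        if len(self.history) >= self.min_history:
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and abs(rate - mu) / sigma > self.z_threshold:
                result.update(anomaly=True, baseline=mu, z=(rate - mu) / sigma)
        self.history.append(rate)
        return result

# Usage: if sessions normally contain the event ~19% of the time and a new
# window drops to 9%, observe() returns anomaly=True along with the z-score.
monitor = RateBaselineMonitor()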

By Sandip Gami
Ralph Wiggum Ships Code While You Sleep. Agile Asks: Should It?
Ralph Wiggum Ships Code While You Sleep. Agile Asks: Should It?

TL; DR: When Code Is Cheap, Discipline Must Come from Somewhere Else Generative AI removes the natural constraint that expensive engineers imposed on software development. When building costs almost nothing, the question shifts from “can we build it?” to “should we build it?” The Agile Manifesto’s principles provide the discipline that these costs are used to enforce. Ignore them at your peril when Ralph Wiggum meets Agile. The Nonsense About AI and Agile Your LinkedIn feed is full of confident nonsense about Scrum and AI. One camp sprinkles "AI-powered" onto Scrum practices like seasoning. They promise that AI will make your Daily Scrum more efficient, your Sprint Planning more accurate, and your Retrospectives more insightful. They have no idea what Scrum is actually for, and AI amplifies their confusion, now more confidently presented. (Dunning-Kruger as a service, so to speak.) The other camp declares Scrum obsolete. AI agents and vibe coding/engineering will render iterative frameworks unnecessary, they claim, because software creation will happen while you sleep at zero marginal cost. Scrum, in their telling, is rigid dogma unfit for a world of autonomous code generation; a relic in the new world of Ralph Wiggum-style AI development. Both camps miss the point entirely. The Expense Gate Ralph Wiggum Eliminates For decades, software development had a natural constraint: engineers were expensive. A team of five developers costs $750,000 or more annually, fully loaded. That expense imposed discipline. You could not afford to build the wrong thing. Every feature required justification. Every iteration demanded focus. The cost was a gate. It forced product decisions. Generative AI removes that gate. Code generation approaches zero marginal cost. Tools like Cursor, Claude, and Codex produce working code in minutes. Vibe coding turns product ideas into functioning prototypes before lunch. The trend is accelerating. Consider the "Ralph Wiggum" technique now circulating on tech Twitter and LinkedIn: an autonomous loop that keeps AI coding agents working for hours without human intervention. You define a task, walk away, and return to find completed features, passing tests, and committed code. The promise is seductive: continuous, autonomous development in which AI iterates on its own work until completion. Geoffrey Huntley, the technique's creator, ran such a loop for three consecutive months to produce a functioning programming language compiler. [1] Unsurprisingly, the marketing writes itself: "Ship code while you sleep." But notice what disappears in this model: Human judgment about what is worth building. Review cycles that catch architectural mistakes. The friction that forces teams to ask whether a feature deserves to exist. As one practitioner observed about these autonomous loops: "A human might commit once or twice a day. Ralph can pile dozens of commits into a repo in hours. If those commits are low quality, entropy compounds fast." [2] The expense gate is gone. The abundance feels liberating. It is also dangerous. Without the expense gate, what prevents teams from running in the wrong direction faster than ever? What stops organizations from generating mountains of features that nobody wants? What enforces the discipline that cost used to provide? The Principles Provide the Discipline The answer is exactly what the Agile Manifesto was designed to provide. Start with the first value: "Working software over comprehensive documentation." 
In an AI world, generating documentation is trivial. Generating working software is trivial. But generating working software that solves actual customer problems remains hard. The emphasis on "working" was never about the code compiling. It was about the software doing something useful. That distinction matters more now, not less. Then there is simplicity: "the art of maximizing the amount of work not done." When engineers cost $150K annually, leaving out features of questionable value saved money. Now that building costs almost nothing, leaving features out requires discipline rather than economics. The product person who asks "should we build this?" instead of "can we build this?" becomes more valuable, not less. "Working software is the primary measure of progress." AI can generate a thousand lines of code per hour. None of those represents progress itself. Instead, progress is measured by working software in users' hands who find it useful. Customer collaboration and feedback loops provide that measurement. Output velocity without validation is a waste at unprecedented scale. And then technical excellence: "Continuous attention to technical excellence and good design enhances agility." This principle now separates survival from failure. The Technical Debt Trap Autonomous AI development produces code that works well enough to ship. The AI generates plausible implementations that pass tests and satisfy immediate requirements. Six months later, the same team discovers the horror beneath the surface. You build it, you ship it, you run it. And now you maintain it. This is "artificial" technical debt compounding at unprecedented rates. The Agile Manifesto called for "sustainable development" and for teams to maintain "a constant pace indefinitely." These were not bureaucratic overhead invented by process enthusiasts. They were survival requirements learned through painful experience. Organizations that abandon these principles because AI makes coding cheap will discover a familiar pattern: initial velocity followed by grinding slowdown. The code that was so easy to generate becomes impossible to maintain. The features that shipped so quickly become liabilities that cannot be safely modified. Technical excellence is not optional in an AI world. It is the difference between a product and a pile of unmaintainable code. The "Should We Build It" Reframe The fundamental question of product development has always been: are we building the right thing? When building was expensive, the expense itself forced that question. Teams could not afford to build everything, so they had to choose. Product people had to prioritize ruthlessly. Stakeholders had to make tradeoffs. Now that building is cheap, the forcing function is gone. Organizations can build everything. Or at least they think they can. The pressure compounds from above. Management and stakeholders are increasingly factoring in faster product delivery enabled by AI capabilities. Late changes that once required difficult conversations now seem costless. Prototypes that once took weeks can appear in hours. The expectation becomes: if AI can build it faster, why are we not shipping more? This pressure makes disciplined product thinking harder precisely when it matters most. The Agile Manifesto's emphasis on "customer collaboration" and "responding to change" exists precisely because requirements emerge through discovery, not specification. Feedback loops with real users matter more when teams can produce working software faster. 
Without those loops, teams generate features in a vacuum, disconnected from the people who must find them valuable. The product person who masters this discipline becomes irreplaceable. The product person who treats the backlog as a parking lot for every idea becomes a liability at scale, approving AI-generated waste faster than ever before. What Stays, What Changes in the Age of Ralph Wiggum & Agile The core feedback loops remain essential: build something small, show it to users, learn from the response, adapt. That rhythm predates any framework. It will outlast whatever comes next. Iteration cycles may compress. If teams can produce meaningful working software in days rather than weeks, shorter cycles make sense. The principle remains: deliver working software frequently. The specific cadence adapts to capability. The challenge function becomes more critical, not less. In effective teams, Developers have always pushed back on product suggestions: "Is this really the most valuable thing we can build to solve our customers' problems?" This tension is healthy. Life is a negotiation, and so is Agile. When AI can generate implementation options in minutes, this challenge function becomes the primary source of discipline. The question shifts from "how long will this take?" to "should we build this at all?" and "how will we know it works?" Customer feedback loops matter more when velocity increases. These loops have always been about closing the gap between what teams build and what customers need, inspecting progress toward meaningful outcomes, and adapting the path when reality contradicts assumptions. When teams can produce more working software faster, these checkpoints become sharper. The question shifts from "look what we built" to "based on what we learned, what should we build next?" Daily coordination adapts in form, not purpose. The goal remains: inspect progress and adapt the plan. Standing in a circle reciting yesterday's tasks has always been useless compared to answering: are we still on track, and what is blocking us? Now, it becomes critical: faster implementation cycles make frequent synchronization more important, not less. Technical discipline becomes survival, not overhead. The harder problem is helping teams maintain quality standards when shipping is frictionless. Practitioners who can spot AI-generated code smell, who insist on meaningful review, who protect quality definitions from erosion under delivery pressure: these people become more valuable. Those who focus primarily on the "process," delivered dogmatically, become redundant. Product accountability becomes the constraint, and that is correct. When implementation is cheap, product decisions become the bottleneck. The person who can rapidly validate assumptions, say no to plausible but valueless features, and maintain focus becomes the team's most critical asset. These are adaptations, not abandonment. The principles survive because they address a permanent problem: building software that solves customer problems in complex environments. AI changes the cost structure. It does not change the problem. We Are Not Paid to Practice Scrum I have said this before, and it applies directly here: we are not paid to practice Scrum. We are paid to solve our customers' problems within the given constraints while contributing to the organization's sustainability. Full disclosure: I earn part of my living training people in Scrum. I have skin in this game. But the game only matters if Scrum actually helps teams deliver value. 
If Scrum helps accomplish your goals, use Scrum. If parts of Scrum no longer serve that goal in your context, adapt. The Scrum Guide itself says Scrum is a framework, not a methodology. It is intentionally incomplete. The "Scrum is obsolete" camp attacks a caricature: rigid ceremonies enforced dogmatically without regard for outcomes. That caricature exists in some organizations. It is not Scrum. It is a bad implementation that the Agile Manifesto warned against in its first value: "Individuals and interactions over processes and tools." The question is not whether to practice Agile by the book. The question is whether your team has the feedback loops, the discipline, and the customer focus to avoid building the wrong thing at AI speed. If you have those things without calling them Agile, fine. Call it whatever you want. The labels do not matter. The outcomes do. If you lack those things, AI will not save you. It will accelerate your failure. Conclusion: Do Not Outsource Your Thinking The tools have changed. The fundamental challenge has not. Building software that customers find valuable, in complex environments where requirements emerge through discovery rather than specification, remains hard. The expense gate is gone, but the need for discipline remains. The Agile Manifesto's principles provide that discipline. They are not relics of a pre-AI world. They are the antidote to AI-accelerated waste. Do not outsource your thinking to AI. The ability to generate code instantly does not answer the question that matters. Just because you could build it, should you? What discipline has replaced the expense gate in your organization? Or has nothing replaced it yet? I am curious. Ralph Wiggum and Agile: The Sources Ralph Wiggum: Autonomous Loops for Claude Code11 Tips For AI Coding With Ralph Wiggum

By Stefan Wolpers DZone Core CORE
ToolOrchestra vs Mixture of Experts: Routing Intelligence at Scale
ToolOrchestra vs Mixture of Experts: Routing Intelligence at Scale

Last year, I came across Mixture of Experts (MoE) through this research paper published in Nature. Later in 2025, Nvidia published a research paper on ToolOrchestra. While reading the paper, I kept thinking about MoE and how ToolOrchestra is similar to or different from it. In this article, you will learn about two fundamental architectural patterns reshaping how we build intelligent systems. We'll explore ToolOrchestra and Mixture of Experts (MoE), understand their inner workings, compare them with other routing-based architectures, and discover how they can work together. What Is Mixture of Experts? Simply put, Mixture of Experts is an architectural pattern that splits a large model into multiple specialized sub-networks called experts. Instead of one monolithic model handling every input, you activate only the experts needed for each specific task. The concept dates back to 1991 with the paper "Adaptive Mixture of Local Experts." The core idea is straightforward: route each input to the most suitable expert, activate only what you need, and keep the rest idle. Mixture of Experts How MoE Works In transformer models, MoE layers typically replace the feedforward layers. These feedforward layers consume most of the compute as models scale. Replace them with MoE, and you get massive efficiency gains. Key components: Gating network – Decides which experts process which tokensExperts – Specialized sub-networks (typically feedforward networks)Load balancing – Ensures no single expert gets overwhelmedSparse activation – Only activates selected experts per token Routing strategies: StrategyDescriptionExample ModelTop-1Each token goes to one expertSwitch TransformerTop-2Each token goes to two expertsGShard, Mixtral 8x7BExpert ChoiceExperts select tokensExpert Choice RoutingSoft RoutingWeighted combination of all expertsSoft MoE What Is ToolOrchestra? ToolOrchestra, introduced by NVIDIA researchers in November 2025, takes a different approach. Instead of splitting one model into parts, it uses a small 8-billion-parameter model to coordinate multiple complete models and tools. Think of it as a conductor leading an orchestra. The orchestrator model analyzes a problem, breaks it down, and calls different "instruments" to solve each piece. ToolOrchestra flow How ToolOrchestra Works The breakthrough is in how it learns to orchestrate. ToolOrchestra uses reinforcement learning with three reward types: Reward structure: Reward TypePurposeFocusOutcomeGetting the right answerCorrectnessEfficiencyUsing cheaper tools when possibleCost optimizationPreferenceRespecting user tool preferencesUser control The training uses a synthetic data pipeline called ToolScale. It automatically generates databases, API schemas, and complex tasks with verified solutions. This gives the orchestrator thousands of examples to learn from through trial and error. Core Differences Let me break down the fundamental differences between these two approaches: AspectMixture of ExpertsToolOrchestraGranularityToken-level routingTask-level routingScopeWithin a single modelAcross multiple systemsComponentsSub-networks (experts)Complete models and toolsTrainingJoint training of all expertsOnly orchestrator trainsActivationSparse parameter activationSelective system invocationMemoryAll experts in memoryTools loaded on demandExternal AccessNo external toolsWeb, APIs, databases The fundamental difference is in what gets split up. MoE splits a single model's parameters into specialized sub-networks. 
All experts live inside one model architecture, sharing the same input and working on the same task at the parameter level. ToolOrchestra splits tasks across different complete systems. The orchestrator is a small, standalone model that coordinates other models and tools. Each tool or model it calls is fully independent, potentially running on different hardware, using different architectures, and even created by different companies. ToolOrchestra vs. MoEs Commonalities and Shared Principles Both architectures attack the same problem: inefficiency. Running a massive model for every task wastes compute and money. MoE and ToolOrchestra both use sparsity and specialization to avoid this waste. Shared design principles: Routing as a core mechanism – Both use learned routing to direct inputs to the right specialistsModularity – Break down monolithic systems into specialized componentsSparsity – Activate only what you need for each inputAutomatic learning – Routing policies are learned, not hardcodedSpecialization over generalization – Focused experts outperform generalists on specific tasks Related Architectural Methods Several other patterns fit into this landscape of modular, routing-based intelligence. Let me walk you through the key ones. Before diving into the specific architectures, I want to mention that I've written extensively about AI infrastructure and optimization techniques. If you're interested in understanding how these architectural patterns work in production environments, check out my article on NVIDIA MIG with GPU Optimization in Kubernetes, which covers how GPU partitioning works similarly to expert routing in MoE systems. Ensemble Methods Ensemble learning combines predictions from multiple models. Unlike MoE, where routing is learned, ensembles often use simpler combination strategies. Common ensemble techniques: TechniqueHow It WorksBest ForBaggingTrain on different data subsetsReducing varianceVotingMajority vote or averagingClassification tasksStackingMeta-learner combines predictionsComplex problemsWeighted AverageLearned weights for each modelRegression tasks Ensemble method architecture The key difference from MoE is that ensemble methods typically run all models for every input, then combine results. MoE only activates selected experts. Ensemble methods are simpler to implement but less efficient. Capsule Networks With Dynamic Routing Capsule Networks, introduced by Geoffrey Hinton in 2017, use a routing mechanism called "routing-by-agreement." While different from MoE and orchestration, capsules share the idea of learned routing. How capsule routing works: Instead of routing tokens to experts, capsules route outputs to higher-level capsules based on agreement. Lower-level capsules send their output vectors to higher-level capsules that "agree" with their predictions. Capsule routing Key concepts: Capsules as vectors: Unlike neurons that output scalars, capsules output vectors. The length represents probability, the direction represents properties.Dynamic routing: Iteratively updates routing coefficients based on agreement between predictionsSpatial relationships: Better at understanding part-whole relationships in images FeatureTraditional CNNCapsule NetworkOutputScalar activationsVector capsulesRoutingFixed poolingDynamic routingSpatial InfoLost through poolingPreserved in vectorsIterationsNone3-5 routing iterations Multi-Agent Neural Systems These architectures organize intelligence as multiple cooperating agents. 
Each agent is a specialized neural network that communicates with others. Agent-based architectures: Modular Graph Neural Networks (ModGNN) – Agents communicate through graph structures for multi-agent coordinationNeural Agent Networks (NAN) – Distributed systems where agents act like neuronsAgentic Neural Networks – Self-evolving systems that optimize both structure and prompts Multi-agent systems Comparison with MoE and Orchestration: ArchitectureCommunicationIndependenceTrainingMoEThrough gatingSub-networksJointToolOrchestraThrough orchestratorFully independentSeparateMulti-AgentPeer-to-peerSemi-independentCan be joint or separate Hierarchical Mixture of Experts HMoE adds layers to the basic MoE pattern. First-level routing decides broad categories, then second-level routing picks specific experts. HMoE routing This pattern is similar to ToolOrchestra's hierarchical potential. Both can build multi-level routing systems. Retrieval-Augmented Generation (RAG) RAG combines language models with retrieval systems. Before generating, the system searches a database for relevant information. RAG architecture: ComponentPurposeSimilar ToQuery EncoderTransform inputMoE gatingRetrieverFind relevant docsTool selectionReader/GeneratorProduce answerExpert activation RAG is closer to orchestration than MoE. The retriever acts like a tool, and the generator coordinates between input and retrieved knowledge. Compound AI Systems Compound AI Systems, as defined by Berkeley AI Research, tackle tasks using multiple interacting components. This is the broad category that includes both orchestration and some ensemble methods. I recently wrote about the Model Context Protocol (MCP), which is Anthropic's approach to standardizing how AI systems connect with external data sources. MCP represents a compound AI pattern where models orchestrate access to various data sources through a universal protocol. The principles align closely with ToolOrchestra's approach to coordinating multiple tools. Characteristics: Multiple model calls in sequence or parallelExternal tools (databases, APIs, code execution)Retrieval and generation combinedMulti-step reasoning chains Examples: Chain-of-Thought systems – Break problems into reasoning stepsReAct (Reasoning + Acting) – Combine reasoning with tool useAutoGPT-style agents – Autonomous task decomposition and execution Neural-Symbolic Architectures These systems integrate neural networks with symbolic reasoning. The neural part handles pattern recognition, the symbolic part handles logical reasoning. Layered architecture: LayerTypeFunctionPerceptionNeuralProcess sensory inputReasoningSymbolicApply logical rulesPlanningHybridCombine both approaches This is similar to orchestration, where different tools have different capabilities. The routing decides whether to use neural or symbolic processing. 
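Before the summary comparison, it may help to see the top-2 gating idea from the MoE section in code. The toy PyTorch layer below is an illustrative sketch only; real systems such as Switch Transformer or Mixtral add load-balancing losses, capacity limits, and far more efficient expert dispatch.

Python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Toy sparse MoE feedforward layer with learned top-2 gating."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Gate scores decide which two experts see each token.
        scores = F.softmax(self.gate(x), dim=-1)          # (tokens, num_experts)
        top_w, top_idx = scores.topk(2, dim=-1)           # top-2 routing
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)   # renormalize the two weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(2):
                mask = top_idx[:, slot] == e              # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
layer = Top2MoELayer(d_model=64, d_hidden=256)
print(layer(tokens).shape)  # torch.Size([16, 64])

The key property to notice is that each token only pays for two expert forward passes, no matter how many experts exist, which is exactly the sparsity argument made above.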
Comparison Table: All Architectures ArchitectureRouting TypeComponentsTrainingBest Use CaseMoEToken-level, learnedSub-networksJointParameter efficiencyToolOrchestraTask-level, learnedIndependent systemsOrchestrator onlyFlexible compositionEnsembleNo routing / simpleIndependent modelsSeparateReducing varianceCapsule NetworksAgreement-basedVector capsulesJointSpatial relationshipsMulti-AgentPeer communicationAutonomous agentsJoint or separateComplex coordinationHMoEMulti-level, learnedHierarchical expertsJointNested specializationRAGQuery-basedRetriever + GeneratorSeparateKnowledge groundingCompound AIMulti-stepChains of componentsMixedComplex workflows Implementation Considerations When building systems with these architectures, keep these points in mind: When to use MoE: Training massive models from scratchNeed parameter efficiency at inferenceSingle model deployment preferredHave expertise in distributed training When to use ToolOrchestra: Building applications with existing modelsNeed to swap components frequentlyWant to use external tools (APIs, databases)Prefer faster iteration and easier maintenance When to use ensembles: Have multiple trained modelsWant simple implementationCan afford running all modelsNeed variance reduction When to use multi-agent: Complex coordination neededAgents should learn from each otherReal-time communication requiredDistributed decision making For those interested in the infrastructure side of deploying these architectures, I've written several articles that might help. My piece on Multizone Kubernetes and VPC Load Balancer Setup shows how to deploy distributed systems across zones, which is similar to how you'd deploy multiple experts or orchestrated models. I've also published guides on DZone about cloud infrastructure and automation that apply directly to deploying these kinds of systems. Conclusion The trend is clear: break things into specialized components, learn to route intelligently, activate only what you need. Whether that specialization happens inside a model through MoE, across models through orchestration, or through multi-agent coordination, the principle holds. Emerging patterns: Multi-level routing – Orchestration at application level, MoE at model level, capsule-like routing for spatial featuresDynamic expert creation – Models that spawn new experts as needed based on task distributionCross-architecture routing – Systems that route between fundamentally different architectures (transformers, RNNs, symbolic systems)Learned cost functions – Systems that optimize for user-specific cost/quality tradeoffsFederated orchestration – Orchestrators coordinating models across different organizations Research directions: Better routing algorithms that generalize across domainsAutomatic architecture search for routing patternsEfficient training methods for sparse systemsTheoretical understanding of when routing helpsCombining symbolic and neural routing Key Takeaways If you've made it this far, here's what you should remember: MoE splits parameters, ToolOrchestra splits tasks, Ensembles split predictions, and multi-agents split responsibilities. All use routing, but at different levels and for different purposes.They complement each other. An orchestrator can coordinate MoE models. Capsule networks can use MoE-style experts. Multi-agent systems can use orchestration for high-level coordination. The combinations are endless.The future is modular. Neither approach alone is the answer. 
The next generation of systems will use multiple levels of routing and specialization working together.
Start small, scale up. You don't need to build everything at once. Start with simple routing logic. Add specialized components. Layer on complexity as you learn what works.

This isn't just about saving compute. It's about making intelligence more accessible, more controllable, and more aligned with how we actually want to use it. Breaking monolithic systems into specialized, coordinated components is how we'll build the next generation of AI. For those just starting their AI journey, my Awesome-AI guide provides a comprehensive roadmap for mastering machine learning and deep learning, which forms the foundation for understanding these advanced architectures.

Further reading:
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration (Su et al., NVIDIA, November 2025) — The foundational paper introducing orchestration with RL
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Fedus et al., Google Research, 2021) — Pioneering work on MoE scaling
Dynamic Routing Between Capsules (Sabour et al., Hinton, 2017) — Introduction to capsule networks and routing-by-agreement
The Shift from Models to Compound AI Systems (Zaharia et al., Berkeley AI Research, 2024) — Defining compound AI systems
Compound AI Systems Optimization: A Survey (Lee et al., 2025) — Comprehensive survey of optimization methods
Optimizing Model Selection for Compound AI Systems (Chen et al., Stanford/Berkeley, 2025) — LLMSelector framework
Towards Resource-Efficient Compound AI Systems (Chaudhry et al., 2025) — Resource optimization approaches

By Vidyasagar (Sarath Chandra) Machupalli FBCS DZone Core CORE
Essential Techniques for Production Vector Search Systems, Part 3: Filterable HNSW
Essential Techniques for Production Vector Search Systems, Part 3: Filterable HNSW

After implementing vector search systems at multiple companies, I wanted to document efficient techniques that can be very helpful for successful production deployments of vector search systems. I want to present these techniques by showcasing when to apply each one, how they complement each other, and the trade-offs they introduce. This will be a multi-part series that introduces all of the techniques one by one in each article. I have also included code snippets to quickly test each technique. Before we get into the real details, let us look at the prerequisites and setup. For ease of understanding and use, I am using the free cloud tier from Qdrant for all of the demonstrations below. Steps to Set Up Qdrant Cloud Step 1: Get a Free Qdrant Cloud Cluster Sign up at https://cloud.qdrant.io.Create a free cluster Click "Create Cluster."Select Free Tier.Choose a region closest to you.Wait for the cluster to be provisioned.Capture your credentials. Cluster URL: https://xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.us-east.aws.cloud.qdrant.io:6333.API Key: Click "API Keys" → "Generate" → Copy the key. Step 2: Install Python Dependencies PowerShell pip install qdrant-client fastembed numpy Recommended versions: qdrant-client >= 1.7.0fastembed >= 0.2.0numpy >= 1.24.0python-dotenv >= 1.0.0 Step 3: Set Environment Variables or Create a .env File PowerShell # Add to your ~/.bashrc or ~/.zshrc export QDRANT_URL="https://your-cluster-url.cloud.qdrant.io:6333" export QDRANT_API_KEY="your-api-key-here" Create a .env file in the project directory with the following content. Remember to add .env to your .gitignore to avoid committing credentials. PowerShell # .env file QDRANT_URL=https://your-cluster-url.cloud.qdrant.io:6333 QDRANT_API_KEY=your-api-key-here Step 4: Verify Connection We can verify the connection to the Qdrant collection with the following script. From this point onward, I am assuming the .env setup is complete. Python from qdrant_client import QdrantClient from dotenv import load_dotenv import os # Load environment variables from .env file load_dotenv() # Initialize client client = QdrantClient( url=os.getenv("QDRANT_URL"), api_key=os.getenv("QDRANT_API_KEY"), ) # Test connection try: collections = client.get_collections() print(f" Connected successfully!") print(f" Current collections: {len(collections.collections)}") except Exception as e: print(f" Connection failed: {e}") print(" Check your .env file has QDRANT_URL and QDRANT_API_KEY") Expected output: Plain Text python verify-connection.py Connected successfully! Current collections: 2 Now that we have the setup out of the way, we can get into the meat of the article. Before the deep dive into filterable HNSW, let us look at a high-level overview of the techniques we are about to cover in this multi-part series. 
Technique | Problems solved | Performance impact | Complexity
Hybrid Search | Pure semantic search will miss exact matches. | Large accuracy gains, close to 16% | Medium
Binary Quantization | Memory costs scale linearly with data. | 40X memory reduction, 15% faster | Low
Filterable HNSW | Post-filtering wastes computation. | 5X faster filtered queries | Medium
Multi Vector Search | A single embedding cannot capture the importance of different fields. | Handles queries across fields such as title vs. description, at roughly twice the storage | Medium
Reranking | Vector search is optimized for speed over precision. | Deeper semantic understanding, 15-20% ranking improvement | High

Keep in mind that production systems typically combine two to four of these techniques. For example, a typical e-commerce website might use hybrid search, binary quantization, and filterable HNSW. We covered Hybrid Search in the first part of the series and Binary Quantization in the second part. In this part, we will dive into filterable HNSW.

Filterable HNSW

To understand how filterable HNSW is advantageous, let us look at how traditional filtering approaches, whether pre- or post-filter, waste computation. Post-filtering discards 90% of retrieved results, whereas pre-filtering reduces the search space so much that vector similarity becomes less significant. That is where filterable HNSW comes in handy, as it applies filters during the HNSW graph traversal. In other words, the algorithm navigates only through graph nodes that satisfy filter conditions. With components such as payload indexes (fast lookup structures for filterable fields), filter-aware traversal (HNSW navigation skips non-matching nodes), and dynamic candidate expansion (automatically fetch more candidates when filters are restrictive), filterable HNSW is the way to go. Let us take a look at it in more detail with the code below.

Python
"""
Example usage of the filterable_hnsw module.

This demonstrates how to use Filterable HNSW with your own Qdrant collection.
""" from filterable_hnsw import ( filterable_search, compare_filtered_unfiltered, display_filtered_results, get_qdrant_client ) from dotenv import load_dotenv import os load_dotenv() # Initialize client client = get_qdrant_client() # Your collection name COLLECTION_NAME = "automotive_parts" # Change this to your collection name # Example 1: Filtered search print("=" * 80) print("EXAMPLE 1: Filtered Search (Filterable HNSW)") print("=" * 80) print("Searching: 'engine sensor' with category filter") print("Expected: Finds semantically similar parts within the specified category\n") query1 = "engine sensor" # First get unfiltered results to see what categories exist unfiltered_test1 = filterable_search( collection_name=COLLECTION_NAME, query=query1, filter_conditions=None, client=client, limit=1 ) # Extract category from first result if available if unfiltered_test1 and 'category' in unfiltered_test1[0]['payload']: actual_category1 = unfiltered_test1[0]['payload']['category'] filter1 = {"category": actual_category1} print(f"Using category from data: '{actual_category1}'\n") else: filter1 = {"category": "Engine Components"} # Fallback filtered_results = filterable_search( collection_name=COLLECTION_NAME, query=query1, filter_conditions=filter1, client=client, limit=5 ) display_filtered_results( filtered_results, query1, show_fields=['part_name', 'part_id', 'category', 'description'] ) print("\n\n") # Example 2: Comparison between Filterable HNSW and Post-Filtering print("=" * 80) print("EXAMPLE 2: Filterable HNSW vs Post-Filtering Comparison") print("=" * 80) print("Comparing filtering DURING traversal vs filtering AFTER retrieval") print("Expected: Shows Filterable HNSW is more efficient (no wasted computation)\n") query2 = "brake system" # First get unfiltered results to see what categories exist unfiltered_test2 = filterable_search( collection_name=COLLECTION_NAME, query=query2, filter_conditions=None, client=client, limit=1 ) # Extract category from first result if available if unfiltered_test2 and 'category' in unfiltered_test2[0]['payload']: actual_category2 = unfiltered_test2[0]['payload']['category'] filter2 = {"category": actual_category2} print(f"Using category from data: '{actual_category2}'\n") else: filter2 = {"category": "Braking System"} # Fallback comparison = compare_filtered_unfiltered( collection_name=COLLECTION_NAME, query=query2, filter_conditions=filter2, client=client, limit=5 ) print("\n\n") # Example 3: Display detailed comparison print("=" * 80) print("EXAMPLE 3: Detailed Result Comparison") print("=" * 80) print("Top results from both methods:\n") print("Post-Filtered Results (Top 3):") print("-" * 80) for i, result in enumerate(comparison["post_filtered"]["results"][:3], 1): payload = result["payload"] name = payload.get('part_name', payload.get('name', 'Unknown')) category = payload.get('category', 'N/A') print(f"{i}. {name}") print(f" Category: {category}") print(f" Score: {result['score']:.4f}") print(f" ID: {result['id']}") print("\nFilterable HNSW Results (Top 3):") print("-" * 80) for i, result in enumerate(comparison["filtered"]["results"][:3], 1): payload = result["payload"] name = payload.get('part_name', payload.get('name', 'Unknown')) category = payload.get('category', 'N/A') print(f"{i}. 
{name}") print(f" Category: {category}") print(f" Score: {result['score']:.4f}") print(f" ID: {result['id']}") print("\n" + "=" * 80) print("SUMMARY:") print("=" * 80) print("Filterable HNSW:") print(" - Filters DURING graph traversal (not before or after)") print(" - Only navigates through nodes that satisfy filter conditions") print(" - No wasted computation - doesn't retrieve then discard results") print(" - More efficient than post-filtering which wastes >90% computation") print(f" - In this example: {comparison['overlap_ratio']*100:.1f}% result overlap") Let us now look at the Filterable HNSW in action with the implementation output Plain Text ================================================================================ EXAMPLE 1: Filtered Search (Filterable HNSW) ================================================================================ Searching: 'engine sensor' with category filter Expected: Finds semantically similar parts within the specified category Using category from data: 'Safety Systems' Filtered Search Results for: 'engine sensor' ================================================================================ Found 5 results 1. Safety Sensor Module 237 Part_name: Safety Sensor Module 237 Part_id: DEL-0000237 Category: Safety Systems Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea... Score: 0.4092 -------------------------------------------------------------------------------- 2. Safety Sensor Module 240 Part_name: Safety Sensor Module 240 Part_id: BOS-0000240 Category: Safety Systems Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea... Score: 0.4052 -------------------------------------------------------------------------------- 3. Safety Sensor Module 242 Part_name: Safety Sensor Module 242 Part_id: VAL-0000242 Category: Safety Systems Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea... Score: 0.4004 -------------------------------------------------------------------------------- 4. Safety Sensor Module 246 Part_name: Safety Sensor Module 246 Part_id: CON-0000246 Category: Safety Systems Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea... Score: 0.3983 -------------------------------------------------------------------------------- 5. Safety Sensor Module 234 Part_name: Safety Sensor Module 234 Part_id: ZF-0000234 Category: Safety Systems Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea... Score: 0.3978 -------------------------------------------------------------------------------- ================================================================================ EXAMPLE 2: Filterable HNSW vs Post-Filtering Comparison ================================================================================ Comparing filtering DURING traversal vs filtering AFTER retrieval Expected: Shows Filterable HNSW is more efficient (no wasted computation) Using category from data: 'Braking System' Comparing Filterable HNSW vs Post-Filtering for: 'brake system' Filters: {'category': 'Braking System'} ================================================================================ 1. Post-Filtering (Inefficient) Retrieves many results, then filters AFTER retrieval -------------------------------------------------------------------------------- 2. 
2. Filterable HNSW (Efficient)
   Filters DURING graph traversal - only navigates matching nodes
--------------------------------------------------------------------------------

================================================================================
COMPARISON SUMMARY
================================================================================

Post-Filtering (Traditional Approach):
  Time: 126.94 ms
  Results: 5
  Approach: Retrieves 50 candidates, discards 45
  Top Score: 0.6419

Filterable HNSW:
  Time: 79.26 ms
  Results: 5
  Approach: Only navigates through nodes matching filter conditions
  Top Score: 0.6419

Overlap:
  Common Results: 5 / 5 (100.0%)

Filterable HNSW is 1.60x faster

Key Difference:
  Post-Filtering: Wastes computation by retrieving and discarding results
  Filterable HNSW: Filters during graph traversal - no wasted computation
================================================================================

================================================================================
EXAMPLE 3: Detailed Result Comparison
================================================================================
Top results from both methods:

Post-Filtered Results (Top 3):
--------------------------------------------------------------------------------
1. Brake Control Component 168
   Category: Braking System
   Score: 0.6419
   ID: 1794233379
2. Brake Control Component 154
   Category: Braking System
   Score: 0.6396
   ID: 3151300734
3. Brake Control Component 176
   Category: Braking System
   Score: 0.6394
   ID: 1517692434

Filterable HNSW Results (Top 3):
--------------------------------------------------------------------------------
1. Brake Control Component 168
   Category: Braking System
   Score: 0.6419
   ID: 1794233379
2. Brake Control Component 154
   Category: Braking System
   Score: 0.6396
   ID: 3151300734
3. Brake Control Component 176
   Category: Braking System
   Score: 0.6394
   ID: 1517692434

================================================================================
SUMMARY:
================================================================================
Filterable HNSW:
  - Filters DURING graph traversal (not before or after)
  - Only navigates through nodes that satisfy filter conditions
  - No wasted computation - doesn't retrieve then discard results
  - More efficient than post-filtering which wastes >90% computation
  - In this example: 100.0% result overlap

Benefits

As the results show, filterable HNSW is computationally efficient, running 1.6 times faster in this example. There is also no wasted computation: post-filtering retrieved 50 items and discarded 45 of them, whereas filterable HNSW only navigated nodes matching the "Braking System" category. Result quality is preserved as well, as the overlap shows (all 5 results are identical between the two methods).

Costs

Filterable HNSW requires payload indexes on the filterable fields, in our case category, supplier, and in_stock. For a million parts, that adds a minimum of roughly 6% memory overhead. There is also a maintenance cost: every newly indexed part must update the payload indexes. Keep in mind as well that complex OR conditions may degrade filtering performance. Finally, payload indexes are kept in RAM for fast access, so this memory must be accounted for in capacity planning.
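For reference, here is a minimal sketch of what the underlying qdrant-client calls look like: creating the payload indexes mentioned above and running a query whose filter is applied during HNSW traversal. This is an illustrative sketch, not the filterable_hnsw module used in the examples; it assumes the automotive_parts collection from above and that its vectors were produced with fastembed's default text embedding model.

Python
# Hypothetical sketch: payload indexing and filter-aware search with raw qdrant-client.
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, PayloadSchemaType
from fastembed import TextEmbedding
from dotenv import load_dotenv
import os

load_dotenv()
client = QdrantClient(url=os.getenv("QDRANT_URL"), api_key=os.getenv("QDRANT_API_KEY"))

COLLECTION_NAME = "automotive_parts"  # collection name assumed from the examples above

# 1. One-time setup: payload indexes for the filterable fields.
client.create_payload_index(COLLECTION_NAME, field_name="category", field_schema=PayloadSchemaType.KEYWORD)
client.create_payload_index(COLLECTION_NAME, field_name="supplier", field_schema=PayloadSchemaType.KEYWORD)
client.create_payload_index(COLLECTION_NAME, field_name="in_stock", field_schema=PayloadSchemaType.BOOL)

# 2. Embed the query text and search with the filter applied during graph traversal.
model = TextEmbedding()  # downloads a small default embedding model on first use
query_vector = list(model.embed(["brake system"]))[0].tolist()

hits = client.search(
    collection_name=COLLECTION_NAME,
    query_vector=query_vector,
    query_filter=Filter(must=[FieldCondition(key="category", match=MatchValue(value="Braking System"))]),
    limit=5,
)
for hit in hits:
    print(hit.id, round(hit.score, 4), hit.payload.get("part_name"))

The same Filter object also accepts should and must_not clauses for more complex conditions, though, as noted above, heavy use of OR-style conditions can reduce the benefit.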
When to Use

• When the results are frequently filtered
• When the filters are selective (reduce results by more than 50%)
• When the data has categorical/structured metadata

When Not to Use

• When filters are rarely used
• When filters are not selective (remove less than 20% of results)
• Very small datasets (fewer than 10,000 items)

Efficiency Comparison

Approach | Candidates Retrieved | Results Returned | Wasted Work | CPU Efficiency
Post-Filtering | 50 | 5 | 45 (90%) | 10% efficient
Filterable HNSW | 5 | 5 | 0 (0%) | 100% efficient

Performance Characteristics

Based on the results, let us now look at the performance characteristics.

Metric | Post-Filtering | Filterable HNSW | Evidence From the Data
Query Latency | 126.94 ms | 79.26 ms | 1.6x faster
Wasted Computation | 90% | 0% | No wasted computation by filterable HNSW
Result Quality | 0.6419 (top score) | 0.6419 (top score) | 100% overlap
Memory Overhead | Baseline | +5-10% | Payload indexes for category and other fields
Scalability | Degrades with selectivity | Constant performance | The more selective the filter, the bigger the speedup for filterable HNSW

Conclusion

We have looked at the concept and the results for filterable HNSW and concluded that the more selective the filters, the greater the benefit. The bottom line is that if more than 30% of your queries apply filters, then unlike the previous two techniques discussed in the series, filterable HNSW is essentially pure gain, with only minor overheads. In the next part of the series, we will look at multi-vector search and its advantages and disadvantages.

By Pavan Vemuri
5 Technical Strategies for Scaling SaaS Applications

Growing a business is every owner’s dream — until it comes to technical scaling. This is where challenges come to the surface. They can be related to technical debt, poor architecture, or infrastructure that can’t handle the load. In this article, I want to take a closer look at the pitfalls of popular SaaS scaling strategies, drawing from my personal experience. I’ll share lessons learned and suggest practices that can help you navigate these challenges more effectively.

1. Horizontal Application Scaling

Horizontal scaling is usually the default strategy once an app reaches moderate traffic. Most SaaS apps run on cloud infrastructure, so spinning up extra instances via auto-scaling is easy. But in many cases, horizontal scaling alone is not enough.

I worked on a SaaS project that provided real-time analytics dashboards for e-commerce stores. As we started scaling, the system ran into performance issues. The dashboards were making a lot of requests to the sales data, and the underlying database was reaching its CPU and I/O limits. Adding more app instances only generated more read requests, worsening the problem.

To solve this, we combined horizontal scaling of the app servers with read replicas for the database and caching for frequently accessed dashboard data. This way, the app could serve more concurrent users, and the database wasn’t overwhelmed. At the same time, we still took advantage of horizontal scaling to handle traffic spikes.

So even if you use proven approaches, remember that scaling a SaaS app requires more than simply adding servers. You must also coordinate strategies across databases, background jobs, and caching layers.

2. Tenant-Based Resource Isolation

Multi-tenant resource isolation is another critical strategy for SaaS scaling. While it may seem obvious that all customers share the same system resources, problems often arise when usage patterns vary significantly across tenants. Designing a multi-tenant architecture is challenging on its own, and it becomes even harder when clients have different needs.

For example, in one project, I encountered a situation where a few large customers ran campaigns that triggered hundreds of background jobs simultaneously. Even with auto-scaling and sufficient app servers, these tenants consumed most of the queue and CPU resources. We implemented per-tenant limits on concurrent jobs and queue partitions, with dedicated worker pools for heavy tenants. This ensured that high-usage customers could run their campaigns without affecting others’ experience.

I also recommend setting up continuous monitoring of tenant behavior and adjusting limits as needed, so no single customer can impact the experience of others.

3. Independent Scaling of Components

The main challenge of this approach is maintaining a clear separation of components and responsibilities. Independent component scaling is most effective when workloads have very different characteristics. For smaller systems with uniform traffic, the added operational complexity may not be worth it.

The best way to implement independent scaling is to decouple each part of your system so that changes in one component don’t force changes in others. Give each component its own deployment pipeline and implement fault isolation so failures don’t cascade across the system.

I often see teams rely solely on CPU or memory usage to decide what to scale. In my experience, it’s far more effective to consider workload-specific metrics such as queue length, requests per second, or processing rate. These metrics directly reflect real demand.

4. API Integrations

SaaS apps typically rely on external APIs for payments, notifications, analytics, or third-party integrations. Scaling requires making these integrations reliable, non-blocking, and resilient under load. If you didn’t adopt an API-first design early on, this can be challenging. Here are several best practices.

First, move third-party API calls to background jobs. External services are often slow or unpredictable, so offloading these calls keeps user-facing requests fast and allows retries and error handling to happen asynchronously.

Next, implement retries with exponential backoff and circuit breakers. This prevents temporary failures from cascading through your system and overwhelming queues or downstream services.

It’s also important to cache responses when appropriate. If an API returns relatively static data, caching reduces unnecessary calls and conserves API quotas.
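To make the retry and circuit-breaker recommendation concrete, here is a minimal Python sketch of the pattern as it might run inside a background job. The endpoint URL, thresholds, and backoff timings are illustrative assumptions rather than values from any specific project, and a production system would more likely lean on a resilience library or the job framework's built-in retry support.

Python
# Sketch: retries with exponential backoff plus a basic circuit breaker for an
# external API call. Endpoint, thresholds, and timings are hypothetical.
import random
import time
import requests

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def allow(self):
        # Allow calls while closed, or once the reset timeout has elapsed (half-open).
        if self.opened_at is None:
            return True
        return (time.monotonic() - self.opened_at) >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker()

def call_payment_api(payload, max_attempts=5):
    """Call a third-party API with exponential backoff and a circuit breaker."""
    if not breaker.allow():
        raise RuntimeError("Circuit open: skipping call to payment provider")
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post("https://api.example-payments.com/charge",
                                 json=payload, timeout=5)
            resp.raise_for_status()
            breaker.record_success()
            return resp.json()
        except requests.RequestException:
            breaker.record_failure()
            if attempt == max_attempts or not breaker.allow():
                raise
            # Exponential backoff with jitter: ~1s, 2s, 4s, ... capped at 30s.
            time.sleep(min(2 ** (attempt - 1), 30) * random.uniform(0.5, 1.5))

The jitter on each sleep spreads retries from many workers over time, so a recovering dependency is not hit by a synchronized burst of traffic.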
5. Introducing AI

When discussing modern scaling strategies, we can’t ignore AI. AI tools can help scale engineering capacity and improve system quality at the same time. Many businesses now use AI-assisted workflows to improve code quality, testing, and deployment.

In my experience, AI can be a major help. As systems grow, codebases become more complex. AI can analyze code, identify overly complex functions or duplicated logic, and suggest refactoring before technical debt accumulates.

I’ve found AI particularly useful for testing, which is often a bottleneck when scaling. My team uses GitHub Copilot to generate tests for recent code changes, helping us maintain coverage without writing every test manually. That said, it’s important to remember AI’s limitations. Always combine AI-generated tests with human review for edge cases, and regularly check coverage to ensure nothing is missed.

Final Thoughts

It’s important not to fall into the trap of default solutions. Every SaaS application presents unique scaling challenges, and success depends on adapting well-known practices with techniques and technologies that fit your context. Some applications struggle with database scaling, others with API performance, and still others with operational complexity or team coordination. The key is to identify your system’s real bottlenecks and build strategies that address them directly.

By Mykhailo Kopyl
