Coding Resources

DZone's Featured Coding Resources

AI Paradigm Shift: Analytics Without SQL

By Haricharan Shivram Suresh Chandra Kumar

The idea of “asking data questions in plain English” has been around for a while, but most implementations never made it into production in a serious way. The usual reason is not the language model itself but everything around it: security boundaries, schema ambiguity, cost control, and the fact that analytics systems are rarely clean enough for unconstrained natural language to work reliably.

What has changed in the last couple of years is not that natural language is suddenly perfect, but that data platforms have started bringing computation, metadata, and AI into the same controlled environment. One example of this approach is the way agents are being built directly inside data warehouses like Snowflake. The important detail is not the brand itself, but the architectural pattern: the model, the data, and the execution layer sit together rather than being stitched across multiple systems.

That shift changes how analytics tools are designed. Instead of building external “AI layers” on top of a warehouse, teams are embedding the agent logic inside the warehouse itself using tools like Snowpark and managed LLM services such as Snowflake Cortex. The result is a system where natural language is just another input format, not a separate application tier.

From Dashboards to Agent-Driven Querying

Traditional analytics workflows are structured around predefined models: dashboards, semantic layers, and curated datasets. A user question is usually translated into one of these prebuilt views. If the question does not fit, someone writes SQL or updates the dashboard.

Agent-based systems invert this flow. Instead of forcing questions into predefined structures, they attempt to generate the structure dynamically.
At a high level, the flow looks like this:

1. A user submits a natural language question
2. The system enriches the prompt with schema and access context
3. A model generates SQL or an execution plan
4. The query runs inside the warehouse
5. Results are returned in a structured form or visualized output

The key difference from earlier “text-to-SQL” experiments is that steps two and three are tightly grounded in the database context. The model is not guessing a schema from generic training data. It is being provided with actual table definitions, column descriptions, and sometimes usage statistics.

This context injection is what makes the system usable in production. Without it, SQL generation tends to fail in subtle ways: incorrect joins, wrong aggregations, or hallucinated columns.

Agent Architecture Inside the Warehouse

A practical implementation of an analytics agent inside a warehouse typically has three layers.

1. Context and Permission Layer

Before any model is called, the system resolves what the user is allowed to see. This includes:

- Role-based access control
- Row-level filters
- Column masking rules
- Available schemas and tables

This step is often underestimated, but it is what makes the system safe enough for real usage. Without it, natural language becomes a bypass mechanism for data access control.

2. Language Model Translation Layer

Once context is assembled, the prompt is passed into an LLM hosted within the data platform. In Snowflake’s case, this is handled through Cortex services, but the pattern is not unique to any vendor.
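To make the context step concrete, here is a minimal sketch of how a permitted schema could be assembled into a prompt before any model is called. It is not Snowflake's or Cortex's API: the `Column`, `Table`, `visible_schema`, and `build_prompt` names, the sample catalog, and the role names are all hypothetical. It also simplifies masking by hiding masked columns from the prompt entirely, whereas a real warehouse would still enforce its own policies at execution time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    description: str = ""
    masked_for: frozenset = frozenset()  # roles that must not see this column


@dataclass(frozen=True)
class Table:
    name: str
    columns: tuple
    allowed_roles: frozenset


def visible_schema(tables, role):
    """Return only the tables and columns the given role may reference."""
    visible = []
    for table in tables:
        if role not in table.allowed_roles:
            continue
        columns = tuple(c for c in table.columns if role not in c.masked_for)
        if columns:
            visible.append(Table(table.name, columns, table.allowed_roles))
    return visible


def build_prompt(question, tables, role):
    """Assemble the model prompt from the permitted schema and the question."""
    lines = ["Translate the question into SQL.",
             "Use only the tables and columns listed below."]
    for table in visible_schema(tables, role):
        lines.append(f"TABLE {table.name}")
        for c in table.columns:
            lines.append(f"  {c.name} {c.dtype} -- {c.description}")
    lines.append(f"QUESTION: {question}")
    return "\n".join(lines)


# Hypothetical catalog: one shared table with a masked column, one admin-only table
CATALOG = [
    Table("sales.fact_sales", (
        Column("region", "VARCHAR", "Sales region"),
        Column("revenue", "NUMBER", "Gross revenue in USD"),
        Column("customer_email", "VARCHAR", "Buyer email",
               masked_for=frozenset({"analyst"})),
    ), allowed_roles=frozenset({"analyst", "admin"})),
    Table("hr.salaries", (
        Column("employee_id", "NUMBER", "Employee key"),
    ), allowed_roles=frozenset({"admin"})),
]

prompt = build_prompt("Top regions by revenue?", CATALOG, role="analyst")
print(prompt)
```

Because the model only ever sees what the role is allowed to reference, a hallucinated or restricted column is less likely to appear in the generated SQL in the first place.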
The model’s job is not just to produce SQL but to produce SQL that is:

- Syntactically valid
- Aligned with schema constraints
- Consistent with security rules
- Optimized for warehouse execution patterns

For example, a question like, “Show top 10 products by revenue in Q1 2024 grouped by region,” might become:

SQL

SELECT region, product_name, SUM(revenue) AS total_revenue
FROM sales.fact_sales
WHERE transaction_date >= '2024-01-01'
  AND transaction_date < '2024-04-01'
GROUP BY region, product_name
ORDER BY total_revenue DESC
LIMIT 10;

The challenge here is not generating SQL that looks correct, but ensuring it respects business definitions. Revenue, for example, might need to be net of returns or adjusted for currency conversion, depending on the organization.

3. Execution and Governance Layer

Once SQL is generated, it is executed inside the warehouse engine. This is where the architecture becomes important: nothing leaves the system. The same security policies that apply to human-written queries apply here as well.

Because execution happens inside the warehouse, audit logs remain consistent. Every agent action can be traced as a query event, which is important for compliance-heavy environments.

Why Snowpark Matters in This Setup

Tools like Snowpark extend this model beyond SQL generation. Instead of limiting the agent to query rewriting, Snowpark allows it to execute Python-based logic directly next to the data. This becomes useful when the question is not purely relational. For example: “Forecast next month’s sales for product X.”

A simple SQL query cannot answer this. The agent can instead generate a Snowpark Python job that:

- Extracts historical time series data
- Converts it into a DataFrame
- Applies a forecasting model such as ARIMA or Prophet
- Writes predictions back into a table

The important point is that the data is never exported to an external notebook environment. The compute moves to the data, not the other way around.
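The shape of such a job can be sketched in plain Python. This is only an illustration under stated assumptions: it does not use the Snowpark API, a least-squares linear trend stands in for ARIMA or Prophet, and the `fit_linear_trend` and `forecast_next` names and the sample figures are hypothetical. In a real job the history would be read from a warehouse table and the prediction written back to one.

```python
from statistics import mean


def fit_linear_trend(values):
    """Fit y = intercept + slope * x by least squares, with x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(values)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def forecast_next(values):
    """Extrapolate the fitted trend one period past the last observation."""
    slope, intercept = fit_linear_trend(values)
    return intercept + slope * len(values)


# Hypothetical monthly sales for one product (stand-in for the extracted series)
monthly_sales = [100, 110, 120, 130]
prediction = forecast_next(monthly_sales)
print(prediction)
```

The planner's job is to recognize that a question needs this kind of procedural step at all; which model is applied inside the job is a separate decision.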
This pattern also applies to machine learning inference. Pretrained models can be registered as user-defined functions, and the agent can call them like regular SQL functions:

SQL

SELECT feedback_text,
       predict_sentiment(feedback_text) AS sentiment_score
FROM customer_feedback;

From a systems perspective, the agent becomes a planner rather than just a translator. It decides whether SQL is sufficient or whether a Python-based workflow is required.

The Streamlit Layer: Turning Queries Into Applications

While the warehouse handles computation and the agent handles reasoning, users still need an interface. One of the simpler ways to build this layer is with Streamlit.

Streamlit is often used because it reduces the overhead of building internal analytics tools. Instead of designing full frontend systems, teams can wrap agent logic into lightweight interactive apps. A minimal pattern looks like this:

Python

import streamlit as st

st.title("Data Agent Interface")

query = st.text_input("Ask a question about your data")

if query:
    # run_agent is the agent entry point; it is not defined in this snippet
    result = run_agent(query)

    st.subheader("Generated SQL")
    st.code(result["sql"], language="sql")

    st.subheader("Results")
    st.dataframe(result["data"])

In more mature setups, the Streamlit layer becomes more than a query box. It evolves into a dynamic dashboard generator:

- Charts are generated based on query intent
- Filters are derived from schema metadata
- Results can be saved into reusable views
- Users can refine queries conversationally

This reduces dependency on static dashboards, which are often slow to update and hard to maintain.

Governance Is the Real Constraint, Not AI Capability

A common misconception is that the main challenge in building these systems is model accuracy. In practice, governance is the harder problem. Three constraints usually define whether an agent system is viable:

- Data access control must remain intact. Natural language cannot become a bypass layer for restricted data.
- Query cost must be predictable. Poorly generated queries can become expensive quickly in large warehouses.
- Results must be reproducible. Two identical questions should not produce different interpretations unless the underlying data changes.

Warehouse-native architectures help with this because they centralize execution. There is no separate “AI data layer” that can drift from governance rules.

Limitations of the Current Approach

Despite progress, these systems are not fully autonomous analytics engines. There are still recurring issues:

- Ambiguous business definitions lead to incorrect aggregations
- Complex joins across poorly modeled schemas still fail frequently
- LLMs may overgeneralize metrics like revenue or churn
- Latency increases when multi-step reasoning is required

In practice, most teams deploy agents as assistants rather than replacements for BI systems. They are good at exploration, not final reporting.

Closing Thoughts

What is emerging is not a replacement for SQL or dashboards, but a new interface layer on top of them. Natural language becomes a routing mechanism that decides how to query or compute over structured data.

The interesting architectural shift is that the intelligence layer is moving closer to the data itself. Whether implemented in Snowflake or other platforms, the pattern is consistent: context-aware models, governed execution, and embedded compute through tools like Snowpark and Cortex. Streamlit or similar tools then complete the stack by providing a lightweight interface that can evolve from simple query boxes into full analytical applications.

The result is not “analytics without SQL” but something more realistic: analytics where SQL is still present, but no longer the only way in.

From AI Chaos to Control: Building Enterprise-Grade LLM Gateways With MuleSoft Anypoint

By Jitendra Bafna

Artificial intelligence is no longer experimental. It has become an important part of how organizations are building AI agents and applications. From chatbots to autonomous systems, companies are rapidly integrating large language models (LLMs) into their workflows to improve efficiency, automate tasks, and enhance user experiences. However, as adoption grows, so does the complexity around managing these systems.

Challenges in Scaling LLM Usage

When organizations start using multiple LLMs across different teams and applications, several challenges naturally emerge:

- Lack of visibility into how models are being used
- Multiple LLM providers used independently by different teams
- Governance and compliance challenges
- Difficulty in coordinating multiple AI agents
- Rising and unpredictable operational costs
- Security and data privacy risks

In many cases, different teams begin integrating their own preferred LLMs without a shared strategy. While this approach may work initially, it quickly becomes difficult to manage at scale. This situation often leads to what can be described as AI sprawl, where systems become fragmented, inconsistent, and harder to govern centrally. Over time, this affects not only cost and security but also the overall reliability of AI-driven applications.

The Problem: AI at Scale Becomes Hard to Manage

When teams directly connect their applications to different LLM providers, a few common issues appear:

- Each team may choose a different LLM provider based on preference or convenience
- Costs become difficult to track across departments
- Sensitive or regulated data may be exposed if proper controls are missing
- There is no standard governance model across the organization
- Each team builds and maintains its own integration logic

This leads to duplicated effort, inconsistent implementations, and limited visibility for platform or IT teams responsible for oversight.
As a result, organizations struggle to maintain control over their AI ecosystem while still trying to innovate quickly.

Introducing a Centralized Approach: AI Gateway

To address these challenges, MuleSoft introduces the concept of an AI Gateway as part of the Anypoint Platform. The AI Gateway acts as a centralized control layer for all LLM requests. Instead of applications connecting directly to different LLM providers, they communicate through a single entry point. This gateway helps in:

- Routing requests to appropriate models
- Applying security and governance policies
- Tracking usage and cost across teams
- Maintaining consistency in AI interactions

In simpler terms, it works like a traffic controller for AI requests, ensuring that everything flows in a controlled, secure, and observable manner.

Key Capabilities of AI Gateway

1. Unified Access to Multiple LLMs

One of the main advantages of an AI Gateway is that it provides a single unified endpoint for accessing multiple LLM providers. Instead of integrating each model separately, developers can connect once and access different models through the same interface. This also makes it easier to switch between providers if needed, without major changes to application logic.

2. Intelligent Routing of Requests

Not all AI requests are the same. Some are simple and can be handled by smaller, cost-effective models, while others require more advanced reasoning capabilities. AI Gateway supports intelligent routing based on factors such as:

- Complexity of the request
- Cost considerations
- Performance requirements

There are generally two types of routing approaches:

Model-Based Routing (Static Routing)

In this approach, a specific model is assigned to handle a particular type of request. This is useful when the requirements are well-defined.

Semantic Routing (Dynamic Routing)

Here, the system analyzes the request and automatically decides which model is best suited.
This may involve prompt classification or intent detection to improve accuracy and efficiency. This flexibility helps organizations balance performance and cost more effectively.

3. Cost Control and Visibility

One of the major concerns with LLM usage is unpredictable cost. Since most models are billed based on token usage, expenses can grow quickly if not monitored properly. AI Gateway provides visibility into:

- Token usage per application
- Usage across teams and departments
- Overall consumption trends

This helps organizations set budgets, monitor spending, and avoid unexpected cost spikes.

4. Built-in Governance and Security

Security is a critical aspect when dealing with AI systems, especially when sensitive or regulated data is involved. AI Gateway helps enforce policies such as:

- Token limits to control usage
- Authentication and authorization mechanisms
- Data protection measures like PII masking

These controls ensure that data is handled safely and that only authorized requests are processed. It also helps reduce the risk of exposing sensitive information to external systems.

5. Monitoring and Observability

Another important capability is visibility into how AI systems are performing in real time. Through the Anypoint Platform, organizations can access:

- Detailed logs for every request
- Usage analytics and trends
- Debugging information for troubleshooting
- Compliance and audit support

This level of observability is essential for maintaining reliability in production environments, especially when AI is used in business-critical workflows.

What Is an LLM Proxy?

An LLM Proxy is a unified interface layer that sits between applications and LLM providers. Instead of integrating directly with multiple APIs, developers interact with a single endpoint, and the proxy handles routing, security, and policy enforcement behind the scenes. This abstraction simplifies development and reduces the complexity of managing multiple integrations.
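The proxy idea, together with the static and semantic routing described earlier, can be sketched in a few lines of Python. This is not MuleSoft's API or configuration format: the model names, route table, `ProxyRequest` type, and `handle` function are hypothetical, the token count is a crude whitespace split, and a keyword check stands in for real prompt classification or intent detection.

```python
import re
from dataclasses import dataclass

# Simplistic email matcher used as a stand-in for a real PII-masking policy
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Static routing: a request type is pinned to a model (hypothetical names)
STATIC_ROUTES = {"summarize": "small-model", "code-review": "large-model"}

# Keyword hints standing in for intent detection in dynamic routing
REASONING_HINTS = ("why", "explain", "compare", "step by step")


class PolicyError(Exception):
    """Raised when a request violates a gateway policy."""


@dataclass
class ProxyRequest:
    api_key: str
    prompt: str
    task: str = ""  # optional request type; enables static routing


def estimate_tokens(text):
    # Rough approximation: one token per whitespace-separated word
    return len(text.split())


def mask_pii(text):
    return EMAIL_RE.sub("[EMAIL]", text)


def choose_model(request):
    if request.task in STATIC_ROUTES:
        return STATIC_ROUTES[request.task]
    lowered = request.prompt.lower()
    if any(hint in lowered for hint in REASONING_HINTS):
        return "large-model"
    return "small-model"


def handle(request, valid_keys, max_tokens=200):
    """Apply policies in order, then return the routed, masked request."""
    if request.api_key not in valid_keys:
        raise PolicyError("unauthorized")
    if estimate_tokens(request.prompt) > max_tokens:
        raise PolicyError("token limit exceeded")
    return {"model": choose_model(request), "prompt": mask_pii(request.prompt)}


routed = handle(
    ProxyRequest("team-a-key", "Explain why churn rose; contact ana@example.com"),
    valid_keys={"team-a-key"},
)
print(routed)
```

Keeping authentication, limits, masking, and routing in one function mirrors the point of the proxy: applications send a prompt and never need to know which provider or policy set is behind the endpoint.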
How an LLM Proxy Works

A typical flow looks like this:

1. An application or AI agent sends a request to the LLM Proxy endpoint
2. The proxy (running on a gateway such as Flex Gateway) receives the request
3. Policies such as authentication, token limits, and data masking are applied
4. The request is routed to the most appropriate LLM provider
5. The response is returned to the application

All of these steps are managed centrally through the platform, which reduces the need for custom logic in each application.

High-Level Concept

At a high level, the architecture can be thought of as a control layer sitting between users and multiple LLM providers. Instead of point-to-point integrations, all traffic flows through a single managed layer. This simplifies both operational management and long-term scalability.

High-Level Architecture

[Figure: LLM Proxy Architecture]

Conclusion

AI adoption is growing rapidly, but managing it across large organizations introduces new challenges. Without proper structure, teams can end up with fragmented integrations, rising costs, and limited visibility into how AI is being used. A centralized approach using an AI Gateway and LLM proxy helps address these issues by:

- Providing a unified access layer for AI models
- Enforcing governance and security policies consistently
- Improving visibility into usage and costs
- Supporting scalable and controlled AI adoption

This enables organizations to move from isolated AI experiments toward more structured, enterprise-ready AI systems.

Take control of your AI ecosystem with MuleSoft AI Gateway, simplifying, securing, and scaling every interaction through the Anypoint Platform.

Stateless JWT Auth Microservice Architecture With Spring Boot 3 and Redis Sentinel

By Erkin Karanlık

From Indicators to Insights: Automating IOC Enrichment Using Python and Threat Feeds

By Krishnaveni Musku

Docker Hardened Images Are Free Now — Here's What You Still Need to Build

By Shamsher Khan


Kafka and Spark Structured Streaming in Enterprise: The Patterns That Hold Up Under Pressure

I've been running Kafka and Spark Structured Streaming together in production for about five years. Not in demo environments or proof-of-concept projects. In systems processing insurance claims, manufacturing telemetry, and financial transaction data, with SLAs and compliance requirements, and people who call you at 2 AM when things break.

There's a version of Kafka plus Spark Structured Streaming that looks elegant in architecture diagrams and falls apart in the first month of production. There's another version that's uglier in places but genuinely reliable. Here is what I've learned about the difference.

Getting Checkpointing Right From the Start

In my experience, checkpointing is non-negotiable for any streaming job that needs recovery. But checkpointing to local disk, which is the easiest configuration, means your streaming job can't recover from a node failure, only from a process restart. Checkpoint location must be on durable shared storage, ADLS Gen2 or equivalent, from the first day in production.

The checkpoint contains the Kafka offsets that have been committed and the state store for stateful operations. Changing either of these, whether by manually deleting the checkpoint or by changing the query name, will reset your consumer offsets. I've seen this happen accidentally twice: once when an engineer thought deleting a stale checkpoint directory was a cleanup operation, and once when a code refactoring changed the query name used as the checkpoint key. Both required manual offset reconstruction from Kafka's own offset storage. Neither was catastrophic, but both were stressful and avoidable.

Micro-Batch Sizing for Your Use Case

Spark Structured Streaming processes data in micro-batches. The trigger interval controls how often a micro-batch runs. The default, if you don't specify a trigger, is to run a new batch immediately after the previous one completes. This is correct for high-throughput workloads where you want to process data as fast as possible.
It's wrong for moderate-throughput workloads where you want predictable latency and manageable file sizes in your output Delta Lake tables.

For our manufacturing telemetry pipeline (moderate throughput, near-real-time requirement), we use a 30-second trigger. This produces files of roughly 50 to 100 MB in the output Delta table, which is manageable with a nightly compaction job. For our insurance claims pipeline (lower throughput, 5-minute SLA), we use a 2-minute trigger.

My rule of thumb: choose a trigger interval that produces output files in the 50 to 500 MB range for your throughput. Files significantly smaller than this create compaction debt. Files significantly larger than this create memory pressure during the micro-batch.

Python

# Trigger interval examples for different workloads.
# In a real deployment, each query needs its own checkpoint_path and output_path.

# High throughput: no trigger set, so the next batch starts as soon as
# the previous one completes
high_throughput_query = df.writeStream \
    .outputMode("append") \
    .format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .start(output_path)
# Note: .trigger(availableNow=True) (Spark 3.3+) is a different mode:
# it processes everything available and then stops.

# Moderate throughput (manufacturing telemetry): 30-second batches
telemetry_query = df.writeStream \
    .trigger(processingTime="30 seconds") \
    .outputMode("append") \
    .format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .start(output_path)

# Low throughput (insurance claims): 2-minute batches
claims_query = df.writeStream \
    .trigger(processingTime="2 minutes") \
    .outputMode("append") \
    .format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .start(output_path)

Kafka Partition Count and Spark Parallelism

Each Kafka partition is consumed by one Spark task per micro-batch. If your topic has 8 partitions, Spark uses 8 tasks for the Kafka read stage. If your downstream processing is more CPU-intensive than the Kafka read, you'll want more parallelism downstream. Use repartition() after the Kafka source read to increase parallelism for the heavy processing stages.
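Both the trigger rule of thumb and the partition-to-task mapping above reduce to simple arithmetic, sketched here as two helpers. The function names and the sample numbers are hypothetical, and the first helper makes a loud simplifying assumption: it treats everything written by one micro-batch as a single output file, which overstates file size when the write is spread across several tasks.

```python
import math


def trigger_seconds_for_target(throughput_mb_per_s, target_file_mb):
    """Seconds of input needed to accumulate roughly target_file_mb of output.

    Assumes one output file per micro-batch and output size equal to input size.
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return target_file_mb / throughput_mb_per_s


def task_waves(kafka_partitions, executor_cores):
    """Number of scheduling waves needed to run one read task per partition."""
    if executor_cores <= 0:
        raise ValueError("executor_cores must be positive")
    return math.ceil(kafka_partitions / executor_cores)


# Hypothetical workload: 2.5 MB/s sustained input, aiming for ~75 MB files
interval = trigger_seconds_for_target(2.5, 75)
print(f"trigger interval: {interval:.0f} seconds")

# Hypothetical cluster: 8 partitions fit in one wave on 32 cores
print(f"read-stage waves: {task_waves(8, 32)}")
```

Treat the result as a starting point only; the actual file sizes in the output table are the number to check after the first few batches.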
In the other direction: if your Kafka topic has 200 partitions because it was sized for high throughput, but your Spark cluster has 32 cores, you're trying to run 200 tasks across 32 cores with significant context switching overhead. Consider whether the partition count on the topic is appropriate for your actual throughput.

Stateful Operations and Watermarks

Windowed aggregations and stream-stream joins require Spark to maintain state across micro-batches. Without a watermark, Spark will accumulate state indefinitely, and your executor memory will grow without bound until the job fails. Always define a watermark on your event-time column for any stateful operation.

The watermark threshold is a business decision as much as a technical one. A 10-minute watermark means Spark will discard events that arrive more than 10 minutes after the event time they are associated with. If your source systems can deliver events up to 30 minutes late (common in some IoT and batch-settlement scenarios), a 10-minute watermark will cause late events to be silently dropped. Understand your source latency characteristics before setting the watermark.

Python

# Watermark definition for late-arriving events
from pyspark.sql.functions import (
    from_json, col, window, count, avg, sum as spark_sum
)

claims_parsed = kafka_df \
    .select(from_json(col("value").cast("string"), claims_schema)
            .alias("data")) \
    .select("data.*") \
    .withWatermark("event_timestamp", "30 minutes")

# Windowed aggregation with watermark
claims_hourly = claims_parsed \
    .groupBy(
        window("event_timestamp", "1 hour"),
        "claim_type",
        "region"
    ) \
    .agg(
        count("claim_id").alias("claim_count"),
        spark_sum("claim_amount").alias("total_amount"),
        avg("claim_amount").alias("avg_amount")
    )

The Monitoring You Need on Day One

First up: consumer lag per partition. This is the most important streaming metric. Growing lag means your consumer can't keep up with producer throughput, and your latency SLA is in jeopardy.

Second: micro-batch duration.
If micro-batch duration exceeds your trigger interval, you have a processing bottleneck. The job is trying to run continuously without keeping up.

Third: state store size for stateful operations. A growing state store is a memory leak waiting to become an OOM failure.

My team emits these three metrics from every streaming job to Azure Monitor. When any of them crosses a threshold, we get an alert before users notice a problem. Setting this up properly at deployment time, not after the first production incident, has saved us from several avoidable outages.

Python

# Azure Monitor metrics emission from Spark streaming
from datetime import datetime, timezone

from pyspark.sql.streaming import StreamingQueryListener
from opencensus.ext.azure import metrics_exporter


class StreamingMetricsListener(StreamingQueryListener):
    def __init__(self, app_insights_key):
        self.exporter = metrics_exporter.new_metrics_exporter(
            connection_string=f"InstrumentationKey={app_insights_key}")

    # The listener interface requires all three callbacks to be defined
    def onQueryStarted(self, event):
        pass

    def onQueryTerminated(self, event):
        pass

    def onQueryProgress(self, event):
        p = event.progress
        # Offsets are per-partition JSON, so they cannot be subtracted directly.
        # Recent Spark versions report how far the Kafka source trails the
        # latest offsets in the source metrics; verify the key on your version.
        source_metrics = p.sources[0].metrics if p.sources else {}
        self.emit("consumer_lag",
                  float(source_metrics.get("maxOffsetsBehindLatest", 0)))
        self.emit("batch_duration_ms", p.batchDuration)
        self.emit("state_store_rows",
                  p.stateOperators[0].numRowsTotal if p.stateOperators else 0)

    def emit(self, name, value):
        # Send to Azure Monitor / Application Insights.
        # The dict payload is illustrative; the OpenCensus exporter expects
        # Metric objects, so adapt this to your metrics setup.
        self.exporter.export_metrics([{
            "name": f"streaming.{name}",
            "value": value,
            "timestamp": datetime.now(timezone.utc)
        }])


spark.streams.addListener(StreamingMetricsListener(AI_KEY))

By Kuladeep Sandra

Programmatic Brand Extraction: Pulling Logos, Colors, and Assets from Any URL

Here’s a problem I kept running into: I need a company’s brand assets (their logo, their colors, maybe a hero image) and there’s no API for it.

You’re building a white-label dashboard. Or a proposal generator. Or an integration that sends branded emails on behalf of customers. Every time, you end up on their website, right-clicking “Inspect Element,” eyedropping hex codes, and downloading a pixelated PNG from their footer. It’s tedious, it breaks when they redesign, and it doesn’t scale.

So I built OpenBrand, an open-source library that extracts brand assets from any URL. Give it a website, get back structured JSON with logos, colors, and backdrop images. No API key needed if you run it as a library.

The Problem Is Harder Than It Looks

You might think: “Just scrape the <link rel='icon'> and call it a day.” But favicons are 16x16 pixels. That’s not a logo; that’s a logo for ants. Real brand extraction needs to handle:

Logo detection. Companies put their logos in wildly different places. Some use an <svg> in the header. Some use an <img> with a class like .site-logo or .brand. Some only have it as an Open Graph image in their <meta> tags. Some have it nowhere obvious, and you need to check their favicon manifest for higher-resolution variants.

Color extraction. The brand’s primary color might be in CSS custom properties (--brand-primary), in computed styles on key elements, in their stylesheet as the most-used non-white/non-black color, or embedded in their logo SVG. And you need to distinguish between “the brand color” and “the color they use for body text.”

Backdrop images. Hero images, background gradients, Open Graph images: these are useful for building branded experiences, but they’re scattered across different DOM locations and meta tags.

The point is: there’s no standard for where brands put their assets. Every website is a snowflake.

How OpenBrand Works

OpenBrand uses server-side HTML scraping with Cheerio and image analysis with Sharp.
No headless browser, no Puppeteer, just direct HTTP requests and intelligent heuristics. Here’s the approach:

JavaScript

import * as cheerio from 'cheerio';

// Fetch the page HTML with a browser-like User-Agent
const html = await fetch('https://stripe.com', {
  headers: { 'User-Agent': 'Mozilla/5.0 ...' }
}).then(r => r.text());

// Parse with Cheerio (jQuery-like DOM API for Node.js)
const $ = cheerio.load(html);

// Run extraction heuristics across the parsed markup

For sites that block direct requests, it falls back to Jina Reader, a service that renders pages and returns clean content. The extraction pipeline runs in this order:

1. Logos: Check <svg> elements in header/nav, <img> elements with logo-related classes/IDs, <link rel="icon"> manifest for high-res variants, Open Graph/Twitter card images as fallback
2. Colors: Extract theme-color meta tags, parse manifest.json, sample dominant colors from logo images using Sharp
3. Backdrops: Find Open Graph images, hero/banner images, background images on key sections

The library returns structured data:

TypeScript

import { extractBrandAssets } from "openbrand";

const result = await extractBrandAssets("https://stripe.com");
if (result.ok) {
  console.log(result.data.brand_name);      // "Stripe"
  console.log(result.data.logos);           // LogoAsset[] - SVGs, PNGs with URLs and dimensions
  console.log(result.data.colors);          // ColorAsset[] - hex values with context
  console.log(result.data.backdrop_images); // BackdropAsset[] - hero images, backgrounds
}

Three Ways to Use It

As an npm package (no API key, runs on your server):

Shell

npm add openbrand

TypeScript

import { extractBrandAssets } from "openbrand";

const result = await extractBrandAssets("https://linear.app");

Lightweight and fast, with no browser process to manage. Good for build scripts, CI pipelines, serverless functions, or backend services.
As an API (free API key from openbrand.sh):

Shell

curl "https://openbrand.sh/api/extract?url=https://stripe.com" \
  -H "Authorization: Bearer your_api_key"

Good for client-side apps or anywhere you want a simple HTTP call.

As an agent skill (for Claude Code, Cursor, Codex, Gemini CLI):

Shell

npx skills add ethanjyx/openbrand

Then just ask your AI agent: “Extract brand assets from linear.app.” This is probably the most interesting distribution channel: 40+ AI coding agents can use it as a tool.

What I Got Wrong (And What I’d Do Differently)

Some honest takes on the tradeoffs:

Static HTML has limits. We don’t execute JavaScript, which means heavily SPA-dependent sites may not expose all their brand assets in the initial HTML. In practice, this matters less than you’d think: logos, favicons, OG tags, and most brand-relevant markup live in static HTML. For the few sites where it fails, the Jina Reader fallback helps. We chose speed and simplicity over completeness.

Logo detection is fuzzy. There’s no semantic HTML tag for “this is the company’s logo.” Heuristics work well for ~85% of sites but break on unusual layouts. Some sites put their logo in a <div> with a background image. Some use CSS mask-image. The current approach has a priority-ranked list of strategies, but it’s not perfect.

Color extraction conflates brand color with design system color. A company might use blue as its brand color but green for its primary CTA buttons. OpenBrand currently returns both without distinguishing between them. This is a known limitation: brand identity and UI design tokens overlap but aren’t identical.

Rate limiting. If you’re extracting from many URLs, you need to be respectful. The API has rate limits built in, but the npm package doesn’t throttle; that’s your responsibility.
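For batch jobs, a caller-side throttle can be as simple as enforcing a minimum gap between requests. The sketch below shows the pattern in Python purely for illustration; OpenBrand itself is a Node.js library and this is not part of it. `MinIntervalThrottle`, `extract_many`, and the `extract` callable are hypothetical names, and the one-second interval is an arbitrary example value.

```python
import time


class MinIntervalThrottle:
    """Blocks so that consecutive calls are at least min_interval_s apart."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval_s
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._last = None        # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def extract_many(urls, extract, throttle):
    """Call extract(url) for each URL, pausing between calls as needed."""
    results = {}
    for url in urls:
        throttle.wait()
        results[url] = extract(url)
    return results


if __name__ == "__main__":
    # Placeholder extractor; swap in a real call to your extraction backend
    def fake_extract(url):
        return {"url": url, "logos": []}

    throttle = MinIntervalThrottle(min_interval_s=1.0)
    print(extract_many(["https://example.com", "https://example.org"],
                       fake_extract, throttle))
```

A fixed interval is the bluntest option. Per-host limits or backoff on error responses are natural next steps if you extract from many domains at once.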
Where This Is Actually Useful

Real use cases I’ve seen or built:

- White-label SaaS: Automatically theme a customer’s dashboard using their brand colors on first login
- Proposal/invoice generators: Pull the client’s logo and colors to brand documents without asking them to upload assets
- Competitive analysis tools: Track how competitors’ branding evolves over time
- AI agents: Give LLMs the ability to “see” a brand without manual configuration, which is useful for generating branded content, emails, or presentations
- Design system bootstrapping: Start a new project by extracting the brand’s existing visual language

Try It

The repo is at github.com/ethanjyx/openbrand. MIT licensed. The fastest way to see if it works for your use case:

Shell

npm add openbrand
node -e "
import('openbrand').then(async ({extractBrandAssets}) => {
  const r = await extractBrandAssets('https://your-target-site.com');
  if (r.ok) console.log(JSON.stringify(r.data, null, 2));
  else console.error(r.error);
});
"

If you find sites where the extraction breaks, open an issue. The heuristics improve with every edge case.

By Yixing Jiang

Setting Up a Data Catalog With Azure Purview and Collibra: What Three Attempts Taught Me

This data catalog project was the third catalog implementation I have led in my career. My first was a custom-built solution in 2015 that worked but required three engineers to maintain. Number two was an off-the-shelf tool that nobody used because it was too cumbersome to keep current. For this third attempt, I wanted to get it right.

We implemented Azure Purview for automated discovery and technical metadata, and Collibra for business glossary, data ownership, and governance workflows. They serve different functions and are connected through a custom integration. Here is how we set it up and what surprised us.

Why Two Tools?

Azure Purview is excellent at automated technical metadata collection. Purview scans your data sources on a schedule, discovers tables and columns, infers data types, and builds an automatically maintained lineage graph. Automated discovery is its primary value. Doing this manually doesn't scale, and any manually maintained catalog falls behind the actual state of the data within months.

Purview isn't good at business governance workflows: data stewardship, business term assignment, data quality certification, access request approvals. These require human processes with approvals and audit trails that Purview's workflow capabilities do not cover adequately.

Collibra handles the governance workflow side. Business data stewards maintain the business glossary in Collibra. Ownership assignments and data quality certifications go through Collibra's workflow engine. When a data consumer wants to know what a dataset means in business terms, they look in Collibra. When they want to know where the data physically lives and what its schema is, they look in Purview.

The Purview Setup

Purview scans are configured per data source. We set up scans for our three ADLS Gen2 storage accounts, our Azure SQL databases, our Databricks Unity Catalog, and our Azure Data Factory pipelines.
Scans run daily for production data sources and weekly for development. Purview builds a lineage graph from ADF pipelines, which is genuinely useful. We can see, for any given table, which pipelines write to it and which tables those pipelines read from. Lineage tracking has been valuable three times in incident investigations where we needed to understand the upstream sources of a corrupted dataset.

Custom classifications are worth the setup time. Purview comes with built-in classifiers for common PII patterns: email addresses, phone numbers, credit card numbers, and national ID formats for several countries. We added custom classifiers for our internal account number formats and insurance policy number patterns. Automated classification isn't perfect (about 85% accurate in our testing), but it surfaces PII-candidate columns that manual review would miss.

Python

# Purview scan configuration (REST API)
import requests


def create_purview_scan(account_name, collection, data_source):
    # get_auth_headers() is not shown here; it is expected to return the
    # authorization headers for the Purview scanning API.
    url = (f"https://{account_name}.purview.azure.com/scan/datasources/"
           f"{data_source}/scans/daily-production-scan")
    body = {
        "kind": "AzureStorageMsi",
        "properties": {
            "scanRulesetName": "custom-pii-ruleset",
            "scanRulesetType": "Custom",
            "collection": {"referenceName": collection},
            "credential": {
                "referenceName": "managed-identity",
                "credentialType": "ManagedIdentity"
            }
        },
        "trigger": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2024-01-01T02:00:00Z",
                "timezone": "UTC"
            }
        }
    }
    resp = requests.put(url, json=body, headers=get_auth_headers())
    resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error body
    return resp.json()


# Custom classifier for internal account numbers
custom_classifier = {
    "kind": "Custom",
    "properties": {
        "classificationName": "INTERNAL_ACCOUNT_NUMBER",
        "description": "Internal 12-character account number format (ACC + 9 digits)",
        "classificationRule": {
            "kind": "Regex",
            "pattern": "^ACC[0-9]{9}$",
            "minimumPercentageMatch": 75
        }
    }
}

The Collibra Integration

We built a nightly sync that reads technical metadata from Purview via its REST API and
creates or updates corresponding assets in Collibra. Our sync maps Purview datasets to Collibra data assets, adds technical metadata (schema, classification, lineage summary) as attributes on the Collibra asset, and creates a link between the Collibra and Purview assets so users can navigate between the business and technical views.

Building this sync took about six weeks of engineering time. It's the part of the implementation where I most seriously considered an off-the-shelf connector, but the available connectors didn't handle our specific Purview classification tagging approach correctly. Our custom sync has been running for 14 months with minimal maintenance.

Python

# Nightly Purview-to-Collibra metadata sync (Python)
import requests
from datetime import datetime, timezone


def sync_purview_to_collibra(purview_client, collibra_client):
    """Sync technical metadata from Purview to Collibra assets."""
    # Fetch all cataloged assets from Purview
    purview_assets = purview_client.discovery.query(
        keywords="*",
        filter={"and": [
            {"entityType": "azure_datalake_gen2_path"},
            {"classification": ["confidential", "restricted"]}
        ]},
        limit=1000
    )

    for asset in purview_assets['value']:
        collibra_asset = collibra_client.find_or_create_asset(
            name=asset['name'],
            domain="Data Lake Assets",
            type="Data Set"
        )
        # Sync technical metadata as attributes.
        # get_lineage_summary() is a helper that is not shown here.
        collibra_client.update_attributes(collibra_asset['id'], {
            "Technical Schema": asset.get('schema', ''),
            "Data Classification": asset.get('classification', []),
            "Purview Asset Link": asset['id'],
            "Last Scanned": asset.get('lastScanTime', ''),
            "Lineage Summary": get_lineage_summary(purview_client, asset['id']),
            "Sync Timestamp": datetime.now(timezone.utc).isoformat()
        })

    return {"synced": len(purview_assets['value']),
            "timestamp": datetime.now(timezone.utc).isoformat()}

What Adoption Looked Like

Adoption was slow. We launched the catalog with a communication campaign, internal documentation, and a live demo.
After three months, about 30% of our target user base was actively using it, primarily data engineers who were looking up lineage information. Analysts and business stakeholders, the people Collibra was primarily designed to support, were largely not using it.

Adoption really broke through when we integrated the catalog with our data access request process. Previously, access requests went to a Jira form. We changed the process: to request access to a dataset, you start from the Collibra data asset page. Each access request is contextualized with the asset's classification, ownership, and purpose, which both the requester and the approver see during the approval workflow. Usage of Collibra for data assets grew by 300% in the month after we made this change.

Python

# Collibra asset mapping schema for access request workflow
{
    "asset_type": "Data Set",
    "domain": "Data Lake Assets",
    "attributes": {
        "Technical Name": {"type": "text", "source": "purview"},
        "Business Name": {"type": "text", "source": "steward"},
        "Data Classification": {
            "type": "single_select",
            "values": ["public", "internal", "confidential", "restricted"],
            "source": "purview"
        },
        "Owner Team": {"type": "text", "source": "steward"},
        "PII Columns": {"type": "multi_select", "source": "purview"},
        "Quality Certification": {
            "type": "single_select",
            "values": ["certified", "provisional", "uncertified"],
            "source": "steward"
        },
        "Access Request URL": {
            "type": "url",
            "template": "https://collibra.internal/access/{asset_id}"
        }
    },
    "workflow": {
        "access_request": {
            "approvers": ["asset_owner", "data_governance_lead"],
            "sla_hours": 48,
            "auto_revoke_days": 365
        }
    }
}

The Honest Caveat

A data catalog requires ongoing investment that is easy to underestimate. The automated parts (Purview's scanning and discovery) take care of themselves. The business governance parts (glossary maintenance, stewardship assignments, and quality certifications) require human effort that must be budgeted and owned.
Our Collibra business glossary currently covers about 60% of our production datasets. The remaining 40% have technical metadata from Purview but no business context. That 40% is smaller than it was six months ago, which means we are making progress. But it's a real gap that we manage explicitly rather than pretending the catalog is complete.
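
As a footnote to the custom classifier shown earlier, here is a minimal sketch of how a regex rule with a minimum match percentage could be checked against a sample of column values before registering it. The pattern and the 75% threshold come from the classifier definition above; the function names, the sample data, and the choice to ignore empty values are assumptions for illustration, and this is not a description of how Purview evaluates rules internally.

```python
import re

# Pattern copied from the custom classifier definition in the article.
ACCOUNT_PATTERN = re.compile(r"^ACC[0-9]{9}$")


def percentage_match(values, pattern=ACCOUNT_PATTERN):
    """Return the percentage of non-empty values that match the pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if pattern.match(v))
    return 100.0 * hits / len(non_empty)


def would_classify(values, minimum_percentage=75):
    """Hypothetical decision rule: tag the column when enough values match."""
    return percentage_match(values) >= minimum_percentage


# Hypothetical sample: three well-formed account numbers and one placeholder.
sample_column = ["ACC123456789", "ACC000000001", "ACC987654321", "n/a"]
print(percentage_match(sample_column))  # 75.0
print(would_classify(sample_column))    # True
```

Running a check like this on a few known columns is a cheap way to catch a pattern that is too strict or too loose before a scheduled scan applies it across every source.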

By Kuladeep Sandra

Exactly-Once Processing: Myth vs Reality

Exactly-once processing (EOP) is often touted as the gold standard for reliability in distributed systems. The promise of processing each message just once seems perfect, whether you're developing financial systems, real-time analytics pipelines, or event-driven microservices. But the truth is much more complex. What most systems refer to as "exactly once" is actually an approximation that balances trade-offs, limitations, and assumptions rather than an absolute guarantee. To help you create systems that are accurate, robust, and practical, this article dissects exactly-once processing.

The Myth

The myth is straightforward: if the broker supports exactly-once semantics, the entire application must execute exactly once. Similarly, it is frequently assumed that the entire pipeline ensures exactly-once processing if a stream processor can restore state from checkpoints. Another widespread misconception is that if a database transaction commits only once, the user-visible result must likewise have happened exactly once. Examined across a whole real-world system, none of these claims holds.

The Reality

A more truthful statement is this:

- A broker can provide exactly-once writes between producers and broker-managed topics.
- Within its checkpoint boundary, a stream processor can provide exactly-once state transitions.
- A database can make one local transaction atomic.
- An end-to-end business workflow is effectively-once only when idempotency and replay safety are implemented across all boundaries.

That is more than a matter of semantics. It is designing for failure.

Three Important Questions

A useful design review asks three distinct questions:

- Can the same input be delivered more than once?
- Can the same computation be attempted more than once?
- Can a business effect occur more than once?

The design is typically sound if the first two questions are answered "yes" and the third is answered "no."
Definition

Before debating "exactly once," clarify precisely what it means at each layer:

| Semantic | Definition |
| --- | --- |
| At-most-once | A message is delivered zero or one times and is never retried. No duplicates, possible data loss. |
| At-least-once | A message is retried until acknowledged. No data loss, but duplicates are possible. |
| Exactly-once within a boundary | A subsystem guarantees one logical input results in one durable state transition relative to its recovery model. |
| Effectively-once | Duplicate deliveries may occur, but the final business effect is idempotent, so users observe the action exactly once. |

The first hard fact is that exactly-once is nearly always local rather than global.

Where the Boundaries Actually Break

Local exactly-once guarantees stop at subsystem boundaries unless the next hop is also replay-safe. Every boundary has its own notion of acknowledgment, retry, timeout, crash recovery, replay, and deduplication, and a strong guarantee at one boundary does not automatically compose with the next.

| Boundary | Mechanism | Guarantee Level | Failure Risk |
| --- | --- | --- | --- |
| Producer to Broker | Producer ID + sequence numbers | Idempotent append | Limited to broker partitions |
| Broker to Consumer | Consumer group offsets | At-least-once delivery | Crash leads to replay |
| Consumer to DB | ACID transaction + unique key | Idempotent sink write | No dedupe causes duplicates |
| DB to Outbox | Same ACID transaction | Atomic state + event | Relay may republish |
| Outbox to Downstream Broker | Event ID deduplication by consumer | At-least-once publish | Consumer must handle dedupe |
| Service to External API | Idempotency key per request | Effectively-once | Timeout creates uncertainty |

Why Duplicates Are Common Rather Than Unusual

Duplicates are typically a result of recovery behavior:

- After a send timeout, a producer retries because it cannot distinguish between failure and success.
- A consumer writes to a sink and crashes before the offset is acknowledged.
- A processor reruns a portion of the input after restoring from a checkpoint.
- A relay republishes an outbox record because the publish acknowledgment was not observed.
- An HTTP client retries after a timeout even though the remote service has already applied the request.

None of these actions is strange. They are precisely what well-behaved distributed software does under uncertainty.

Technical Reality at Each Layer

Producer Semantics

At the producer layer, exactly-once usually means idempotent append behavior to the broker, not magical downstream correctness. For a broker such as Kafka, an idempotent producer can use:

- Producer ID
- Sequence numbers
- Partition-level ordering
- Broker-side dedupe of retried sends

A minimal producer configuration looks like:

Java

enable.idempotence=true
acks=all
# Integer.MAX_VALUE
retries=2147483647
max.in.flight.requests.per.connection=5

This matters. It prevents a common class of duplicate appends caused by producer retries. However, it only covers records written to Kafka partitions. It says nothing about:

- A consumer writing twice to PostgreSQL
- A webhook firing twice
- An email being sent twice

Kafka Producer Details: Idempotence Versus Transactions

Kafka provides two distinct tools that are frequently conflated into a single idea:

- Idempotent producer semantics
- Transactional producer semantics

Idempotence protects retried appends to a partition. Transactions allow multiple writes and offset commits to be combined into a single broker-visible unit.
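
To make the idempotence half of that distinction concrete, here is a toy sketch of broker-side dedupe based on a producer ID and per-producer sequence numbers. It is deliberately simplified and is not Kafka's implementation (real brokers track more state, such as producer epochs and batches); the class name `PartitionLog` and the payloads are hypothetical.

```python
class PartitionLog:
    """Toy model of one broker partition that dedupes retried appends."""

    def __init__(self):
        self.records = []
        # Highest sequence number accepted so far, per producer ID.
        self.last_seq = {}

    def append(self, producer_id, seq, payload):
        """Append the record unless this (producer, seq) was already accepted.

        Returns True when the record is newly appended, False for a duplicate.
        """
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            # Retried send: acknowledge it, but do not append it again.
            return False
        if seq != last + 1:
            raise ValueError(f"out-of-order sequence {seq}, expected {last + 1}")
        self.records.append(payload)
        self.last_seq[producer_id] = seq
        return True


log = PartitionLog()
log.append("producer-1", 0, "order-created")
log.append("producer-1", 1, "order-paid")
# The producer never saw the acknowledgment for seq 1 and retries the send.
log.append("producer-1", 1, "order-paid")
print(log.records)  # ['order-created', 'order-paid']
```

Note what the sketch does not cover: nothing here stops a consumer of this log from applying "order-paid" twice to its own database, which is exactly the scope limitation described above.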
The following are typical transactional producer settings:

Java

enable.idempotence=true
transactional.id=orders-service-1
acks=all

And the application flow is conceptually:

Java

producer.initTransactions();
producer.beginTransaction();
producer.send(outputRecord1);
producer.send(outputRecord2);
producer.sendOffsetsToTransaction(offsets, consumerGroupMetadata);
producer.commitTransaction();

This solves one specific problem: if the application consumes from Kafka and produces back to Kafka, output records and consumed offsets become visible atomically within Kafka. But take note of what's absent:

- No database write is included
- No webhook is included
- No Redis mutation is included
- No email is included

Kafka transactions are real, but their scope is Kafka-managed state rather than arbitrary external state.

Broker and Consumer Semantics

When the consumer's progress marker lags behind a durable write that has already completed, the broker may redeliver a record. The classic race is:

1. Consumer reads message
2. Consumer writes to sink
3. Process crashes before offset commit
4. Consumer restarts
5. Broker redelivers

The business effect occurs twice if the sink write is not idempotent. For this reason, "broker exactly once" and "application exactly once" are not interchangeable terms.

Kafka Consumer Details: Offsets Are Not Business State

Treating the consumer offset as evidence that the business action occurred exactly once is the root cause of many application bugs. It isn't. The offset is only evidence of the consumer group's progress. These are distinct facts:

- The broker knows a record was consumed by a consumer group
- The application knows a sink mutation was committed
- The business knows the effect should be visible once

These facts agree only when the application explicitly aligns them.

Additionally, isolation.level=read_committed is frequently misinterpreted on the read side. It stops consumers from seeing transactional writes that Kafka producers have aborted.
Although that is helpful, it does not make every downstream effect exactly-once. It simply means:

- Committed Kafka transactions are visible
- Aborted Kafka transactions are hidden

It does not mean:

- Downstream database writes cannot duplicate
- External side effects cannot duplicate
- Application retries are automatically idempotent

Because of this, consumer correctness nearly always requires one of:

- Idempotent sink keys
- Inbox dedupe tables
- Sink-side upserts
- Compare-and-set version checks
- Business-key uniqueness constraints

The Honest Way to Document Guarantees

Rather than writing "the system is exactly once," use scoped statements for each boundary in your documentation:

| Boundary | Guarantee Statement |
| --- | --- |
| Producer to Kafka | Producer writes to Kafka are idempotent per partition. |
| Processor State | State transitions are exactly-once relative to checkpoint recovery. |
| Primary Database Writes | Writes are idempotent by business key and message ID. |
| Outbox Publication | Publication is at-least-once; downstream consumers must dedupe by event ID. |
| Webhook / SaaS APIs | Integrations rely on idempotency keys and replay-safe receivers. |
| End-to-End Behavior | Business behavior is effectively-once under replay and recovery. |

Final Thoughts

If you make the promise specific and local, exactly-once processing is not a fiction. Within clearly specified boundaries, databases, Kafka, Flink, and other subsystems can genuinely offer exactly-once guarantees. The misconception is that the local assurances automatically combine to form a single end-to-end promise that spans brokers, processors, databases, outboxes, caches, webhooks, and external APIs.
The robust architecture in real systems is typically:

- At-least-once delivery
- Transactional local state
- Idempotent sinks
- Outbox or inbox patterns
- Idempotency keys for side effects
- Explicit documentation of every guarantee boundary

It is harder to market than stating "exactly once everywhere," but it is the version that survives node draining, broker failovers, replay after restore, and ambiguous network timeouts.
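
As one illustration of the "inbox pattern plus transactional local state" combination from the list above, here is a minimal sketch using SQLite from the Python standard library. The table names, the `handle_deposit` function, and the message IDs are hypothetical; the point is only that recording the message ID and applying the business effect in the same local transaction makes a redelivered message harmless.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inbox (message_id TEXT PRIMARY KEY);
    CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL);
    INSERT INTO balances VALUES ('acct-1', 0);
""")


def handle_deposit(conn, message_id, account, amount):
    """Apply a deposit at most once per message_id.

    The inbox insert and the balance update share one local transaction, so a
    redelivered message hits the primary-key constraint and changes nothing.
    Returns True if the effect was applied, False if it was a duplicate.
    """
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO inbox (message_id) VALUES (?)", (message_id,)
            )
            conn.execute(
                "UPDATE balances SET amount = amount + ? WHERE account = ?",
                (amount, account),
            )
        return True
    except sqlite3.IntegrityError:
        return False


handle_deposit(conn, "msg-42", "acct-1", 100)
handle_deposit(conn, "msg-42", "acct-1", 100)  # simulated redelivery
balance = conn.execute(
    "SELECT amount FROM balances WHERE account = 'acct-1'"
).fetchone()[0]
print(balance)  # 100
```

Delivery here is still at-least-once; the business effect is effectively-once because the sink, not the broker, enforces uniqueness.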

By Irullappan irulandi

Beyond Partitioning and Z-Order: A Deep Dive into Liquid Clustering for Unity Catalog Managed Tables

Partitioning and Z-Ordering have long been fundamental techniques in Delta Lake for optimizing data layout and query performance. However, these methods require significant upfront design and ongoing maintenance, and they often struggle to adapt to changing data and query patterns. Databricks Liquid Clustering, introduced with Delta Lake 3.0, goes beyond traditional partitioning and Z-Order, offering a self-tuning, flexible approach to organizing data that is especially powerful for Unity Catalog managed tables. In this article, we’ll explore how Liquid Clustering works, how it compares to traditional methods, and how to implement it in Databricks Unity Catalog for improved performance and simpler data management.

Recap: Partitioning and Z-Order Limitations

Before diving into Liquid Clustering, it’s important to understand the challenges of conventional partitioning and Z-Ordering in large Delta Lake tables:

- Design Complexity & Rigidity: Choosing an optimal partitioning scheme is difficult, and the choice is usually fixed. A static Hive-style partition strategy often demands careful upfront planning to avoid data skew and concurrency conflicts, and it cannot easily adapt if query patterns change. Changing partition columns later means expensive data rewrites.
- Partition Explosion & Metadata Overhead: If you partition on high-cardinality columns or many levels, you may end up with too many small partitions. This proliferation of tiny files and directories increases metadata overhead and slows down query planning.
- Need for Additional Clustering (Z-Order): Z-Ordering is often applied on top of partitions to co-locate related data. While Z-Order can improve data skipping, it is expensive to maintain: it requires heavy shuffle and rewrite jobs and does not handle concurrent writes well. In other words, Z-Ordering jobs can be lengthy and costly and must be re-run as new data arrives to maintain clustering.
- Manual Tuning & Maintenance: Both partitioning and Z-Order require continuous tuning.
Data engineers must monitor query patterns and manually decide how to partition or when to re-run Z-Ordering. This ongoing maintenance is time-consuming and error-prone.

In summary, traditional partitioning and Z-Ordering yield performance benefits, but at the cost of rigidity and operational overhead. This sets the stage for a more adaptive solution.

What Is Liquid Clustering?

Liquid Clustering is a new data layout strategy in Databricks Delta Lake designed to replace traditional partitioning and Z-Ordering for Delta tables. The name "liquid" signifies flexibility: data is clustered by one or more columns in a way that can evolve over time without strict, static partitions. Key characteristics of Liquid Clustering include:

- Dynamic, Self-Tuning Layout: Instead of static partitions, data is dynamically clustered based on specified clustering keys. The table’s storage layout automatically adjusts to changing data and query patterns, incrementally clustering new data as it is written. This means the data layout flows with your workload.
- Simplicity in Key Selection: You choose a set of clustering columns based on query access patterns, typically the columns most commonly used in WHERE filters or joins. You don’t need to worry about column cardinality, order of keys, or file size tuning; the platform handles optimal file sizing and clustering internally. Even high-cardinality columns, which would be impractical as partition keys, can be used effectively.
- Flexibility to Change Keys (No Rewrites): Perhaps the most revolutionary aspect is that clustering keys can be redefined without rewriting existing data files. If your query patterns shift, you can alter the clustering columns and the system will gradually reorganize data for the new keys.
There’s no massive upfront cost of re-partitioning the entire dataset; past data doesn’t need an immediate rewrite.
- Skew-Resistant & Efficient Storage: Liquid Clustering is designed to maintain balanced file sizes and avoid the pitfalls of skewed partitions. Under the hood, the data engine can combine or split clustering ranges to keep files at an optimal size.
- Reduced Maintenance Overhead: Because the data layout adapts automatically, the need for manual maintenance is drastically reduced. You no longer have to schedule regular Z-Ordering jobs or hand-tune partition schemes. Liquid Clustering, especially in its automatic mode, offloads these decisions to Databricks.

Databricks recommends using Liquid Clustering for most new Delta tables going forward, especially for tables that are large, have high-cardinality filter columns, experience data skew, or have evolving access patterns. It simplifies data engineering by offering "set it and forget it" clustering. In fact, thousands of customers have already adopted it: as of 2025, over 3,000 monthly customers were writing 200+ PB of data into Liquid Clustered tables.

Liquid Clustering vs Traditional Methods

Liquid Clustering addresses the limitations of partitions and Z-ordering in several ways:

- No Rigid Partition Boundaries: Unlike Hive partitions, liquid clustering can store a range of values in each data file. This fluid layout avoids issues like tiny partitions or unbalanced file sizes.
- Incremental and Low-Shuffle Clustering: New data is clustered as it’s ingested, without requiring a full table rewrite. When you enable clustering on a table, Databricks flags the table to cluster future writes according to the specified keys. Each new INSERT or MERGE automatically writes out files clustered on those keys, and small files are merged as needed. This incremental approach means no huge one-time sort jobs every time you add data.
Maintenance operations like OPTIMIZE still play a role, but they can operate more efficiently since the incoming data is already clustered on write. Notably, the OPTIMIZE command for a liquid-clustered table can be more adaptive than traditional OPTIMIZE with ZORDER: it only rearranges data that isn’t well clustered yet rather than always rewriting everything.
- Adapting to Change Without Rewriting Everything: In a partitioned table, if you realize a month later that queries would run faster partitioned by a different column, you’d have to repartition the entire dataset. With Liquid Clustering, you can simply issue an ALTER TABLE to change the clustering column set. The system will use the new keys for all future writes, while existing files remain as they are until an optimization is triggered. You can later run a full optimize to reorganize historical data under the new scheme if needed. This means you can respond to evolving query patterns without incurring an immediate cost for reprocessing the whole table.
- Better Concurrency and Fewer Conflicts: Because Liquid Clustering avoids overly granular partitions and heavy-duty clustering jobs, it also mitigates concurrency problems. Traditional partitions can suffer write conflicts if too many jobs target the same partition, and Z-order optimize jobs can conflict with concurrent writes. Liquid Clustering’s design results in fewer such bottlenecks.
- Performance Gains: Ultimately, the goal is faster queries and lower cost. By clustering data on the actual query predicates, Liquid Clustering improves data skipping. This leads to less IO and faster execution. In one benchmark, Databricks observed that a 1 TB warehouse dataset clustered with Liquid Clustering ran 2.5× faster to optimize (cluster) than using Z-Ordering, and yielded significantly better query performance than either partitioning or Z-Order.
In real workloads, users have reported dramatic improvements; for example, Healthrise (a Databricks customer) saw some queries run up to 10× faster after enabling Automatic Liquid Clustering on their tables. We’ll discuss Automatic mode shortly.

How Liquid Clustering Works (Under the Hood)

At a high level, manual Liquid Clustering works by clustering data files on chosen key columns, while automatic Liquid Clustering adds an intelligent layer to choose and adjust those keys for you. Let’s break down the mechanisms:

- Clustering on Write: When you define clustering keys for a Delta table, the Delta engine ensures that newly written data is organized according to those keys.
- Maintenance and OPTIMIZE: Over time, as data is appended, you may still accumulate some fragmentation. The OPTIMIZE command can be used on a clustered Delta table to compact small files and sort data more finely according to the clustering columns. Unlike Z-Ordering, an optimize on a liquid-clustered table doesn’t always have to rewrite all files; it focuses on incremental clustering, merging files that are sub-optimally placed. You can think of it as tightening the clustering. If you change the clustering columns via ALTER TABLE, you can run OPTIMIZE FULL to recluster all existing records under the new key order. In normal operation, Databricks recommends running periodic OPTIMIZE to keep performance optimal, but these operations are more lightweight than traditional heavy Z-order jobs.
- Data Skipping with Statistics: Delta Lake maintains per-file statistics, such as minimum and maximum column values, that the query engine uses for data skipping. Liquid Clustering maximizes the effectiveness of data skipping by keeping those min/max ranges narrow and aligned with query filters.

Enabling Automatic Clustering

To use Automatic Liquid Clustering, you need to have Predictive Optimization enabled for your workspace (this is the feature in Unity Catalog that handles these background optimizations).
Many new Databricks accounts have this on by default since late 2024, but it can also be enabled via the account console (under Feature Enablement). Assuming it’s enabled, turning on Automatic clustering for a table is straightforward:

SQL: Use the CLUSTER BY AUTO clause when creating or altering a Delta table. For example, to create a new table in Unity Catalog with auto clustering:

SQL

-- Creating a Unity Catalog managed table with Automatic Liquid Clustering
CREATE TABLE main.analytics.user_events (
    user_id STRING,
    event_type STRING,
    event_date DATE,
    details STRING
)
CLUSTER BY AUTO;  -- enables automatic liquid clustering on this table

To enable it on an existing table:

SQL

ALTER TABLE main.analytics.user_events CLUSTER BY AUTO;

This instructs Databricks to begin monitoring the table’s workload and to auto-select clustering keys for optimal performance. The table does not need to have any manual keys set; the system will determine them. (Under the hood, the first time it chooses keys, it will update the table’s metadata with those columns as clustering keys.)

PySpark API: In code, you can also enable auto clustering when writing data. For instance, using the DataFrame Writer API in PySpark:

Python

# df is a DataFrame we want to save as a Delta table with auto clustering
df.write.format("delta") \
    .option("clusterByAuto", "true") \
    .mode("overwrite") \
    .saveAsTable("main.analytics.user_events_auto")

The above will create the user_events_auto table as a Unity Catalog managed table with automatic clustering enabled. (If you want to provide an initial hint for clustering columns, you can combine .clusterBy("col1", "col2") with the clusterByAuto=true option, but it’s not required; the system will figure it out if you leave it open.)

Once Automatic mode is on, no further action is needed from the user. Databricks will handle running background optimize jobs as needed. It’s worth noting that these maintenance operations run on serverless compute in the background.
The benefit is that you no longer need to schedule OPTIMIZE or VACUUM on your own; predictive optimization will run them at optimal times.

Using Manual Liquid Clustering (Custom Clustering Keys)

In some cases, you may want to manually specify the clustering columns. Unity Catalog supports manual Liquid Clustering on managed tables as well. Here’s how to use it:

Table Creation with Cluster Keys: You can define clustering keys in the CREATE TABLE statement via a CLUSTER BY clause. For example:

SQL

-- Create a Delta table clustered by specific columns (manual clustering)
CREATE OR REPLACE TABLE main.analytics.sales_data (
    sale_id BIGINT,
    region STRING,
    product STRING,
    sale_date DATE,
    amount DECIMAL(10,2)
)
CLUSTER BY (region, sale_date);

In this example, the table’s data will be clustered by region and sale_date. This means each file written will tend to contain a narrow range of region values and sale_date values. This is analogous to creating a partitioned table on multiple keys, but without creating separate directories for each region or date.

Altering an Existing Table: If you have an unpartitioned Delta table and want to enable clustering on it, use an ALTER statement. For instance:

SQL

ALTER TABLE main.analytics.sales_data CLUSTER BY (region, sale_date);

This will register region and sale_date as the clustering keys for sales_data. As mentioned, this does not rewrite existing files immediately. It flags the table so that future writes will be clustered by these keys. Any new data you append or merge into sales_data will now be written in clustered order. Data that was already in the table remains in its original layout until you optimize.

Reclustering Existing Data: To apply the new clustering to old files, you can run an OPTIMIZE operation. For a large table, you might do this during a maintenance window. For example:

SQL

OPTIMIZE main.analytics.sales_data;

The above will compact small files and cluster data incrementally.
If you recently changed the clustering keys and want to force a full re-cluster of all data under the new key order, use OPTIMIZE main.analytics.sales_data FULL. An OPTIMIZE FULL will read and rewrite all files in the table, arranging them according to the current clustering columns. In most cases, a regular OPTIMIZE will suffice, as it will naturally pick up new keys over time.

PySpark Write with Clustering Keys: You can also write data from Spark with clustering, similar to how you’d write partitioned data. For example:

Python

# Given a Spark DataFrame df, write it to a Delta table with clustering on specified keys
df.write.format("delta") \
    .mode("append") \
    .clusterBy("region", "sale_date") \
    .saveAsTable("main.analytics.sales_data")

Here, .clusterBy("region", "sale_date") ensures the data in df gets written out clustered by those columns. If the table sales_data was not already created, this will create it with those cluster keys.

Finally, remember that Liquid Clustering is supported only on Delta tables with the latest protocols. Enabling it upgrades your table’s Delta protocol version to one that older clients cannot read. In a Databricks environment this is usually not an issue, but be cautious if you have external readers or writers that might be using older Delta Lake libraries.

Conclusion

Liquid Clustering represents a major evolution in data layout management for the Lakehouse. By moving beyond the rigidity of partitioning and the heavy operational cost of Z-Ordering, it delivers a simpler and more adaptive way to optimize tables. For data engineers, this means less time agonizing over partition strategies and maintenance jobs, and more time focusing on data and insights. With Unity Catalog’s Automatic Liquid Clustering, the process is taken a step further: clustering becomes a self-driving process, leveraging query insights to continuously improve performance.
In summary, Databricks Liquid Clustering dynamically organizes data based on actual usage, can adjust without expensive rewrites, and has been shown to boost query performance significantly. As you design your next Delta Lake tables in Unity Catalog, consider leveraging Liquid Clustering from the start: it can simplify your architecture and help your tables stay optimized as your data (and its use cases) grow.
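
To build some intuition for why clustering improves data skipping, here is a small self-contained simulation in plain Python. It is a toy model, not Delta Lake's implementation: "files" are just lists with recorded min/max statistics, a plain sort on a single key stands in for the clustering step, and the function names, file size, and filter range are all assumptions chosen for illustration.

```python
import random


def write_files(values, rows_per_file):
    """Split values into 'files' and record min/max statistics for each one."""
    files = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        files.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return files


def files_to_scan(files, lo, hi):
    """Count files whose [min, max] range overlaps the filter lo <= x <= hi."""
    return sum(1 for f in files if f["max"] >= lo and f["min"] <= hi)


random.seed(7)
values = [random.randrange(10_000) for _ in range(10_000)]

# Files written in arrival order: every file spans almost the whole key range.
unclustered = write_files(values, rows_per_file=500)
# Files written after sorting on the key: each file covers a narrow range.
clustered = write_files(sorted(values), rows_per_file=500)

unclustered_scans = files_to_scan(unclustered, 2_000, 2_100)
clustered_scans = files_to_scan(clustered, 2_000, 2_100)
print(f"unclustered: {unclustered_scans} of {len(unclustered)} files")
print(f"clustered:   {clustered_scans} of {len(clustered)} files")
```

With the unclustered layout, nearly every file's min/max range overlaps the filter, so almost nothing can be skipped; with the sorted layout only the one or two files covering the requested range need to be read.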

By Seshendranath Balla Venkata

Rethinking Java CRUDs With Event Sourcing and CQRS Patterns

Traditional CRUD systems store only the current state of an entity. When a record is updated, the previous value is overwritten and lost forever. Event Sourcing inverts this model: instead of persisting state, the system persists the sequence of events that caused each state transition. The current state is never stored directly; it is always derived by replaying the event history.

Command Query Responsibility Segregation (CQRS) separates the write model from the read model. A command expresses intent to change state, for example PlaceOrder, AddItem, ShipOrder. A query reads state without modifying it. The two sides use separate models, separate logic, and, in a full implementation, separate storage. CQRS and event sourcing are complementary: the event stream is the write side's source of truth, while one or more projections (read models) are derived from those events for fast querying.

This article shows how to apply these concepts in practice, using as an illustration a modified version of one of Markus Eisele's articles, published on Substack on the 27th of December 2025. In his article, Markus shows a Quarkus-based project implementing a simplified order management system. Here, for a change, I'm presenting a Spring Boot implementation of the same system. You can find it here.

Terminology

In a *classical* order management system, by analyzing the associated data model, we can gather a lot of information about orders and their flow in the organization. But while we would be able to account for any order's current status, the data and the data model analysis wouldn't allow us to reconstitute the story of how each order got to its current state.

Event Sourcing

The event sourcing pattern introduces the dimension of time into the data model. Instead of a schema reflecting the orders' current state, an event-sourcing-based system persists events documenting every change in the orders' lifecycle.
Then, by querying these events, we can reconstitute the whole story of a given order, or of any other aggregate, from its initial creation until its current status.

CQRS

The only problem here is that querying the event history of a single aggregate instance at a time doesn't allow us to retrieve and consolidate data relative to other aggregates in the data model. Hence the CQRS pattern, closely related to event sourcing, which is designed to materialize projected models into logical data structures reliable enough to support flexible querying options.

Commands

CQRS dedicates commands to executing operations that modify the system state. The command-based execution model is then the only one able to implement business logic, to validate rules, and to enforce invariants.

Projections

The system can define as many models as required to provide data to users or to other systems. Thus, a read model is a fast, denormalized, and pre-cached projection containing read-only data that the application needs to answer queries. The system projects changes from the command execution model to all its read models. The projection notion is similar to that of a materialized view in relational databases, meaning that whenever the source tables are updated, the changes have to be reflected in all the read model views.

Model Segregation

In a CQRS architecture, the responsibilities of the system's models are segregated according to their type. A command can only operate on its own execution model, while a query cannot directly modify any of the system's persisted state.

A Use Case

The use case presented here is a *true* CQRS implementation (not just a naming convention) because:

- The write path never reads from the read model. CommandHandler reconstructs state exclusively by replaying events from the event store via EventProjection.replayEvents(). It never touches OrderReadModel or OrderRepository.
- The read path never touches the event store. OrderResource.getOrderReadModel() reads directly from the denormalized ORDERS table. It is a pure query with no business logic.
- There are two physically distinct storage tables: EVENT_STORE (write side) and ORDERS (read side).
- The read model is a projection, not a view. OrderProjection listens to domain events and rebuilds the read model incrementally. The ORDERS table could be dropped and rebuilt from scratch by replaying the event store.
- Commands return CommandResult, a sealed type that communicates success or failure without leaking state. The caller must query the read model separately if it needs current state.

Let's now look at the project's key implementation details.

Modeling State With Records

OrderState is a Java record, immutable by construction: no setters, no mutation. Every command produces a *new* state object:

Java
    public record OrderState(
        UUID orderId,
        String customerEmail,
        List<OrderLine> items,
        OrderStatus status,
        BigDecimal total
    ) {
        public static OrderState initial(UUID orderId, String email) { ... }
        public static OrderState empty() { ... }
    }

OrderLine is likewise a record, with a derived value:

Java
    public record OrderLine(String productName, int quantity, BigDecimal price) {
        public BigDecimal lineTotal() {
            return price.multiply(BigDecimal.valueOf(quantity));
        }
    }

lineTotal() is a derived value rather than a record component: it is computed, not stored, demonstrating that records can carry behavior alongside their data.

Events as a Sealed Type Hierarchy

OrderEvent is a sealed interface, restricting all permitted implementations to a known, closed set:

Java
    public sealed interface OrderEvent permits
            OrderEvent.OrderPlaced, OrderEvent.ItemAdded, OrderEvent.ItemRemoved,
            OrderEvent.OrderCancelled, OrderEvent.OrderShipped {

        UUID orderId();

        OrderState applyTo(OrderState current);

        record OrderPlaced(UUID orderId, String customerEmail) implements OrderEvent {
            public OrderState applyTo(OrderState s) {
                return OrderState.initial(orderId, customerEmail);
            }
        }
        // ... other event types
    }

Using a sealed interface means the compiler enforces exhaustiveness in switch expressions. Adding a new event type without handling it is a compile error, not a runtime surprise. Each event carries only the data it needs and knows how to apply itself to the current state via applyTo(OrderState). This is the self-describing event pattern.

The Fold (Event Replay)

A fold, also known as a left reduction, is the process of reconstructing state by applying each event of the stream, in order, to an accumulated state:

Java
    // EventProjection.java
    public OrderState replayEvents(List<OrderEvent> events) {
        return events.stream()
            .reduce(OrderState.empty(), this::apply, (a, b) -> b);
    }

    private OrderState apply(OrderState state, OrderEvent event) {
        return event.applyTo(state);
    }

OrderState.empty() is the identity element, or seed. Each event is a step function that transforms one immutable state into the next. This is pure functional programming: no side effects, no shared mutable state, entirely deterministic and testable in isolation.

Commands as Sealed Records

Commands are records nested in a sealed container interface:

Java
    public sealed interface Command permits
            Command.PlaceOrderCommand, Command.AddItemCommand,
            Command.ShipOrderCommand, Command.CancelOrderCommand {

        record PlaceOrderCommand(String customerEmail) implements Command {}
        record AddItemCommand(UUID orderId, String productName, int quantity, BigDecimal price) implements Command {}
        // ...
    }

Records under a sealed interface give commands value semantics (equality by content, toString for free) and type safety (exhaustive pattern matching in the handler).
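The fold and the compiler-checked exhaustiveness described above can be seen together in a small, self-contained sketch. It deliberately does not use the project's classes: CartEvent, FoldSketch, and the integer running total are hypothetical names chosen to keep the example short, and the sketch assumes Java 21 for pattern matching in switch.

```java
import java.util.List;

// Hypothetical mini-domain (not the article's classes): a running total driven by events.
sealed interface CartEvent permits CartEvent.Added, CartEvent.Removed, CartEvent.Cleared {
    record Added(int amount) implements CartEvent {}
    record Removed(int amount) implements CartEvent {}
    record Cleared() implements CartEvent {}
}

final class FoldSketch {

    // Step function: takes the current state and one event, returns the next state.
    // There is no default branch, so the compiler rejects this switch
    // as soon as a permitted event type is left unhandled.
    static int apply(int total, CartEvent event) {
        return switch (event) {
            case CartEvent.Added a -> total + a.amount();
            case CartEvent.Removed r -> total - r.amount();
            case CartEvent.Cleared c -> 0;
        };
    }

    // Left fold: start from the seed (0) and apply the events strictly in order.
    static int replay(List<CartEvent> events) {
        int total = 0;
        for (CartEvent event : events) {
            total = apply(total, event);
        }
        return total;
    }
}
```

The replay is written as a plain loop to make the ordering explicit. The three-argument Stream.reduce expects a combiner that can merge partial results, so a combiner such as (a, b) -> b is only safe as long as the stream stays sequential.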
Command Results as Sealed Types

CommandResult is a sealed interface expressing all possible outcomes without exceptions:

Java
    public sealed interface CommandResult permits
            CommandResult.Success, CommandResult.InvalidState,
            CommandResult.NotFound, CommandResult.ValidationError {

        record Success(UUID aggregateId) implements CommandResult {}
        record InvalidState(String message) implements CommandResult {}
        record NotFound(String message) implements CommandResult {}
        record ValidationError(String message) implements CommandResult {}
    }

The caller can switch on the result exhaustively. There are no checked exceptions, no nullable returns, and the type system documents all possible failure modes.

The Event Store

EventStore is the write-side infrastructure. It does two things atomically:

- Persists the event to EVENT_STORE (JPA via EventRepository).
- Publishes the event to the Spring application event bus.

Java
    public void append(UUID aggregateId, String aggregateType, OrderEvent event) {
        int version = nextVersion(aggregateId);
        // Assumption: the event's simple class name serves as its type tag
        String eventType = event.getClass().getSimpleName();
        String json = objectMapper.writeValueAsString(event);
        StoredEvent entity = new StoredEvent(aggregateId, aggregateType, version, eventType, json);
        eventRepository.save(entity);
        applicationEventPublisher.publishEvent(event);
    }

Versioning provides a lightweight optimistic concurrency guard: because each aggregateId + version pair must be unique, two concurrent writers cannot both append the same version and corrupt the stream.

The Read-Side Projection

OrderProjection is a Spring component that listens for domain events and updates the read model:

Java
    @TransactionalEventListener(phase = TransactionPhase.AFTER_COMMIT)
    public void on(OrderEvent event) {
        OrderReadModel model = orderRepository.findByOrderId(event.orderId())
            .orElse(new OrderReadModel());
        // update fields from event ...
        orderRepository.save(model);
    }

@TransactionalEventListener(phase = AFTER_COMMIT) ensures the read model is only updated after the event store transaction commits successfully, thereby preventing phantom updates if the write-side transaction rolls back.

Running the Application

Prerequisites: Java 21, Maven, Docker (for PostgreSQL via Testcontainers in tests).

Shell
    # Build and run all tests (requires Docker)
    ./mvnw clean package

    # Run the application (requires a running PostgreSQL instance)
    ./mvnw spring-boot:run

    # Skip tests
    ./mvnw clean package -DskipTests

API Reference

Method  Path                     Description
POST    /orders                  Place a new order
POST    /orders/{id}/items       Add an item to an order
POST    /orders/{id}/ship        Ship an order
POST    /orders/{id}/cancel      Cancel an order
GET     /orders/{id}             Reconstruct current state from events
GET     /orders/{id}/events      Retrieve the full event stream
GET     /orders/{id}/read-model  Retrieve the denormalized read model

Place an order:

JSON
    POST /orders
    { "customerEmail": "[email protected]" }

Add an item:

JSON
    POST /orders/{id}/items
    { "productName": "Widget", "quantity": 3, "price": 9.99 }

Ship an order:

JSON
    POST /orders/{id}/ship
    { "trackingNumber": "TRACK-001" }
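Since every command endpoint listed above ultimately hands a CommandResult back to its caller, one plausible way to consume it is an exhaustive switch that maps each outcome to an HTTP status. The sketch below repeats the article's CommandResult definition so that it compiles on its own; ResultMapping, toHttpStatus, and the chosen status codes are assumptions made for illustration, not the project's actual controller code.

```java
import java.util.UUID;

// Same shape as the article's CommandResult, repeated here so the sketch is self-contained.
sealed interface CommandResult permits
        CommandResult.Success, CommandResult.InvalidState,
        CommandResult.NotFound, CommandResult.ValidationError {

    record Success(UUID aggregateId) implements CommandResult {}
    record InvalidState(String message) implements CommandResult {}
    record NotFound(String message) implements CommandResult {}
    record ValidationError(String message) implements CommandResult {}
}

final class ResultMapping {

    // Hypothetical mapping: the status codes are an assumption, not taken from the project.
    // Without a default branch, adding a fifth CommandResult variant
    // makes this method fail to compile until the new case is handled.
    static int toHttpStatus(CommandResult result) {
        return switch (result) {
            case CommandResult.Success s -> 201;          // command accepted
            case CommandResult.InvalidState i -> 409;     // e.g. shipping a cancelled order
            case CommandResult.NotFound n -> 404;         // unknown aggregate id
            case CommandResult.ValidationError v -> 400;  // malformed input
        };
    }
}
```

Keeping this translation at the HTTP boundary leaves the command handler free of web concerns: it reports what happened, and the resource layer decides how to present it.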

By Nicolas Duminil


Building Enterprise-Grade Real-Time IoT Dashboards with Vue 3, MQTT, and Kafka

The convergence of IoT, real-time data streaming, and modern frontend frameworks is redefining how engineers build enterprise monitoring systems. Over the course of designing and leading the Device IoT Platform (an enterprise-grade solution for real-time monitoring, configuration, and diagnostics of thousands of distributed network devices), I encountered and solved a core architectural challenge: how do you build a frontend dashboard that can handle hundreds of concurrent device telemetry streams without sacrificing performance, maintainability, or user experience?

This article shares the architectural patterns, technology decisions, and hard-won lessons from that journey, covering the full stack from MQTT ingestion to Vue 3 reactivity to Kafka-backed event processing.

The Core Problem: Real-Time at Scale

Most developers are familiar with polling: periodically fetching data from an API endpoint. For IoT, polling is fundamentally broken:

- Latency: A 5-second polling interval means up to 5 seconds of stale state.
- Inefficiency: You're requesting data even when nothing has changed.
- Scale: 1,000 devices × 1 request/5s = 200 requests/second just to read status, before any user interaction.

The solution is an event-driven architecture: devices push telemetry when something changes, and the platform reacts. This requires rethinking both backend ingestion and frontend state management.
Architecture Overview

Plain Text
    [IoT Devices]
         |
    MQTT Broker (Mosquitto / AWS IoT Core)
         |
    [Node.js Telemetry Microservice]
         |
    [Kafka Topic: device.telemetry.raw]
         |  (stream processor)
    [Kafka Topic: device.telemetry.enriched]
         |
    [WebSocket Server (Node.js)]
         |
    [Vue 3 Dashboard Frontend]

Each layer has a distinct responsibility:

- MQTT Broker handles lightweight publish/subscribe with devices using minimal overhead.
- Node.js microservices bridge MQTT → Kafka, performing initial validation and normalization.
- Kafka provides durable, replayable event streams, which are critical for audit trails and late-joining consumers.
- WebSocket server fans out enriched telemetry to connected dashboard clients in real time.
- Vue 3 handles reactive rendering, ensuring only the affected UI components re-render when new data arrives.

Backend: MQTT → Kafka Bridge in Node.js

The heart of the ingestion pipeline is a lightweight Node.js service using the mqtt and kafkajs libraries.

TypeScript
    import mqtt from 'mqtt';
    import { Kafka } from 'kafkajs';

    const mqttClient = mqtt.connect(process.env.MQTT_BROKER_URL!, {
      clientId: `telemetry-bridge-${process.pid}`,
      username: process.env.MQTT_USERNAME,
      password: process.env.MQTT_PASSWORD,
      clean: true,
    });

    const kafka = new Kafka({ clientId: 'iot-bridge', brokers: [process.env.KAFKA_BROKER!] });
    const producer = kafka.producer();

    mqttClient.on('connect', async () => {
      await producer.connect();
      mqttClient.subscribe('devices/+/telemetry', { qos: 1 });
      console.log('MQTT → Kafka bridge active');
    });

    mqttClient.on('message', async (topic, payload) => {
      // Topic shape is devices/<deviceId>/telemetry
      const deviceId = topic.split('/')[1];
      try {
        const data = JSON.parse(payload.toString());
        await producer.send({
          topic: 'device.telemetry.raw',
          messages: [
            {
              key: deviceId,
              value: JSON.stringify({ deviceId, timestamp: Date.now(), ...data }),
            },
          ],
        });
      } catch (err) {
        // A malformed payload or a failed send must not crash the bridge
        console.error(`Dropping telemetry message from ${deviceId}`, err);
      }
    });

Key design decisions here:

- QoS Level 1: ensures at-least-once delivery for telemetry messages.
  For command acknowledgments, we use QoS 2.
- Device ID as Kafka partition key: guarantees ordering per device while allowing parallel processing across partitions.
- Separation of raw vs. enriched topics: the device.telemetry.raw topic contains the bare payload; a downstream stream processor enriches it with device metadata, geolocation, and alert thresholds before publishing to device.telemetry.enriched.

WebSocket Fan-Out Server

The WebSocket layer subscribes to Kafka's enriched topic and pushes updates to connected browser clients. We use Kafka consumer groups to allow horizontal scaling of the WebSocket tier. Note that instances sharing one group ID each receive only a subset of the partitions, so a client must either be routed to the instance that consumes its devices' partitions, or each instance needs its own group ID.

TypeScript
    import { WebSocketServer, WebSocket } from 'ws';
    import { Kafka } from 'kafkajs';

    const wss = new WebSocketServer({ port: 8080 });
    const kafka = new Kafka({ clientId: 'ws-fanout', brokers: [process.env.KAFKA_BROKER!] });
    const consumer = kafka.consumer({ groupId: 'websocket-fanout-group' });

    // Track subscriptions: deviceId → Set<WebSocket>
    const deviceSubscriptions = new Map<string, Set<WebSocket>>();

    wss.on('connection', (ws) => {
      ws.on('message', (msg) => {
        const { action, deviceId } = JSON.parse(msg.toString());
        if (action === 'subscribe') {
          if (!deviceSubscriptions.has(deviceId)) {
            deviceSubscriptions.set(deviceId, new Set());
          }
          deviceSubscriptions.get(deviceId)!.add(ws);
        }
      });

      // Remove the closed socket from every device it was subscribed to
      ws.on('close', () => {
        deviceSubscriptions.forEach((clients) => clients.delete(ws));
      });
    });

    async function startKafkaConsumer() {
      await consumer.connect();
      await consumer.subscribe({ topic: 'device.telemetry.enriched' });
      await consumer.run({
        eachMessage: async ({ message }) => {
          const payload = JSON.parse(message.value!.toString());
          const clients = deviceSubscriptions.get(payload.deviceId);
          clients?.forEach((client) => {
            if (client.readyState === WebSocket.OPEN) {
              client.send(JSON.stringify(payload));
            }
          });
        },
      });
    }

    startKafkaConsumer();

This design enables selective subscription: a dashboard user viewing 50 devices only receives telemetry for those 50 devices, not the full
firehose. This is critical for performance at scale.

Frontend: Vue 3 Reactive Architecture

The frontend is built with the Vue 3 Composition API plus Pinia for state management. The goal is to update only the UI components bound to a specific device when its telemetry arrives, not to re-render the entire dashboard.

WebSocket Composable

TypeScript
    // composables/useDeviceTelemetry.ts
    import { onMounted, onUnmounted } from 'vue';
    import { useDeviceStore } from '@/stores/deviceStore';

    export function useDeviceTelemetry(deviceIds: string[]) {
      const store = useDeviceStore();
      let ws: WebSocket | null = null;
      // Consecutive failed connections; drives the backoff delay
      let reconnectAttempts = 0;

      const connect = () => {
        ws = new WebSocket(import.meta.env.VITE_WS_URL);

        ws.onopen = () => {
          reconnectAttempts = 0;
          deviceIds.forEach((id) => {
            ws!.send(JSON.stringify({ action: 'subscribe', deviceId: id }));
          });
        };

        ws.onmessage = (event) => {
          const telemetry = JSON.parse(event.data);
          store.updateDeviceTelemetry(telemetry.deviceId, telemetry);
        };

        ws.onclose = () => {
          // Exponential backoff reconnection, capped at 30 seconds
          setTimeout(connect, Math.min(1000 * 2 ** reconnectAttempts++, 30000));
        };
      };

      onMounted(connect);
      onUnmounted(() => {
        if (ws) {
          // Detach the handler first so an intentional close does not trigger a reconnect
          ws.onclose = null;
          ws.close();
        }
      });
    }

Pinia Store with Fine-Grained Reactivity

TypeScript
    // stores/deviceStore.ts
    import { defineStore } from 'pinia';
    import { reactive } from 'vue';

    interface DeviceTelemetry {
      deviceId: string;
      status: 'online' | 'offline' | 'degraded';
      signalStrength: number;
      latency: number;
      lastSeen: number;
      alerts: string[];
    }

    export const useDeviceStore = defineStore('devices', () => {
      const telemetryMap = reactive<Record<string, DeviceTelemetry>>({});

      function updateDeviceTelemetry(deviceId: string, data: Partial<DeviceTelemetry>) {
        if (!telemetryMap[deviceId]) {
          telemetryMap[deviceId] = {} as DeviceTelemetry;
        }
        // Merge in place so only the changed properties notify their dependents
        Object.assign(telemetryMap[deviceId], data);
      }

      return { telemetryMap, updateDeviceTelemetry };
    });

Using reactive() with a map structure means Vue's dependency tracking is at the property level: a component subscribed to telemetryMap['device-001'].signalStrength won't
re-render when device-002's data changes. This is the key to dashboard scalability.

Device Card Component

Vue
    <template>
      <div class="device-card" :class="statusClass">
        <div class="device-header">
          <span class="device-id">{{ deviceId }}</span>
          <StatusBadge :status="telemetry?.status" />
        </div>
        <div class="metrics">
          <MetricBar label="Signal" :value="telemetry?.signalStrength" unit="dBm" />
          <MetricBar label="Latency" :value="telemetry?.latency" unit="ms" />
        </div>
        <AlertList :alerts="telemetry?.alerts ?? []" />
      </div>
    </template>

    <script setup lang="ts">
    import { computed } from 'vue';
    import { useDeviceStore } from '@/stores/deviceStore';

    const props = defineProps<{ deviceId: string }>();
    const store = useDeviceStore();

    // Only this device's slice of state, so re-renders stay targeted
    const telemetry = computed(() => store.telemetryMap[props.deviceId]);

    const statusClass = computed(() => ({
      'status-online': telemetry.value?.status === 'online',
      'status-offline': telemetry.value?.status === 'offline',
      'status-degraded': telemetry.value?.status === 'degraded',
    }));
    </script>

Performance Optimizations

1. Virtual Scrolling for Large Device Lists

When monitoring 500+ devices, rendering all device cards simultaneously tanks performance. We use vue-virtual-scroller to render only the visible cards:

Vue
    <RecycleScroller
      class="device-list"
      :items="filteredDevices"
      :item-size="120"
      key-field="deviceId"
      v-slot="{ item }"
    >
      <DeviceCard :device-id="item.deviceId" />
    </RecycleScroller>

2. Debounced Batch Updates

Devices can emit bursts of telemetry. Updating the Pinia store on every single message causes excessive re-renders. We batch incoming messages within a 100ms window:

TypeScript
    let pendingUpdates: Record<string, Partial<DeviceTelemetry>> = {};
    let batchTimeout: ReturnType<typeof setTimeout> | null = null;

    function queueUpdate(deviceId: string, data: Partial<DeviceTelemetry>) {
      // Merge with any update already queued for this device
      pendingUpdates[deviceId] = { ...(pendingUpdates[deviceId] ??
        {}), ...data };

      if (!batchTimeout) {
        batchTimeout = setTimeout(() => {
          // Flush everything collected during the window in one pass
          Object.entries(pendingUpdates).forEach(([id, update]) => {
            store.updateDeviceTelemetry(id, update);
          });
          pendingUpdates = {};
          batchTimeout = null;
        }, 100);
      }
    }

3. Lazy Loading and Code Splitting

Device diagnostic panels (charts, event logs, configuration editors) are loaded on demand:

TypeScript
    const DeviceDiagnostics = defineAsyncComponent(
      () => import('@/components/DeviceDiagnostics.vue')
    );

Combined with route-level code splitting, the initial bundle stays under 200KB gzipped.

Security Architecture: OAuth 2.0 + RBAC

Device management platforms require fine-grained access control. Not every user should be able to issue firmware update commands to production devices.

JWT Claims-Based RBAC

We encode role information directly in the JWT access token:

JSON
    {
      "sub": "user-123",
      "roles": ["device:read", "device:configure"],
      "scope": "region:us-east",
      "exp": 1699999999
    }

The frontend reads these claims to conditionally render action buttons, and the backend validates them on every API call:

TypeScript
    // middleware/rbac.ts
    export function requirePermission(permission: string) {
      return (req: Request, res: Response, next: NextFunction) => {
        // Expect an "Authorization: Bearer <token>" header
        const token = req.headers.authorization?.split(' ')[1];
        if (!token) {
          return res.status(401).json({ error: 'Missing access token' });
        }
        const decoded = verifyJWT(token);
        if (!decoded.roles.includes(permission)) {
          return res.status(403).json({ error: 'Insufficient permissions' });
        }
        next();
      };
    }

    // Route definition
    router.post('/devices/:id/firmware', requirePermission('device:firmware'), handleFirmwareUpdate);

Deployment: CI/CD on AWS

The entire platform is containerized and deployed via a GitLab CI/CD pipeline to AWS ECS with Fargate.

YAML
    # .gitlab-ci.yml (excerpt)
    stages:
      - test
      - build
      - deploy

    build-and-push:
      stage: build
      script:
        - docker build -t $ECR_REGISTRY/iot-frontend:$CI_COMMIT_SHA .
        - docker push $ECR_REGISTRY/iot-frontend:$CI_COMMIT_SHA

    deploy-production:
      stage: deploy
      script:
        - aws ecs update-service --cluster iot-platform --service frontend --force-new-deployment
      environment: production
      only:
        - main

Blue-green deployments ensure zero downtime for this 24/7 critical infrastructure platform.

Results and Key Metrics

After migrating from a polling-based architecture to this event-driven stack:

- Dashboard latency: reduced from 5–10 seconds (polling) to under 200ms (WebSocket push).
- Backend API load: reduced by ~78%, as telemetry pushes replaced constant polling.
- Frontend bundle size: kept under 220KB gzipped through lazy loading and tree-shaking.
- Throughput: validated at 10,000 concurrent telemetry events/second through Kafka partitioning.

Conclusion

Building a real-time IoT dashboard at enterprise scale requires rethinking the entire data flow, from device communication protocols through streaming pipelines to fine-grained frontend reactivity. The combination of MQTT for lightweight device communication, Kafka for durable event streaming, WebSockets for real-time push to browsers, and Vue 3's targeted reactivity model creates a system that scales gracefully without sacrificing developer ergonomics.

The patterns described here (selective WebSocket subscriptions, batched Pinia updates, virtual scrolling, and JWT-based RBAC) have been validated in production on a platform serving critical network infrastructure. They are broadly applicable to any domain requiring real-time monitoring at scale: energy management, fleet tracking, industrial automation, and beyond.

GitHub: Real-Time-IoT-Dashboards-Vue-3-MQTT-Kafka

By Venkata Sandeep Dhullipalla

One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes

TL;DR

A single straggling node held up a 4-node distributed training job. We found it by fanning out one SQL query to all four nodes and getting the answer in under a second. This is distributed GPU training debugging with eBPF – no central service, no Prometheus, no time-series database, just the same single-binary agent already running on each machine.

The Problem We Kept Hitting

We’ve been building Ingero, an eBPF agent that traces CUDA API calls and host kernel events to explain GPU latency. Until v0.9, it was single-node only. Trace one machine, explain what happened on that machine. For single-GPU inference or training, that worked well.

But distributed training spreads the debugging surface across machines. When a 4-node DDP job slows down, the question is always: which node? And then: why? nvidia-smi on each machine reports healthy utilization. dstat shows nothing obvious. The typical workflow is SSH-ing into each box, eyeballing logs, diffing timestamps across terminals, and hoping the issue is still happening.

We wanted a cross-node investigation without adding infrastructure. The question was: what’s the simplest architecture that works?

What We Shipped in v0.9.1

Three features, all built on top of the existing per-node agent. No new services, no new daemons, no new ports.

1. Node Identity

Every event now carries a node tag. The agent stamps each event with a name from a --node flag, an ingero.yaml config value, or the hostname as fallback:

Shell
    sudo ingero trace --node gpu-node-01

Event IDs become node-namespaced (gpu-node-01:4821) so databases from different nodes can merge without collisions. For torchrun workloads, rank and world size are auto-detected from environment variables (RANK, LOCAL_RANK, WORLD_SIZE); no extra configuration is needed.

2. Fleet Fan-Out Queries

Each Ingero agent already exposes a dashboard API over HTTPS (TLS 1.3, auto-generated ECDSA P-256 cert if no custom cert is provided).
The new fleet client sends the same query to every node in parallel, collects the results, and concatenates them with a node column prepended. For production clusters, the client supports mTLS (--ca-cert, --client-cert, --client-key) so both sides authenticate. Plain HTTP is available via --no-tls but requires an explicit opt-in, and even then, it’s intended for trusted VPC networks only.

The --nodes flag works for ad-hoc queries, but for anything beyond a handful of nodes, the node list goes into ingero.yaml once and every command picks it up automatically:

YAML
    fleet:
      nodes:
        - gpu-node-01:8080
        - gpu-node-02:8080
        - gpu-node-03:8080
        - gpu-node-04:8080

A full example config is in configs/ingero.yaml. Here’s what it looked like when we ran it against a 4-node cluster where one node was misbehaving:

Shell
    $ ingero query --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 \
        "SELECT node, source, count(*) as cnt, avg(duration)/1000 as avg_us FROM events GROUP BY node, source"

    node              source  cnt    avg_us
    ----------------  ------  -----  ------
    gpu-node-01       4       11009  5.2
    gpu-node-01       3       847    18400   # ← 9x higher than peers
    gpu-node-02       4       10892  5.1
    gpu-node-02       3       412    2100
    gpu-node-03       4       10847  5.3
    gpu-node-03       3       398    1900
    gpu-node-04       4       10901  5.0
    gpu-node-04       3       421    2200

    8 rows from 4 node(s)

Node 1 jumps out immediately: 847 host events at 18.4ms average, while the other three sit around 2ms.
One more command to see the causal chains:

Shell
    $ ingero explain --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080

    FLEET CAUSAL CHAINS - 2 chain(s) from 4 node(s)

    [HIGH] [gpu-node-01] cuLaunchKernel p99=843us (63.9x p50) - 847 sched_switch events + heavy block I/O
      Root cause: 847 sched_switch events + heavy block I/O
      Fix: Pin training process to dedicated cores with taskset; Add nice -n 19 to background jobs

    [MEDIUM] [gpu-node-01] cuMemAlloc p99=932us (5.0x p50) - 855 sched_switch events + heavy block I/O
      Root cause: 855 sched_switch events + heavy block I/O
      Fix: Pin training process to dedicated cores with taskset

Both chains are on gpu-node-01. The other three nodes have zero issues. The root cause: CPU contention from block I/O, with checkpoint writes preempting the training process. Two commands to go from “distributed training is slow” to “pin the training process on node 1 and investigate the I/O source.”

3. Offline Merge and Perfetto Export

Not every environment allows live HTTP queries between nodes. Air-gapped clusters, locked-down VPCs, compliance constraints: there are real reasons the network path isn’t always available. For those cases, ingero merge combines SQLite databases from each node into a single queryable file:

Shell
    # 1. Collect traces from each node
    scp gpu-node-01:~/.ingero/ingero.db node-01.db
    scp gpu-node-02:~/.ingero/ingero.db node-02.db

    # 2. Merge and analyze
    ingero merge node-01.db node-02.db -o cluster.db
    ingero explain -d cluster.db

Stack traces are deduplicated by hash. Events keep their node-namespaced IDs. Old databases that predate the node column work with --force-node.

For visual timeline analysis, ingero export --format perfetto produces a Chrome Trace Event Format JSON that opens in ui.perfetto.dev. Each node gets its own process track. Causal chains show up as severity-colored markers. The straggler is visible at a glance in the timeline.
Why We Built It This Way

The obvious approach to multi-node observability is a central collector: ship events to a time-series database, build dashboards, set up alerts. Prometheus, Datadog, Honeycomb: the well-trodden path. We deliberately avoided that.

No new infrastructure. Ingero is a zero-config, single-binary agent with no dependencies. Adding a central collector contradicts that. The fleet client is 400 lines of Go in the existing binary. It reuses the HTTPS API the agent already exposes. Nothing new to deploy, nothing new to secure: the same TLS 1.3 + mTLS configuration that protects a single node’s dashboard protects the entire fleet.

Client-side fan-out is simple and sufficient. The CLI sends concurrent HTTP requests, collects results, and merges them locally. A sync.WaitGroup, some JSON decoding, column concatenation. No distributed query planning, no consensus protocol, no coordinator election. For 4-50 nodes, this is the right level of complexity.

Partial failure is first-class. If one node is unreachable, results from the others still come back, plus a warning. No all-or-nothing semantics. In practice, the unreachable node is often the one in trouble, and knowing which nodes failed is diagnostic information in itself.

Clock skew is measured, not ignored. eBPF timestamps come from bpf_ktime_get_ns() (CLOCK_MONOTONIC), which is per-machine. When correlating events across nodes, clock differences matter. The fleet client runs NTP-style offset estimation in parallel with the actual query (3 samples per node, median filter). On a typical LAN with sub-millisecond RTT, precision should be well under 10ms. If skew exceeds a threshold, it warns. This adds zero latency since it runs concurrently with the data query.

Offline merge covers air-gapped environments. Some production GPU clusters have no internal HTTP connectivity between nodes. SCP the databases, merge locally, investigate.
The merge path also serves as a permanent record of the cluster state at investigation time.

MCP: AI-Driven Fleet Investigation

The fleet is also accessible through Ingero’s MCP server via the query_fleet tool. Here’s what the raw tool output looks like for a chains query across the same 4-node cluster:

Python
    query_fleet(action="chains", since="5m")

    Fleet Chains: 2 chain(s)
    [HIGH] gpu-node-01 | cuLaunchKernel p99=843us (63.9x p50) | 847 sched_switch events + heavy block I/O
    [MEDIUM] gpu-node-01 | cuMemAlloc p99=932us (5.0x p50) | 855 sched_switch events + heavy block I/O

That’s the complete response: an AI assistant gets this back from one tool call, with no SSH access to each node and no manual SQL. The tool supports four actions: chains (causal analysis), sql (arbitrary queries), ops (operation breakdown per node), and overview (event counts). Clock skew warnings are prepended automatically when detected.

Where This Stands

v0.9.1 is the initial step in cluster-level tracing, not the destination.

What we have now works well for the reactive investigation workflow: something went wrong, we need to find out what and where. Fan-out queries, offline merge, Perfetto export: these are diagnostic tools for after the fact.

We’re actively working on cross-node correlation and straggler detection; more updates are coming soon. And since the instrumentation sits on host-level eBPF rather than vendor-specific hooks, none of this is limited to a specific GPU vendor.

The bet is that client-side fan-out scales to 50+ nodes before anything centralized is needed. When it doesn’t, the node-namespaced ID scheme and offline merge path ensure the architecture can evolve without breaking existing deployments.

We’re stress-testing the fan-out architecture against larger clusters and would welcome feedback from teams running multi-node training. Open an issue on GitHub.
The investigations/ directory has ready-to-query databases for trying this without a GPU cluster:

- sample-gpu-node-01.db, sample-gpu-node-02.db, sample-gpu-node-03.db – individual node traces from a 3-node cluster
- sample-cluster.db – all three merged into one (600 events, 6 chains, 9 stacks)

GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.

If you are facing distributed training issues in your own workloads, we’d love to take a look. Drop an issue on GitHub, and we will gladly dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

Related Reading

- GPU incident response in 60 seconds with eBPF – single-node investigation workflow that the fleet feature extends
- 11-second time to first token on a healthy vLLM server – kernel-level scheduling contention causing hidden latency, similar to the straggler root cause in this post
- GPU showing 97% utilization while training runs 3x slower – why nvidia-smi metrics alone miss the real story

By Ingero Team

11 Agentic Testing Tools to Know in 2026

Agentic testing tools help teams plan, generate, adapt, and run tests with far less manual effort. They’re quickly becoming part of how modern QA scales without slowing delivery.

One thing to get right from the start is scope. Not all agentic testing tools operate at the same level of scope or strategic impact; they vary significantly in what they do and where they fit. Some are point solutions that help you author or run tests faster. Others sit inside broader AI-driven quality platforms that prioritize risk, optimize test portfolios, and enforce quality gates across the pipeline.

This post covers 11 agentic testing tools to know about in 2026. They’re grouped so you can compare them based on scope, strengths, and fit for your organization.

What Is an Agentic Testing Tool?

An agentic testing tool is software that uses AI agents to autonomously plan, generate, maintain, and execute tests. It often makes decisions based on context, such as requirements, code changes, risk signals, or past results. It goes beyond AI-assisted automation by adding initiative and workflow-level decision-making. Instead of only suggesting what to do next, it takes action within defined boundaries.

Here are 11 agentic testing tools grouped by scope. Each includes a summary and key strengths and considerations. Let’s go!

Enterprise AI-Driven Quality Platforms

These platforms extend beyond test creation to orchestrate automation, intelligence, and governance at scale. They are suited for organizations that require stability, risk prioritization, and release confidence across complex environments.

1. Tricentis Tosca

Tricentis Tosca is designed for enterprise test automation where stability, scale, and governance matter. In an agentic context, the shift is moving from “write and maintain scripts” to “orchestrate outcomes,” especially across complex apps and high-change environments. Tricentis enables AI-driven testing and agentic quality engineering across your delivery pipeline.
Tricentis also positions the Model Context Protocol (MCP) as a way to bridge AI and testing tools through a universal integration approach, which matters if you’re thinking about agentic workflows that span multiple systems.

Strengths

- Suitable for large regression suites and complex end-to-end workflows.
- AI-assisted resilience helps reduce long-term maintenance costs.

Considerations

- The highest value shows up when teams commit to governance and standardization (not “ad hoc scripts”).
- Adoption typically requires alignment across QA, engineering, and release stakeholders.

2. SmartBear

SmartBear is best viewed as a broad testing portfolio vendor that has been positioning around AI across testing workflows.

Strengths

- Covers multiple testing disciplines.
- Suitable for consolidated vendor strategies.

Considerations

- AI depth varies across products.
- Portfolio integration matters.

3. UiPath Test Suite

UiPath Test Suite extends testing into broader automation ecosystems. In an agentic context, it is relevant for teams that want testing integrated into AI-driven business process automation and orchestration environments.

Strengths

- Aligns testing with broader automation initiatives.
- Fits organizations standardizing around enterprise automation platforms.

Considerations

- Strongest value when already invested in the UiPath ecosystem.
- Organizations must evaluate how deeply autonomous testing workflows integrate with CI/CD.

AI-Native Testing Platforms

AI-native testing platforms are built with AI at the core of test creation and execution workflows. They aim to reduce friction from requirements to automation and help teams maintain speed and stability as systems evolve.

4. ACCELQ

ACCELQ positions itself around AI-powered automation and end-to-end testing acceleration. For agentic buyers, the key question is whether the platform reduces friction from requirements to automation to execution and whether it can keep pace as systems change.

Strengths

- Faster ramp-up for automation.
- Structured automation workflows.
Considerations

- Like any platform, success depends on fit with your stack and operating model.
- Ensure governance and explainability are strong enough for enterprise release standards.

5. mabl

mabl is an AI-native testing vendor geared toward continuous testing and reducing maintenance overhead. For agentic tool evaluation, focus on whether AI helps you run reliably at speed, not just generate tests during setup.

Strengths

- CI/CD integration.
- Automation resilience focus.

Considerations

- Primarily web-centric workflows.
- Enterprise governance depth varies.

6. Functionize

Functionize is commonly positioned as AI-forward test automation focused on reducing manual work across authoring, execution, and maintenance. In a practical agentic sense, tools like this aim to do more of the work for you, especially around test upkeep as systems evolve.

Strengths

- Lifecycle focus: value isn’t only authoring, but also keeping tests healthy over time.
- AI-forward orientation fits teams pushing toward higher autonomy.

Considerations

- Scope depends on team maturity.
- Organizations may need to evaluate governance needs more deeply.

Point-Solution Agentic Tools

Point-solution agentic tools focus on solving a specific testing bottleneck rather than managing the full quality lifecycle. They are often used to accelerate test authoring, execution, or UI interaction without requiring a broader platform shift.

7. testRigor

testRigor is typically associated with natural-language-driven test creation and reducing scripting complexity. For agentic buyers, it often lands in the “make authoring easier” category.

Strengths

- Lower barrier to authoring.
- Rapid initial automation.

Considerations

- Primarily focused on UI regression.
- Potential trade-off between depth and creation speed.

8. QA Wolf

QA Wolf is often positioned around fast test creation and managed execution models for teams that want results without building everything in-house.
In an agentic tooling conversation, this fits as a way to compress time-to-value, especially when internal bandwidth is limited.

Strengths

- Fast time to coverage.
- Managed execution support.

Considerations

- The operational model differs from in-house-only tools.
- Evaluate long-term scaling fit.

9. Virtuoso QA

Virtuoso is frequently grouped with AI-led UI testing approaches that aim to reduce manual scripting and increase resilience. Its relevance depends on whether it meaningfully adapts and maintains tests as the app changes, not just how quickly it creates them.

Strengths

- Faster UI automation creation.
- Reduced scripting complexity.

Considerations

- Validate the reality of flake handling and maintenance in your environment (dynamic UIs expose gaps quickly).
- Ensure pipeline integration and evidence output meet enterprise needs.

10. AskUI

AskUI approaches automation through UI perception and interaction. That can matter when you test across varied front ends, remote desktops, or environments where DOM-level automation is not always feasible.

Strengths

- Useful for UI-driven automation challenges.
- Works across heterogeneous UI surfaces.

Considerations

- Typically narrower in scope than end-to-end platforms.
- Validate stability and evidence outputs for long-running regression usage.

11. CoTester by TestGrid

CoTester lands in the agentic assistant space for testing workflows. Tools in this category typically let you offload specific tasks, helping your team by generating tests, suggesting validations, or scaling coverage with less effort.

Strengths

- Assistant-style support for testing tasks.
- Accelerates defined QA activities.

Considerations

- Not a full end-to-end platform.
- Best as a complementary capability.

How Agentic Technology Applies to Modern Testing

Agentic testing brings the agent loop into quality workflows. It decides what to test, executes the work, evaluates results, and adjusts based on context.
Here’s what that looks like in real delivery pipelines:

- Planning: Interpreting requirements, code changes, and risk signals to select the right tests.
- Execution: Running tests and collecting evidence.
- Adaptation: Repairing brittle selectors and managing flakiness as systems change.
- Governance: Enforcing quality gates based on measurable signals such as coverage and change impact.

Agentic testing is not AI that writes tests. It is AI that runs a quality workflow.

How to Choose the Right Agentic Testing Tool

Buying decisions usually fail for one of two reasons: teams choose a point tool when they actually need a platform, or they buy a platform when they need quick, targeted relief. Use this checklist to avoid both mistakes.

1. Start With Scope: Assistant, Point Solution, or Platform?

Ask one blunt question: Do you need help authoring tests, or do you need help governing release confidence?

2. Demand Measurable Outcomes, Not Demos

Demos can look impressive, but real value shows up in production metrics. Look for clear improvements in regression time, maintenance effort, flake rate, defect escapes, and coverage visibility. If success cannot be measured, ROI will be hard to prove.

3. Validate Governance: Explainability, Auditability, Control

Agentic systems take action, so your team must understand why. You should be able to explain test selection, recent changes, and the evidence behind a release decision, especially in regulated and enterprise environments.

If you want agentic testing that scales beyond a single team or application, you need more than a test generator. You need an AI-driven approach that connects automation, intelligence, and governance.

FAQ: Agentic Testing Tools in 2026

What Makes a Testing Tool Truly Agentic?

A testing tool is truly agentic if it can independently plan and execute testing actions based on context, such as code changes, requirements, or risk signals. It does not just suggest next steps.
It selects tests after a pull request, generates tests from requirements, repairs broken locators, and enforces quality gates with minimal human input.

Are Agentic Testing Tools the Same as AI Test Automation?

No. AI test automation typically assists with parts of automation, such as smarter locators or faster script creation. Agentic testing tools go further by automating decision-making across workflows. They can decide which tests to run for a build, identify untested code changes, and prioritize high-risk areas without manual triage.

What Results Should I Expect From Agentic Testing?

Most teams see measurable improvements in regression cycle time and maintenance effort when agentic workflows are implemented correctly. A realistic benchmark is reducing regression runtime by 30–70% through change-based test selection and cutting maintenance effort by 30–50% through self-healing automation and flake reduction.
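To make the idea of change-based test selection more concrete, here is a minimal Python sketch. It is not taken from any of the tools listed above. It assumes a hypothetical coverage map (which source files each test exercises) and an illustrative risk score per test. The names `SuiteEntry`, `select_tests`, and `untested_changes` are invented for this example. Real platforms typically derive this information from coverage data and change history rather than a hand-written list.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteEntry:
    name: str
    covers: frozenset[str]  # source files the test exercises (hypothetical coverage map)
    risk: int = 1           # higher means more business-critical (illustrative scale)


def select_tests(changed_files: set[str], suite: list[SuiteEntry]) -> list[SuiteEntry]:
    """Return only the tests that touch a changed file, highest risk first."""
    impacted = [t for t in suite if t.covers & changed_files]
    return sorted(impacted, key=lambda t: (-t.risk, t.name))


def untested_changes(changed_files: set[str], suite: list[SuiteEntry]) -> set[str]:
    """Return changed files that no test in the suite covers."""
    covered = set().union(*(t.covers for t in suite))
    return changed_files - covered


suite = [
    SuiteEntry("test_checkout", frozenset({"cart.py", "payment.py"}), risk=3),
    SuiteEntry("test_login", frozenset({"auth.py"}), risk=2),
    SuiteEntry("test_search", frozenset({"search.py"}), risk=1),
]
changed = {"payment.py", "auth.py", "billing.py"}

print([t.name for t in select_tests(changed, suite)])  # ['test_checkout', 'test_login']
print(sorted(untested_changes(changed, suite)))        # ['billing.py']
```

In this sketch, only two of the three tests run for the change set, and `billing.py` is flagged as a change with no covering test. That is the kind of signal a quality gate could act on.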

By Alvin Lee

CORE

Building a Skill-Based Agentic Reviewer with Claude Code: A Practical Guide Using Skills.MD, MCP Servers, Tools, and Tasks

In the evolving landscape of agentic AI development in 2026, combining Anthropic’s open Agent Skills standard with the Model Context Protocol (MCP) enables the creation of highly efficient, portable, and context-aware code reviewers. This article presents a practical, production-ready implementation of a skill-based agentic reviewer tailored for code, pull requests, and technical articles. Leveraging a lightweight SKILL.md file for declarative workflows (with progressive context loading to minimize token usage), parallel sub-agents for specialized checks (security, performance, style, and documentation), and a companion local MCP server exposing deterministic tools (linting, GitHub PR fetching, and vulnerability scanning), the system achieves consistent, auditable, and scalable reviews with minimal manual intervention. The provided architecture and copy-paste code snippets (tested patterns compatible with Claude Code, Cursor, Gemini CLI, and other adopting platforms) demonstrate how to install, customize, and extend the reviewer. Real-world benefits include 3–5× faster review cycles, reduced oversight of AI-generated code, and seamless team sharing via GitHub-hosted skills. This pattern exemplifies the complementary power of Skills (domain expertise and repeatable procedures) and MCP (live tool integration), offering developers a blueprint for building future-proof agentic assistants in modern software engineering workflows.

By leveraging Anthropic’s open Agent Skills standard and the Model Context Protocol (MCP), developers can create powerful, context-efficient AI reviewers that automatically trigger structured workflows for code, pull requests, or technical articles. This article provides a complete, production-ready implementation with copy-paste, executable code snippets that you can deploy today in Claude Code, Cursor, or any compatible agent tool.
Why Skill-Based Agentic Reviewers Matter in 2026

Traditional LLM prompts for code review are brittle: they bloat the context window and lack consistency. Agent Skills address this through progressive disclosure: only the skill’s name and description reside in the system prompt (~100 tokens). The full workflow loads only when relevant.

MCP servers add the “agentic” component: real tools for GitHub API calls, linting, security scanning, or database queries, without custom function-calling glue.

Combine them, and you get a reviewer that:

- Detects review requests automatically
- Runs parallel sub-tasks (security, performance, style)
- Calls external tools via MCP
- Produces consistent, auditable reports

This pattern powers production teams using Claude Code today and works across Claude Code, Cursor, Gemini CLI, and OpenAI Codex CLI thanks to the open Agent Skills specification.

Architecture Overview

```text
Claude Code (LLM)
├── SKILL.md (workflow + checklists) → loaded on demand
├── MCP Server (tools: lint, github_fetch, security_scan)
└── Sub-agents / Tasks (parallel reviewer instances)
```

- Skills = recipes (how to review)
- MCP = kitchen tools (what to review with)
- Tasks/Sub-agents = parallel execution (agentic scaling)

Step 1: Create the Core Skill – agentic-reviewer

Create the directory structure (works in ~/.claude/skills/, ~/.cursor/skills/, or any supported tool):

```shell
mkdir -p ~/.claude/skills/agentic-reviewer
cd ~/.claude/skills/agentic-reviewer
```

Now create the only required file, SKILL.md. It opens with YAML frontmatter:

```yaml
---
name: agentic-reviewer
description: >
  Performs comprehensive agentic reviews of code, PRs, or technical articles.
  Use when the user says "review", "audit", "check quality", "PR review",
  "code review", "article review", or uploads files for feedback.
  Automatically runs security, performance, style, and best-practice checks.
  Can spawn sub-agents and call MCP tools.
version: 1.2
---
```

The Markdown body follows in the same file:

````markdown
# Agentic Reviewer

## When to Activate
- Code files or PRs
- Markdown/technical articles
- Any request containing "review this", "what's wrong with", or "improve"

## Core Review Workflow (always follow in order)
1. **Understand Context**
   Identify language/framework, purpose, and user goals.
2. **Static Analysis**
   Use MCP lint tools if available.
3. **Security & Compliance**
   Use MCP security scanners.
4. **Performance & Scalability**
5. **Style & Maintainability**
   Follow team conventions from `references/`.
6. **Suggestions & Refactoring**
   Provide before/after code.
7. **Summary Report**
   Include severity levels (Critical/High/Medium/Low).

## Sub-Agent Tasks (spawn when complex)
- `security-reviewer`: OWASP Top 10 + secrets scanning
- `perf-reviewer`: Big-O, resource usage, caching
- `docs-reviewer`: Clarity, examples, diagrams

## Output Format
```markdown
## Agentic Review Report
**Overall Score**: XX/100
**Critical Issues**: N
**High Issues**: N

### Findings
- [ ] Category: Description + evidence + fix

### Recommendations
- Code changes (diff format)
- MCP tool calls used

**Final Verdict**: Approved / Needs Work / Blocked
```
````

Best Practices

- Be constructive and specific
- Reference industry standards (e.g., OWASP, Google Java Style)
- Prioritize issues by business impact

How to Activate

Restart the agent (Claude Code / Cursor), or invoke the skill explicitly:

```text
/agentic-reviewer review this PR
```

The skill auto-triggers on natural language. Test it by pasting any code snippet into Claude Code.

Step 2: Add Deterministic Scripts (Optional but Powerful)

Create a simple validator script inside the skill. The scripts/ directory does not exist yet, so create it first:

```shell
mkdir -p scripts
cat > scripts/validate_review.sh << 'EOF'
#!/bin/bash
# Executable script called from SKILL.md
echo "Running automated lint + security baseline..."
# Add your own tools here (e.g., eslint, trivy, etc.)
EOF
chmod +x scripts/validate_review.sh
```

Update SKILL.md:

```markdown
## Workflow (updated)
2. **Static Analysis**
   Run `scripts/validate_review.sh` on provided files.
```

Step 3: Make It Truly Agentic with an MCP Server

Skills provide knowledge. MCP provides live tools.

Install:

```shell
pip install fastmcp
```

Create reviewer-mcp.py:

```python
from fastmcp import FastMCP
import subprocess

mcp = FastMCP("agentic-reviewer-tools")


@mcp.tool
def run_linter(file_path: str, language: str = "python") -> str:
    """Run flake8 on a Python file and return its findings."""
    if language != "python":
        return "Unsupported language"
    try:
        result = subprocess.run(["flake8", file_path], capture_output=True, text=True)
    except FileNotFoundError:
        return "flake8 is not installed or not on PATH"
    return f"Linting results:\n{result.stdout or result.stderr or 'No issues'}"


@mcp.tool
def github_pr_fetch(pr_url: str) -> str:
    """Placeholder: does not call GitHub yet; replace with a real API request."""
    return f"PR fetched from {pr_url}; diff available for review"


@mcp.tool
def security_scan(file_path: str) -> str:
    """Placeholder: always reports success; wire in a real scanner before relying on it."""
    return "✅ No critical secrets found"


if __name__ == "__main__":
    print("Starting MCP server on http://localhost:8080")
    # A port only applies to the HTTP transport; the default transport is stdio.
    mcp.run(transport="http", port=8080)
```

Note that github_pr_fetch and security_scan are stubs that return fixed strings. They show where real integrations plug in, but they do not fetch or scan anything yet. The transport="http" argument assumes a recent FastMCP 2.x release.

Run:

```shell
python reviewer-mcp.py
```

Step 4: Parallel Tasks with Sub-Agents

```markdown
## Parallel Sub-Agent Tasks
When the review is large:
- Spawn security-reviewer
- Spawn perf-reviewer
- Synthesize results in the main agent
```

Testing and Production Tips

- Test locally: Paste a PR diff and say “run agentic review”
- Share with team: Clone the skill repository into the skills directory (command below)
- Distribution: ZIP or publish via marketplace
- Token efficiency: ~100 tokens until triggered
- Versioning: Bump YAML version for updates

```shell
git clone your-repo ~/.claude/skills/agentic-reviewer
```

Real-World Use Cases

- PR Reviews: Auto-fetches diff via MCP + runs full checklist
- Technical Article Review (InfoQ/DZone style): Checks clarity, code accuracy, SEO, and technical depth
- Legacy Code Audit: Spawns 5 sub-agents in parallel
- On-call Incident Review: Pulls logs via MCP and applies security skill

Teams report 3–5× faster reviews with consistent quality and fewer missed issues.
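The security_scan tool in Step 3 returns a fixed success message, so it will never flag anything. As one possible starting point for giving it real behavior, here is a minimal regex-based sketch. The three patterns, the `find_secrets` and `format_report` names, and the report wording are all illustrative assumptions, not part of the original article or of any MCP library. A dedicated secrets scanner covers far more cases and should be preferred for real reviews.

```python
import re

# Illustrative patterns only; dedicated scanners ship far larger rule sets.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    "hard-coded password": re.compile(r"(?i)\bpassword\s*=\s*['\"][^'\"]{4,}['\"]"),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings


def format_report(findings: list[tuple[int, str]]) -> str:
    """Render findings as the plain string a tool function could return."""
    if not findings:
        return "No secrets matched the baseline patterns"
    return "\n".join(f"line {n}: possible {rule}" for n, rule in findings)


sample = 'user = "bob"\npassword = "hunter22"\nkey = "AKIAABCDEFGHIJKLMNOP"\n'
print(format_report(find_secrets(sample)))
```

Inside the MCP server, the body of security_scan could read the file at file_path and return `format_report(find_secrets(contents))`. Keeping the matching logic in plain functions means it can be unit-tested without starting the server.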
Conclusion and Next Steps

The combination of SKILL.md (declarative workflows), MCP servers (executable tools), and tasks/sub-agents (parallelism) turns Claude Code from a helpful assistant into a production-grade reviewer.

Start today:

- Copy the SKILL.md above
- Run the Python MCP server
- Watch Claude automatically become your expert reviewer

The Agent Skills ecosystem is growing quickly, and the agentic-reviewer skill you just built is portable across any tool that adopts the specification.

Resources

- Official Agent Skills Spec: agentskills.io
- FastMCP & MCP servers: mcpservers.org
- Claude Code Skills Marketplace (built-in)

Happy reviewing! Your code (and articles) will thank you.

By Bhaskar Reddy Kollu


The Latest Coding Topics