Welcome to the Data Engineering category of DZone, where you will find all the information you need for AI/ML, big data, data, databases, and IoT. As you determine the first steps for new systems or reevaluate existing ones, you're going to require tools and resources to gather, store, and analyze data. The Zones within our Data Engineering category contain resources that will help you expertly navigate through the SDLC Analysis stage.
Artificial intelligence (AI) and machine learning (ML) are two fields that work together to create computer systems capable of perception, recognition, decision-making, and translation. Separately, AI is the ability of a computer system to mimic human intelligence through math and logic, while ML builds on AI by developing methods that "learn" through experience rather than requiring explicit instruction. In the AI/ML Zone, you'll find resources ranging from tutorials to use cases that will help you navigate this rapidly growing field.
Big data comprises datasets that are massive, varied, complex, and can't be handled traditionally. Big data can include both structured and unstructured data, and it is often stored in data lakes or data warehouses. As organizations grow, big data becomes increasingly crucial for gathering business insights and analytics. The Big Data Zone contains the resources you need for understanding data storage, data modeling, ELT, ETL, and more.
Data is at the core of software development. Think of it as information stored in anything from text documents and images to entire software programs, and these bits of information need to be processed, read, analyzed, stored, and transported throughout systems. In this Zone, you'll find resources covering the tools and strategies you need to handle data properly.
A database is a collection of structured data that is stored in a computer system, and it can be hosted on-premises or in the cloud. As databases are designed to enable easy access to data, our resources are compiled here for smooth browsing of everything you need to know from database management systems to database languages.
IoT, or the Internet of Things, is a technological field that makes it possible for users to connect devices and systems and exchange data over the internet. Through DZone's IoT resources, you'll learn about smart devices, sensors, networks, edge computing, and many other technologies — including those that are now part of the average person's daily life.
PostgreSQL Trigram Similarity vs. Pattern Matching: A Performance Comparison
Secure AI Architecture for Payments: From Risk Signals to Real-Time Decisions
Introduction: From Custom-Code Chaos to Connection Harmony

For decades, connecting software systems meant one thing: custom code. Every new integration required a new connector, another adapter, and a fresh set of maintenance nightmares. The same pain has now reached the world of AI. Developers are writing endless glue code to make large language models (LLMs) talk to databases, APIs, and business tools, and every integration is slightly different.

Enter the Model Context Protocol (MCP), a new open standard that promises to bring order to this chaos. Its philosophy is simple yet powerful: “Build once. Connect anywhere.” Just as APIs standardized how web applications communicate with back-ends and databases, MCP standardizes how AI applications and agents connect to data and tools. In other words, if APIs unlocked the web revolution, MCP may be the key that unlocks the AI revolution.

What Is MCP and Why It Matters

At its core, MCP (Model Context Protocol) is an open protocol that defines how AI systems can securely and consistently access external data, tools, and APIs. It acts as the transport layer between your AI model and the enterprise world, providing a universal bridge that is structured, secure, and reusable. Modern AI agents increasingly need context beyond what’s in their model weights: CRM records, sales data, sensor readings, document repositories, or even REST endpoints. Without a consistent interface, these integrations become brittle and expensive. MCP eliminates that friction. Think of MCP as doing for AI what HTTP and REST did for the web. It’s not another proprietary plugin system or SDK; it’s a standard communication layer that makes AI interoperable across ecosystems.

The Architecture of MCP

The beauty of MCP lies in its simplicity. The architecture follows a client–host–server pattern that scales elegantly across applications.

MCP Client: The client is where your AI application or agent lives. It can be a chatbot, an AI-powered web app, a mobile assistant, or even a background automation worker. The client sends and receives requests using MCP’s standardized message format.
MCP Host: The host acts as the runtime container that carries the client. It’s similar to how a web server hosts a web application, managing lifecycle, permissions, and runtime execution. The host ensures the client adheres to the MCP specification when communicating with servers.
MCP Servers: An MCP environment can have multiple servers, each connecting to a distinct external system: a database connector server might retrieve enterprise data from Postgres or BigQuery, an API connector could communicate with Salesforce, ServiceNow, or SAP, and a file system server might access document repositories or local files. Each server exposes its data and functionality through the MCP standard, meaning the AI client doesn’t need to know whether it’s talking to SQL, REST, or a CSV file; it just speaks MCP.
The Protocol Layer: Between clients and servers sits the MCP protocol layer, the glue that defines how they communicate. This layer handles message routing, schema validation, authentication, and error handling. Because it’s protocol-driven, it ensures that all data exchanges follow a consistent structure, regardless of vendor or platform.

The Three Core Interfaces of MCP

MCP defines three major interfaces that make this interoperability possible:

Tools: These are model-invoked functions that the AI can execute. Think “search the knowledge base,” “send an email,” or “update the database.”
Resources: These represent the data objects accessible to the AI: files, API resources, or database records.
Prompts: These are user-defined templates or triggers that help guide the AI’s reasoning and responses in context.

Together, these interfaces enable the LLM to act, access, and reason through standardized interfaces rather than through hardcoded integrations.
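To make the Tools interface concrete, the sketch below shows, as plain Python dictionaries, what a simplified MCP-style exchange could look like. MCP messages follow JSON-RPC 2.0; the "search_knowledge_base" tool, its schema, and the query text are illustrative assumptions, not taken from any real server.

Python
# Illustrative only: a simplified MCP-style tool discovery and invocation,
# expressed as Python dicts. MCP messages follow JSON-RPC 2.0; the
# "search_knowledge_base" tool and its input schema are hypothetical examples.

list_tools_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

list_tools_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "search_knowledge_base",  # hypothetical tool
                "description": "Full-text search over internal documents",
                "inputSchema": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            }
        ]
    },
}

call_tool_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_knowledge_base",
        "arguments": {"query": "refund policy for enterprise customers"},
    },
}

The point is not the exact payloads but the shape of the contract: the client never learns whether the server answers from SQL, REST, or flat files; it only sees tools, their schemas, and their results.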
How MCP Works: A Real-World Example

Let’s make this concrete with an example: an AI-powered travel assistant built on MCP. The MCP Client is your travel booking app; it could be web-, mobile-, or chat-based. Inside the client runs an LLM, which interprets user requests and plans itineraries. The client connects to multiple MCP Servers: one connects to a Flight Booking API to retrieve schedules and fares, another connects to a Hotel Reservation System, a third links to a Weather API for forecasts, and yet another connects to an Attractions Database for activities.

When a user says, “Plan a seven-day family trip to Japan in April with outdoor and cultural activities,” the client sends that request to the LLM. The model analyzes it and identifies the data it needs: flights, hotels, weather, and attractions, then instructs the client to fetch that information. The MCP client makes standardized requests to each server, collects the responses, and packages them in a uniform format. The LLM synthesizes these results and returns a single, coherent itinerary that includes flights, hotels, weather insights, and personalized recommendations. If the user confirms, the same protocol can trigger bookings across MCP without adding a single line of custom glue code.

Why MCP Is a Game-Changer

Real-Time Data Access: MCP enables AI systems to pull fresh, accurate information from multiple external sources instantly and securely. No more stale embeddings or outdated context windows.
Personalization and Context: Because MCP merges live data streams from various sources, it lets AI systems create deeply personalized, context-aware experiences for every user.
Comprehensive Automation: An MCP-enabled agent can orchestrate multi-system workflows — flights, hotels, calendars, or finance systems in one seamless process.
Adaptability and Extensibility: Need to add restaurant reservations tomorrow? Just plug in a new MCP server. No refactoring, no new SDKs. The protocol handles it all through its standardized schema.
Enterprise-Ready Security: MCP is built on principles of permissioned access, data provenance, and auditability, which are critical for regulated enterprises that can’t risk arbitrary data calls.

The Future of AI Interoperability

MCP represents more than a new protocol; it’s the foundation of context-aware, multi-agent AI ecosystems. As more tools, databases, and APIs adopt MCP, developers will spend less time writing integration glue and more time designing intelligent behavior. Just as APIs once connected the web, MCP will connect the world of AI.

Conclusion

The Model Context Protocol is redefining how AI agents interact with the world, turning fragmented, custom integrations into clean, scalable, and reusable connections. It's the interoperability standard that will shape the next generation of intelligent systems. So the next time you think about connecting your AI to enterprise data, remember: “Don’t hardcode it. MCP it.”
At a previous company, we hit a point where our PostgreSQL database was clearly becoming a bottleneck. The primary instance was responsible for every user-facing search request, including some heavy queries from long-time users with years of historical data. CPU usage kept climbing, and before long, we were already running on the largest instance size available. Vertical scaling had officially hit its limit. Naturally, the next step was to introduce read replicas. On paper, it sounded straightforward. In reality, it wasn’t. Our application had a strict requirement: read-your-write consistency. If a user created or updated data and immediately searched for it, they expected to see it — no delays, no inconsistencies. Even a few seconds of replication lag led to confusion, support tickets, and a noticeable drop in trust. That constraint ruled out naive load balancing across replicas. In this article, I’ll walk through how we solved this by leveraging PostgreSQL’s write-ahead log (WAL) to make smarter routing decisions between the primary and replicas. The result? We reduced CPU load on the primary by 62%, kept latency predictable, and — most importantly — preserved the consistency guarantees our users relied on.

The Problem: When Read Replicas Aren't Enough

Our Initial State

Our architecture was straightforward: a single PostgreSQL 14 primary database handling all reads and writes for a user-facing application. The numbers told a concerning story:

Peak QPS (Queries Per Second): ~15,000
Average Query Latency (p50): 45ms
99th Percentile Latency: 890ms
CPU Utilization (Peak): 94%
Read/Write Ratio: 85:15

With 85% of traffic being reads, replicas seemed like the obvious answer. But our domain had a critical constraint.

The Consistency Problem

Consider this user flow:

User creates a new order at T=0
User immediately navigates to the "My Orders" page at T=2s
Application queries for the user's orders
User expects to see their new order

With asynchronous replication, the replica is typically 50–500ms behind the primary, and in the worst case the lag can stretch to several seconds, which makes for a bad customer experience. If we routed the read to a lagging replica, the user wouldn't see their just-created order. In e-commerce, this creates panic: "Did my order go through? Should I order again?" We needed a solution that would route reads to replicas only when those replicas had caught up to the user's most recent write.

Understanding PostgreSQL WAL and LSN

Before diving into the implementation, let's understand the PostgreSQL internals that make this possible.

Write-Ahead Logging (WAL)

PostgreSQL uses write-ahead logging to ensure durability. Every change to the database is first written to the WAL before being applied to the actual data files. This WAL is then streamed to replicas for replication.

Log Sequence Number (LSN)

The LSN is a pointer to a position in the WAL. It's a 64-bit integer representing the byte offset in the WAL stream. Every transaction that modifies data generates a new LSN.

SQL
-- Get the current WAL position on primary
SELECT pg_current_wal_lsn();
-- Result: 16/3D42A8B0

-- Get the last replayed position on a replica
SELECT pg_last_wal_replay_lsn();
-- Result: 16/3D42A7F0

The key insight: if a replica's replay LSN is greater than or equal to the LSN of a user's last write, that replica has all the user's data.
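As a quick standalone illustration of that insight, the snippet below asks a replica directly whether it has replayed past a captured LSN, using PostgreSQL's built-in pg_wal_lsn_diff() instead of parsing the LSN strings by hand. This is only a sketch (the article's production implementation, in Java, follows); the connection string is a placeholder assumption.

Python
# Minimal sketch: check whether a replica has caught up to a given LSN.
# Assumes psycopg2 is installed and REPLICA_DSN points at a streaming replica.
import psycopg2

REPLICA_DSN = "host=replica-1 dbname=app user=app"  # placeholder

def replica_has_caught_up(user_last_write_lsn: str) -> bool:
    with psycopg2.connect(REPLICA_DSN) as conn:
        with conn.cursor() as cur:
            # pg_wal_lsn_diff(a, b) returns a - b in bytes; >= 0 means the
            # replica's replay position is at or past the user's write LSN.
            cur.execute(
                "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), %s::pg_lsn) >= 0",
                (user_last_write_lsn,),
            )
            return cur.fetchone()[0]

# Example: route to the replica only if it holds the user's latest write,
# otherwise fall back to the primary.
# if replica_has_caught_up("16/3D42A8B0"): use_replica() else: use_primary()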
The Solution: LSN-Based Replica Routing

High-Level Architecture

The Algorithm

On write: Capture the LSN after committing a user's transaction.
Store LSN: Associate the LSN with the user (in Redis with a TTL).
On read: Retrieve the user's last-write LSN.
Route: Query replica LSNs and route to one that has caught up.
Fallback: If no replica has caught up, route to the primary.

Implementation Details

Step 1: Capturing Write LSN

After any write operation, we capture the current WAL LSN: Java @Service public class WriteTrackingService { private final JdbcTemplate jdbcTemplate; private final RedisTemplate<String, String> redisTemplate; private static final String LSN_KEY_PREFIX = "user:lsn:"; private static final Duration LSN_TTL = Duration.ofMinutes(5); @Transactional public void executeWriteAndTrackLsn(String userId, Runnable writeOperation) { // Execute the actual write writeOperation.run(); // Capture the LSN after commit String currentLsn = jdbcTemplate.queryForObject( "SELECT pg_current_wal_lsn()::text", String.class ); // Store in Redis with TTL String key = LSN_KEY_PREFIX + userId; redisTemplate.opsForValue().set(key, currentLsn, LSN_TTL); log.debug("Tracked write LSN {} for user {}", currentLsn, userId); } }

Step 2: The Replica Health Monitor

We continuously monitor replica LSN positions: Java @Component public class ReplicaLsnMonitor { private final Map<String, DataSource> replicaDataSources; private final ConcurrentHashMap<String, LsnPosition> replicaPositions = new ConcurrentHashMap<>(); @Scheduled(fixedRate = 100) // Poll every 100ms public void updateReplicaPositions() { replicaDataSources.forEach((replicaId, dataSource) -> { try (Connection conn = dataSource.getConnection(); Statement stmt = conn.createStatement(); ResultSet rs = stmt.executeQuery( "SELECT pg_last_wal_replay_lsn()::text, " + "EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::int " + "AS lag_seconds")) { if (rs.next()) { String lsn = rs.getString(1); int lagSeconds = rs.getInt(2); replicaPositions.put(replicaId, new LsnPosition( lsn, lagSeconds, Instant.now() )); } } catch (SQLException e) { log.warn("Failed to get LSN for replica {}: {}", replicaId, e.getMessage()); replicaPositions.remove(replicaId); } }); } public Map<String, LsnPosition> getCurrentPositions() { return new HashMap<>(replicaPositions); } } @Value class LsnPosition { String lsn; int lagSeconds; Instant capturedAt; }

Step 3: LSN Comparison Utility

PostgreSQL LSNs are written in the form X/YYYYYYYY, where X is the high 32 bits and Y is the low 32 bits of the 64-bit WAL position (the code below compares X first, then Y). We need to compare them properly: Java public class LsnComparator { /** * Compares two PostgreSQL LSN values.
* Returns: negative if lsn1 < lsn2, zero if equal, positive if lsn1 > lsn2 */ public static int compare(String lsn1, String lsn2) { if (lsn1 == null || lsn2 == null) { throw new IllegalArgumentException("LSN cannot be null"); } long[] parsed1 = parseLsn(lsn1); long[] parsed2 = parseLsn(lsn2); // Compare timeline first int timelineCompare = Long.compare(parsed1[0], parsed2[0]); if (timelineCompare != 0) { return timelineCompare; } // Then compare offset return Long.compare(parsed1[1], parsed2[1]); } /** * Returns true if replicaLsn >= requiredLsn */ public static boolean hasCaughtUp(String replicaLsn, String requiredLsn) { return compare(replicaLsn, requiredLsn) >= 0; } private static long[] parseLsn(String lsn) { String[] parts = lsn.split("/"); if (parts.length != 2) { throw new IllegalArgumentException("Invalid LSN format: " + lsn); } return new long[] { Long.parseLong(parts[0], 16), Long.parseLong(parts[1], 16) }; } } Step 4: The Query Router The core routing logic that brings everything together: Java @Service public class ConsistentReadRouter { private final RedisTemplate<String, String> redisTemplate; private final ReplicaLsnMonitor replicaMonitor; private final DataSource primaryDataSource; private final Map<String, DataSource> replicaDataSources; private final MeterRegistry meterRegistry; private static final String LSN_KEY_PREFIX = "user:lsn:"; // Metrics private final Counter routedToPrimary; private final Counter routedToReplica; private final Counter noLsnFound; public ConsistentReadRouter(/* dependencies */) { // Initialize dependencies... this.routedToPrimary = meterRegistry.counter("db.routing.primary"); this.routedToReplica = meterRegistry.counter("db.routing.replica"); this.noLsnFound = meterRegistry.counter("db.routing.no_lsn"); } public DataSource getDataSourceForRead(String userId) { // Step 1: Get user's last write LSN String userLsn = redisTemplate.opsForValue() .get(LSN_KEY_PREFIX + userId); if (userLsn == null) { // No recent writes - any replica is fine noLsnFound.increment(); return selectHealthiestReplica(); } // Step 2: Find a replica that has caught up Map<String, LsnPosition> positions = replicaMonitor.getCurrentPositions(); List<String> eligibleReplicas = positions.entrySet().stream() .filter(e -> LsnComparator.hasCaughtUp(e.getValue().getLsn(), userLsn)) .filter(e -> e.getValue().getLagSeconds() < 30) // Healthy threshold .map(Map.Entry::getKey) .collect(Collectors.toList()); if (eligibleReplicas.isEmpty()) { // No replica has caught up - use primary routedToPrimary.increment(); log.debug("No replica caught up for user {} (LSN: {}), routing to primary", userId, userLsn); return primaryDataSource; } // Step 3: Select replica with lowest lag String selectedReplica = eligibleReplicas.stream() .min(Comparator.comparing(id -> positions.get(id).getLagSeconds())) .orElseThrow(); routedToReplica.increment(); log.debug("Routing user {} to replica {} (user LSN: {}, replica LSN: {})", userId, selectedReplica, userLsn, positions.get(selectedReplica).getLsn()); return replicaDataSources.get(selectedReplica); } private DataSource selectHealthiestReplica() { Map<String, LsnPosition> positions = replicaMonitor.getCurrentPositions(); return positions.entrySet().stream() .filter(e -> e.getValue().getLagSeconds() < 30) .min(Comparator.comparing(e -> e.getValue().getLagSeconds())) .map(e -> replicaDataSources.get(e.getKey())) .orElse(primaryDataSource); } } Step 5: Integration with Repository Layer We created a custom annotation and aspect to make this transparent to 
developers: Java @Target(ElementType.METHOD) @Retention(RetentionPolicy.RUNTIME) public @interface ConsistentRead { String userIdParam() default "userId"; } @Aspect @Component public class ConsistentReadAspect { private final ConsistentReadRouter router; @Around("@annotation(consistentRead)") public Object routeToAppropriateDataSource( ProceedingJoinPoint joinPoint, ConsistentRead consistentRead) throws Throwable { String userId = extractUserId(joinPoint, consistentRead.userIdParam()); DataSource dataSource = router.getDataSourceForRead(userId); // Set the DataSource in thread-local context DataSourceContextHolder.setDataSource(dataSource); try { return joinPoint.proceed(); } finally { DataSourceContextHolder.clear(); } } private String extractUserId(ProceedingJoinPoint joinPoint, String paramName) { // Extract userId from method parameters MethodSignature signature = (MethodSignature) joinPoint.getSignature(); String[] paramNames = signature.getParameterNames(); Object[] args = joinPoint.getArgs(); for (int i = 0; i < paramNames.length; i++) { if (paramNames[i].equals(paramName)) { return args[i].toString(); } } throw new IllegalStateException("Could not find userId parameter: " + paramName); } } // Usage in repository @Repository public class OrderRepository { @ConsistentRead(userIdParam = "userId") public List<Order> findOrdersByUserId(String userId) { // This query will be routed to an appropriate replica return jdbcTemplate.query( "SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC", orderRowMapper, userId ); } } The Dynamic DataSource Routing To make the routing work with Spring's transaction management: Java public class RoutingDataSource extends AbstractRoutingDataSource { @Override protected Object determineCurrentLookupKey() { return DataSourceContextHolder.getCurrentDataSource(); } } public class DataSourceContextHolder { private static final ThreadLocal<DataSource> contextHolder = new ThreadLocal<>(); public static void setDataSource(DataSource dataSource) { contextHolder.set(dataSource); } public static DataSource getCurrentDataSource() { return contextHolder.get(); } public static void clear() { contextHolder.remove(); } } @Configuration public class DataSourceConfig { @Bean public DataSource routingDataSource( DataSource primaryDataSource, Map<String, DataSource> replicaDataSources) { RoutingDataSource routingDataSource = new RoutingDataSource(); Map<Object, Object> targetDataSources = new HashMap<>(); targetDataSources.put("primary", primaryDataSource); replicaDataSources.forEach(targetDataSources::put); routingDataSource.setTargetDataSources(targetDataSources); routingDataSource.setDefaultTargetDataSource(primaryDataSource); return routingDataSource; } } Handling Edge Cases Edge Case 1: Replica Lag Spikes During deployments or heavy write loads, replicas can fall behind significantly. 
We handle this with a lag threshold: Java private static final int MAX_ACCEPTABLE_LAG_SECONDS = 30; public boolean isReplicaHealthy(LsnPosition position) { // Check if position data is fresh if (position.getCapturedAt().isBefore(Instant.now().minusSeconds(5))) { return false; // Stale monitoring data } return position.getLagSeconds() < MAX_ACCEPTABLE_LAG_SECONDS; }

Edge Case 2: Redis Failures

If Redis is unavailable, we fail over to the primary: Java public DataSource getDataSourceForRead(String userId) { String userLsn; try { userLsn = redisTemplate.opsForValue().get(LSN_KEY_PREFIX + userId); } catch (RedisConnectionException e) { log.warn("Redis unavailable, failing safe to primary"); return primaryDataSource; } // ... rest of routing logic }

Edge Case 3: User's First Read (No Prior Writes)

For users without recent writes, we don't have an LSN requirement and can use any healthy replica: Java if (userLsn == null) { noLsnFound.increment(); return selectHealthiestReplica(); }

Edge Case 4: Cross-Session Consistency

The TTL on Redis keys (5 minutes) handles the case where users return after a period of inactivity. After 5 minutes, we assume any replica has caught up.

Monitoring and Observability

Comprehensive monitoring was crucial for understanding system behavior.

Key Metrics

Java @Component public class RoutingMetrics { private final MeterRegistry registry; // Routing decisions private final Counter routedToPrimary; private final Counter routedToReplica; private final Counter routedNoLsn; // Latency private final Timer routingDecisionTime; // Replica health private final Gauge replicaLagGauge; public RoutingMetrics(MeterRegistry registry, ReplicaLsnMonitor monitor) { this.registry = registry; this.routedToPrimary = registry.counter("db.routing.destination", "target", "primary"); this.routedToReplica = registry.counter("db.routing.destination", "target", "replica"); this.routedNoLsn = registry.counter("db.routing.no_user_lsn"); this.routingDecisionTime = registry.timer("db.routing.decision.time"); // Per-replica lag gauges monitor.getCurrentPositions().keySet().forEach(replicaId -> { registry.gauge("db.replica.lag.seconds", Tags.of("replica", replicaId), monitor, m -> m.getCurrentPositions() .getOrDefault(replicaId, new LsnPosition("0/0", 999, Instant.now())) .getLagSeconds() ); }); } }

Grafana Dashboard Queries

SQL -- Routing distribution over time SELECT time_bucket('1 minute', time) AS bucket, sum(case when target = 'primary' then 1 else 0 end) as primary_routes, sum(case when target = 'replica' then 1 else 0 end) as replica_routes FROM db_routing_destination GROUP BY bucket ORDER BY bucket; -- Replica lag over time SELECT time_bucket('1 minute', time) AS bucket, replica, avg(lag_seconds) as avg_lag FROM db_replica_lag_seconds GROUP BY bucket, replica ORDER BY bucket;

Alerting Rules

YAML groups: - name: read-replica-routing rules: - alert: HighPrimaryRoutingRate expr: | sum(rate(db_routing_destination{target="primary"}[5m])) / sum(rate(db_routing_destination[5m])) > 0.3 for: 10m labels: severity: warning annotations: summary: "High primary routing rate ({{ $value | humanizePercentage }})" - alert: ReplicaLagHigh expr: db_replica_lag_seconds > 30 for: 5m labels: severity: warning annotations: summary: "Replica {{ $labels.replica }} lag is {{ $value }}s"

Results and Improvements

After implementing this solution and rolling it out gradually over two weeks, we observed significant improvements.
Performance Metrics

Metric: Before → After (Improvement)
Primary CPU (Peak): 94% → 36% (62% reduction)
Primary CPU (Average): 71% → 28% (61% reduction)
Query Latency (p50): 45ms → 32ms (29% faster)
Query Latency (p99): 890ms → 245ms (72% faster)
Read QPS on Primary: ~12,750 → ~1,900 (85% offloaded)

Routing Distribution

After stabilization, our routing distribution looked like:

Routed to Replica: 82%
Routed to Primary (caught-up check failed): 11%
Routed to Primary (no LSN in Redis): 7%

The 11% that went to primary due to consistency requirements represented users actively creating/modifying data — exactly the cases where we needed strong consistency.

Cost Savings

With reduced primary load, we were able to downsize our primary instance:

Before: db.r6g.16xlarge ($6.912/hour)
After: db.r6g.4xlarge ($1.728/hour) + 2x db.r6g.2xlarge replicas ($0.864/hour each)
Net savings: ~$3.456/hour → ~50% cost reduction

Lessons Learned

1. LSN Polling Frequency Matters

Initially, we polled replica LSNs every second, which was too slow. Users making quick successive writes would get routed to the primary unnecessarily. Reducing to 100ms polling solved this while adding minimal overhead.

2. Redis TTL Tuning

Our initial 30-second TTL was too aggressive. Users navigating slowly through forms would lose their LSN tracking mid-flow. We increased to 5 minutes, which covered 99% of user sessions while still allowing eventual full replica usage.

3. Graceful Degradation is Essential

When we had a Redis cluster maintenance window, our initial implementation caused errors. Adding fallback-to-primary logic ensured zero user impact during infrastructure issues.

4. Monitor Routing Decisions, Not Just Performance

Understanding why queries went to primary vs. replica was crucial for optimization. The "no LSN found" metric helped us identify user flows where we weren't tracking writes properly.

Conclusion

In the end, this approach gave us the best of both worlds: strong consistency where it mattered and real cost savings where it didn’t. It’s not the simplest solution, and it’s definitely not something I’d recommend unless you’re already pushing PostgreSQL hard — but if you are, WAL-based routing is surprisingly effective.

References

PostgreSQL WAL Documentation
pg_current_wal_lsn Function
Streaming Replication
When a new application starts, everything feels simple: features look straightforward, the architecture is clean, and launch day goes smoothly. Then the real work begins — a steady flow of new features, refactors, and bug fixes. You need to ship critical fixes fast without revealing unfinished pages or risky changes to customers. As the scope grows — multiple teams, shared dependencies, and large features — coordination gets harder. This is where feature flags help. Feature flags reduce risk by decoupling deployment from release. Azure App Configuration and Key Vault let you control rollouts safely, target by tenant/segment, and audit changes — without exposing secrets to the browser.

In this tutorial, you’ll build a reference implementation that:

Runs safe, progressive rollouts in a React/TypeScript SPA
Targets by tenant or segment, and limits the blast radius
Audits changes and prevents config drift
Keeps all secrets server-side (no secrets in the SPA)
Supports fast rollback via labels

If that’s what you’re after, let’s dive in.

Reference Architecture (No Secrets in the SPA)

High-Level Flow

Browser SPA (React): Never calls Azure App Configuration directly.
Edge: Akamai WAF → Azure Application Gateway.
Backend “Config Proxy” service (AKS): Exposes GET /flags for the SPA. Authenticated, tenant-aware, read-only. Uses Managed Identity to read App Configuration and, if needed, resolve Key Vault references server-side only.
Azure App Configuration: Stores flags, labels (env/version), and references to Key Vault.
Azure Key Vault: Holds secrets; never returned to clients.
Redis (optional): Short-lived cache for flags to reduce latency and protect App Configuration during spikes.
Log Analytics: Audits, metrics, and rollout dashboards.

Mermaid flowchart LR WAF["Akamai WAF"] AG["Azure App Gateway"] subgraph "AKS (cluster)" SPA["SPA (React/TypeScript)"] PROXY["Config Proxy (Node/Express)"] end APPCONF["Azure App Configuration"] KV["Azure Key Vault"] REDIS["Redis (optional)"] LAW["Log Analytics Workspace"] WAF -->|"HTTP(S)"| AG AG -->|"Route: / (static)"| SPA AG -->|"Route: /flags"| PROXY SPA -->|"GET '/flags' (tenant-aware)"| PROXY PROXY -->|"Read flags via Managed Identity"| APPCONF PROXY -->|"Resolve KV ref (server-only)"| KV PROXY -->|"Cache flags (TTL 30s)"| REDIS PROXY -.->|"Cache-Control + Vary 'X-Tenant-Id'"| WAF SPA -.->|"Telemetry (perf/errors)"| LAW PROXY -.->|"Diagnostics/audit logs"| LAW

Core Principles

Minimize blast radius: Ship dark, expose to < 5% or a pilot tenant list, then ramp.
Never leak secrets: Resolve Key Vault only in the backend; the SPA receives only the booleans/strings needed for UI toggles.
Single source of truth: Flags defined in Git, synced to App Configuration via CI/CD, with drift detection.
Fast rollback: Label-based rollback to last-good config; circuit-breaker flag as a kill switch.

Hands-On Setup (Azure)

Prerequisites

Azure subscription + CLI (az)
AKS cluster with OIDC issuer enabled (for Kubernetes Workload Identity)
Resource group and VNet (recommended)
Optional: Azure Cache for Redis; Log Analytics workspace

1. Create a key vault (for any secrets your backend needs to resolve).

Create a resource group (if needed):

Shell az group create -n rg-flags -l eastus

Create the Key Vault:

Shell az keyvault create -g rg-flags -n kv-flags- --location eastus

Add a secret (example):

Shell az keyvault secret set --vault-name kv-flags- --name third-party-api-key --value "s3cr3t-value"

2. Create Azure App Configuration.
Create the store:

Shell az appconfig create -g rg-flags -n appcs-flags- -l eastus

Add a simple flag (value is JSON):

Shell az appconfig kv set \ --name appcs-flags- \ --key "flag:newCheckout" \ --label "prod" \ --value '{"enabled": false}'

Optional: Add a Key Vault reference (for server-side use only).
In App Configuration, create a key with content type: application/vnd.microsoft.appconfig.keyvaultref+json;charset=utf-8
Value: {"uri":"https://kv-flags-.vault.azure.net/secrets/third-party-api-key"}
Your backend resolves this; never surface it to the SPA.

3. Enable diagnostics to Log Analytics (for auditability).

Create a Log Analytics workspace (if you don’t have one):

Shell az monitor log-analytics workspace create -g rg-flags -n law-flags- -l eastus

Get IDs:

Shell APP_CONFIG_ID=$(az appconfig show -g rg-flags -n appcs-flags- --query id -o tsv) LAW_ID=$(az monitor log-analytics workspace show -g rg-flags -n law-flags- --query id -o tsv)

Send App Configuration diagnostic logs and metrics to Log Analytics:

Shell az monitor diagnostic-settings create \ --name appcs-diag \ --resource $APP_CONFIG_ID \ --workspace $LAW_ID \ --logs '[{"category":"ConfigurationStoreRead","enabled":true},{"category":"ConfigurationStoreWrite","enabled":true},{"category":"ConfigurationStoreSnapshot","enabled":true}]' \ --metrics '[{"category":"AllMetrics","enabled":true}]'

4. AKS identity and permissions (Workload Identity recommended).

Create a user-assigned managed identity (UAMI):

Shell az identity create -g rg-flags -n uami-flags

Grant App Configuration Data Reader on the App Configuration store to the UAMI:

Shell UAMI_ID=$(az identity show -g rg-flags -n uami-flags --query principalId -o tsv) APP_CONFIG_SCOPE=$(az appconfig show -g rg-flags -n appcs-flags- --query id -o tsv) az role assignment create \ --assignee-object-id $UAMI_ID \ --assignee-principal-type ServicePrincipal \ --role "App Configuration Data Reader" \ --scope $APP_CONFIG_SCOPE

If you will resolve Key Vault secrets server-side, grant Key Vault Secrets User (or Reader + Get secret):

Shell KV_SCOPE=$(az keyvault show -g rg-flags -n kv-flags- --query id -o tsv) az role assignment create \ --assignee-object-id $UAMI_ID \ --assignee-principal-type ServicePrincipal \ --role "Key Vault Secrets User" \ --scope $KV_SCOPE

Federate the identity with your AKS service account (Workload Identity). Get the AKS OIDC issuer:

Shell AKS_OIDC=$(az aks show -g -n --query "oidcIssuerProfile.issuerUrl" -o tsv)

Create the federated credential on the UAMI:

Shell az identity federated-credential create \ --name fc-flags \ --identity-name uami-flags \ --resource-group rg-flags \ --issuer $AKS_OIDC \ --subject system:serviceaccount:flags-namespace:flags-sa \ --audience api://AzureADTokenExchange

In your cluster, create the namespace and service account annotated for Workload Identity:

Shell kubectl create namespace flags-namespace kubectl apply -n flags-namespace -f - <<EOF apiVersion: v1 kind: ServiceAccount metadata: name: flags-sa annotations: azure.workload.identity/client-id: "$(az identity show -g rg-flags -n uami-flags --query clientId -o tsv)" EOF

5. Optional: Azure Cache for Redis.

Create a cache (Basic or Standard).
Capture the connection string (for ioredis).

6. Deploy the Config Proxy service on AKS.
Build/publish your Node/Express image.Deploy a Kubernetes Deployment and Service that: Runs with the flags-sa service account.Sets environment variables: APP_CONFIG_ENDPOINT=https://appcs-flags-.azconfig.ioFLAG_LABEL=prod (or pilot/canary)REDIS_URL=redis://:@:6380 (optional, TLS recommended)Exposes port 8080.Route/flags through App Gateway, and set Akamai behavior to cache/flags with a short TTL; respect Vary header. TypeScript Code: Config Proxy (Node/Express in AKS) Uses Managed Identity (DefaultAzureCredential) to read flags from App Configuration.Applies tenant/segment targeting server-side.Sends safe, cacheable responses with ETag and Vary headers. TypeScript // server/flags.ts import express from "express"; import crypto from "crypto"; import { AppConfigurationClient } from "@azure/app-configuration"; import { DefaultAzureCredential } from "@azure/identity"; import Redis from "ioredis"; const app = express(); const credential = new DefaultAzureCredential(); const endpoint = process.env.APP_CONFIG_ENDPOINT!; // e.g. https://myappconfig.azconfig.io const client = new AppConfigurationClient(endpoint, credential); const redis = process.env.REDIS_URL ? new Redis(process.env.REDIS_URL) : null; type Flag = { key: string; value: string; label?: string }; type PublicFlags = Record<string, boolean | string | number>; const TARGET_LABEL = process.env.FLAG_LABEL || "prod"; // prod | pilot | canary const MAX_AGE = 30; // seconds function selectForTenant(all: Flag[], tenantId: string): PublicFlags { const out: PublicFlags = {}; for (const kv of all) { // Expected JSON value: {enabled:boolean, allowTenants?:string[], percent?:number} try { const cfg = JSON.parse(kv.value) as { enabled: boolean; allowTenants?: string[]; percent?: number; // 0..100 rollout variant?: string; // optional variant name }; if (!cfg.enabled) continue; // Tenant allow-list if (cfg.allowTenants && !cfg.allowTenants.includes(tenantId)) continue; // Progressive rollout (hash-based stable bucketing) if (typeof cfg.percent === "number") { const hash = crypto.createHash("sha1").update(`${kv.key}:${tenantId}`).digest()[0]; const bucket = (hash / 255) * 100; if (bucket > cfg.percent) continue; } // Expose safe values only out[kv.key] = cfg.variant ?? 
true; } catch { /* ignore malformed */ } } return out; } app.get("/flags", async (req, res) => { const tenantId = req.header("X-Tenant-Id") || "unknown"; const cacheKey = `flags:${TARGET_LABEL}:${tenantId}`; if (redis) { const cached = await redis.get(cacheKey); if (cached) { res.set("Cache-Control", `public, max-age=${MAX_AGE}`); res.set("Vary", "X-Tenant-Id"); const etag = crypto.createHash("sha1").update(cached).digest("hex"); res.set("ETag", etag); return res.type("application/json").send(cached); } } const iter = client.listConfigurationSettings({ labelFilter: TARGET_LABEL, keyFilter: "flag:*" }); const all: Flag[] = []; for await (const s of iter) all.push({ key: s.key.replace(/^flag:/, ""), value: s.value!, label: s.label }); const filtered = selectForTenant(all, tenantId); const body = JSON.stringify(filtered); const etag = crypto.createHash("sha1").update(body).digest("hex"); res.set("Cache-Control", `public, max-age=${MAX_AGE}`); res.set("ETag", etag); res.set("Vary", "X-Tenant-Id"); if (redis) await redis.setex(cacheKey, MAX_AGE, body); res.type("application/json").send(body); }); app.listen(8080, () => console.log("Flags service running on :8080")); TypeScript Code: React Hook to Consume Flags Safely Calls/flags with tenant header.Supports optimistic default values and refresh on ETag change.Keeps UI deterministic for a user session. TypeScript // web/useFlags.ts import { useEffect, useState } from "react"; import axios from "axios"; export type Flags = Record<string, boolean | string | number>; export function useFlags(tenantId: string, defaults: Flags = {}): Flags { const [flags, setFlags] = useState<Flags>(defaults); useEffect(() => { let active = true; axios .get<Flags>("/flags", { headers: { "X-Tenant-Id": tenantId } }) .then((r) => { if (active) setFlags({ ...defaults, ...r.data }); }) .catch(() => { // fall back to defaults on errors }); return () => { active = false; }; }, [tenantId]); return flags; } // Example usage in a component // const flags = useFlags(user.tenantId, { "newCheckout": false }); // return flags["newCheckout"] ? <NewCheckout /> : <LegacyCheckout />; Progressive Delivery Patterns Targeting by tenant or segment: use allowTenants lists for pilot customers, then expand via percent rollout for scale.Blast radius control: start with internal tenants only; require a runbook (owner, success metrics, rollback label).Kill switch: a global flag (flag:kill.newCheckout) that forces fallback rendering in the SPA.Edge correctness: set Vary: X-Tenant-Id and avoid Authorization-bearing responses being cached. In Akamai, mark /flags as cacheable for short TTL with the Vary header to avoid cross-tenant bleed. Practical Governance Flag lifecycle in Git: Propose: PR adds flag spec (key, owner, intent, default, expiry date).Active: rollout plan with metrics and alert thresholds.Sunset: remove code paths and delete from App Configuration.CI/CD (GitHub Actions/Azure DevOps): Import/export kv: az appconfig kv import --source file --format json --label prodProtect prod with approvals; promote via labels (dev → pilot → prod).Drift detection: nightly job exports App Configuration and diffs with Git; alert on mismatch to Log Analytics.Auditability: Enable diagnostic logs on App Configuration; ship to Log Analytics Workspace.Tag each change with the change reason, PR link, and owner.Dashboard: correlate exposure % to error rate and LCP on the new path; auto-trigger rollback if SLO breaches. 
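The drift-detection job mentioned above can be a small script. Below is a minimal sketch, assuming the Git side of the flags lives in a JSON export at flags/prod.json (a hypothetical path) and that the live store is exported with az appconfig kv export; treat the exact CLI flags, store name, and file layout as assumptions to adapt.

Python
# Minimal drift-detection sketch. Assumptions: Git holds flags/prod.json, and
# `az appconfig kv export --format json` produces a comparable flat JSON file.
import json
import subprocess
import sys

STORE_NAME = "appcs-flags-"          # placeholder: match your store name
GIT_FILE = "flags/prod.json"         # hypothetical path in the repo
EXPORT_FILE = "/tmp/appconfig-prod.json"

def export_store() -> dict:
    # Export the prod-labeled key-values from App Configuration to a file.
    subprocess.run(
        ["az", "appconfig", "kv", "export",
         "--name", STORE_NAME, "--label", "prod",
         "--destination", "file", "--path", EXPORT_FILE,
         "--format", "json", "--yes"],
        check=True,
    )
    with open(EXPORT_FILE) as f:
        return json.load(f)

def load_git_state() -> dict:
    with open(GIT_FILE) as f:
        return json.load(f)

def diff(git_state: dict, live_state: dict) -> list:
    problems = []
    for key in sorted(set(git_state) | set(live_state)):
        if git_state.get(key) != live_state.get(key):
            problems.append(
                f"drift on '{key}': git={git_state.get(key)!r} live={live_state.get(key)!r}"
            )
    return problems

if __name__ == "__main__":
    drift = diff(load_git_state(), export_store())
    for line in drift:
        print(line)  # in the real nightly job, ship these to Log Analytics / alerting
    sys.exit(1 if drift else 0)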
Rollback Playbook

Repoint label prod to the last good snapshot or reduce the percent to 0.
Clear Redis caches; purge the Akamai cache for /flags if needed.
Announce in the channel and link to the dashboard. Post-incident, remove the problematic code path.

Flag Data Model (in App Configuration)

Keys: flag:newCheckout (value is JSON)
Label: prod|pilot|canary
Example value: {"enabled": true, "allowTenants": ["att-internal","tenantA"], "percent": 10, "variant": "v2"}

Test the Toggle Logic (Jest)

TypeScript // server/selectForTenant.test.ts test("percent rollout yields stable buckets", () => { const f = { key: "newCheckout", value: JSON.stringify({ enabled: true, percent: 10 }) }; const a = selectForTenant([f as any], "tenantA"); const b = selectForTenant([f as any], "tenantA"); expect(a["newCheckout"] === b["newCheckout"]).toBe(true); // deterministic });

What’s Next

Add per-feature metrics: flag key → route-level Core Web Vitals, conversion, and error budgets.
Automate expirations: a bot PR that removes flags past their expiry date and deletes KV entries, as sketched below.
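A minimal sketch of that expiry check, assuming each flag spec lives in Git as a JSON file with key, owner, and expiry fields (a hypothetical schema; the bot-PR and App Configuration cleanup steps are left out):

Python
# Hypothetical flag-spec schema: {"key": "newCheckout", "owner": "team-x", "expiry": "2025-06-30"}
import json
from datetime import date
from pathlib import Path

FLAG_SPEC_DIR = Path("flags/specs")  # hypothetical location of flag specs in the repo

def expired_flags(today: date) -> list:
    expired = []
    for spec_file in sorted(FLAG_SPEC_DIR.glob("*.json")):
        spec = json.loads(spec_file.read_text())
        if date.fromisoformat(spec["expiry"]) < today:
            expired.append(spec)
    return expired

if __name__ == "__main__":
    for spec in expired_flags(date.today()):
        # In the real bot this would open a PR removing the flag's code path
        # and delete the corresponding key from App Configuration.
        print(f"flag '{spec['key']}' owned by {spec['owner']} expired on {spec['expiry']}")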
Introduction

Modern banking is rapidly moving toward intelligent, context-aware automation. Instead of relying solely on static rule-based chatbots or hard-coded APIs, today’s systems need to understand intent, reason across multiple data sources, and deliver personalized insights in real time. This is where Agentic AI — autonomous systems capable of planning, reasoning, and taking action — is transforming customer interactions. In this article, we’ll explore how to build a customer service agent that leverages Amazon Bedrock and Amazon DynamoDB, powered by Boto3, to deliver dynamic responses such as:

Retrieving personalized banking offers from a knowledge base.
Detecting duplicate transactions or anomalies across accounts.
Maintaining contextual memory across multi-turn conversations for continuity.

While AWS Bedrock provides secure access to foundation models such as Claude 3 and Titan Text for reasoning and natural language understanding, DynamoDB serves as the agent’s persistent memory layer, storing query context, session history, and transaction summaries. Together, they enable the agent to recall previous interactions, infer intent, and execute logic-driven decisions — the key pillars of Agentic AI. In the sections that follow, we’ll walk through how to:

Connect to Bedrock and DynamoDB using Boto3.
Design an Agentic AI loop that queries a knowledge base, interprets results, and reasons over them.
Persist and retrieve conversation state for memory-driven behavior.
Implement practical use cases such as checking duplicate transactions and fetching bank offers with minimal latency.

By the end, you’ll have a working prototype of a Bedrock-powered AI agent that can think, remember, and respond like a human banking assistant — securely deployed on AWS and ready for enterprise-scale integration.

Abstract

This article presents a practical implementation of an Agentic AI system powered by Amazon Bedrock. It integrates a Bedrock base model (Titan) with Amazon DynamoDB to handle real-time data. The solution is showcased through a lightweight Streamlit UI, leveraging LangChain capabilities to enable context-aware conversations.

System Architecture

Fig. 1: Detailed implementation using RAG and LangGraph.

Built using Boto3, the official AWS SDK for Python, this system demonstrates how to combine foundation AI models with real-time backend data to build a contextually aware chatbot with intelligent, responsive behavior. Key components include:

Amazon Bedrock for invoking advanced LLMs like Claude and Titan on secured, managed infrastructure.
Amazon routing agents to support Retrieval Augmented Generation (RAG), allowing the agent to ground its responses in relevant, structured knowledge.
Amazon DynamoDB to ingest and serve real-time operational data, which the agent references during task execution.
Streamlit for a simple yet powerful UI that lets users interact with the agent and observe real-time data usage.
LangChain to simplify the development and deployment of LLM-powered applications.

This architecture enables the development of a responsive AI agent that can access up-to-date information during conversation or task execution — making it suitable for use cases like a support bot, a monitoring dashboard, or an operational assistant.

Tags: AI, Agentic AI, Bedrock, LangChain, DynamoDB

Architecture Overview

Fig. 2: The use case we are building.
The architecture showcases how a Customer Service Agent interacts with a cloud-native Agentic AI system built using Amazon Bedrock, DynamoDB, AWS Lambda, and Streamlit. The design focuses on enabling intelligent, real-time responses to customer queries such as “What credit offers are available for me?” or “Do I have duplicate transactions this month?” The workflow follows a modular and event-driven design, ensuring scalability, low latency, and contextual continuity. 1. Customer Service Agent Interface (Streamlit UI) The Streamlit component acts as a lightweight, browser-based UI for demonstration. The agent or end-user enters a natural-language query — e.g., “Check duplicate transactions for customer ID 12345.” Streamlit sends the request to a backend Lambda function via a REST API call. When the AI finishes reasoning, Streamlit shows the structured response (such as offers, flags, or recommended actions). This layer is intentionally simple for prototyping, but can be replaced with an internal CRM interface, chatbot, or voice assistant in production. 2. AWS Lambda — Orchestration and Reasoning Layer AWS Lambda acts as the central orchestrator connecting multiple AWS services. It receives the request from Streamlit, parses the intent, and dynamically decides the query type — offer inquiry or duplicate check.For offer lookups, Lambda queries Amazon Bedrock’s Knowledge Base (see below) to retrieve contextually relevant marketing or credit-card offers.For duplicate-transaction checks, Lambda queries Amazon DynamoDB to retrieve and compare customer transaction data.The reasoning loop (plan → query → respond) is managed entirely in Python using Boto3, allowing fine-grained control over Bedrock prompts and DynamoDB queries. This serverless pattern removes the need for long-running backend servers and scales automatically with incoming traffic. 3. Amazon Bedrock — Reasoning and Knowledge-Retrieval Engine Amazon Bedrock provides the LLM reasoning core of the architecture. The Lambda function calls a Bedrock model such as Claude 3 Sonnet or Titan Text Express through the Boto3 SDK. Bedrock’s Knowledge Base feature is configured to index and embed offer-related documents (for example, credit-card reward summaries, APR policies, or limited-time promotions). When a query like “Show me the best cashback offers for my profile” is received, Bedrock retrieves the top relevant chunks from its knowledge base and generates a natural-language summary that the agent can directly relay to the customer. This combination provides contextual retrieval + generative reasoning, enabling fact-driven yet conversational responses. 4. Amazon DynamoDB — Agent Memory and Customer Data Store DynamoDB stores all customer-specific data, such as profile details, transaction logs, and session history. It enables the AI agent to persist memory between conversations — critical for the Agentic AI loop.For duplicate-transaction detection, Lambda runs DynamoDB queries to fetch recent transactions for a given customer_id and then compares timestamps, amounts, and merchant codes to identify potential duplicates.This memory layer can also hold session context (e.g., “Customer asked for offers 5 minutes ago”) to personalize responses. DynamoDB’s low-latency read/write performance ensures near real-time responsiveness even under heavy query loads. 5. Amazon S3 — Knowledge Base Document Source Amazon S3 acts as the document repository for Bedrock’s Knowledge Base. 
Banking and marketing teams can upload offer PDFs, product sheets, and promotional documents into a designated bucket.
Bedrock continuously indexes these documents and converts them into embeddings for semantic search.
When a user requests an offer summary, Bedrock pulls the relevant vector embeddings from S3-backed content and crafts the response.

This approach provides a secure, easily maintainable knowledge source that non-technical teams can update without redeploying code.

6. Data Flow Summary

The Customer Service Agent (Streamlit) sends a query. Lambda receives it, detects intent, and routes accordingly.
For offer queries => Lambda invokes the Bedrock Knowledge Base, which fetches relevant offer content from S3.
For duplicate-transaction queries => Lambda queries DynamoDB for the customer’s recent records and uses Bedrock to summarize the analysis.

The final structured response is returned to Streamlit, completing the reasoning loop.

Advantages of This Architecture:

Serverless scalability: Fully event-driven through Lambda.
Persistent memory: DynamoDB retains agent state and customer data.
Knowledge grounding: Bedrock KB ensures responses are sourced from verified documents.
Rapid prototyping: Streamlit provides an instant front-end for demonstrations.
Separation of concerns: Reasoning (Bedrock), orchestration (Lambda), and data (DynamoDB + S3) are cleanly decoupled.

Example Scenario: A customer asks, “Can you show me current credit-card offers and check if I’ve been charged twice for the same merchant?” The agent UI forwards the query to Lambda, which invokes Bedrock to retrieve promotional data from its knowledge base and concurrently checks DynamoDB for duplicate transaction patterns. The combined response — validated offers plus a transaction check summary — is sent back instantly to the agent dashboard.

Implementation

1. Prerequisites

AWS Services: Bedrock (knowledge base enabled), DynamoDB, S3, Lambda
Python 3.9+ and dependencies: pip install boto3 streamlit
IAM Permissions: bedrock:InvokeModel, bedrock:Retrieve, dynamodb:PutItem / GetItem / Query, s3:GetObject

2. Initialize Boto3 Clients

Python import boto3 region = "us-east-1" bedrock = boto3.client("bedrock-runtime", region_name=region) dynamodb = boto3.resource("dynamodb", region_name=region) table = dynamodb.Table("CustomerTransactions") s3 = boto3.client("s3", region_name=region)

3. DynamoDB Schema Design

Table Name: CustomerTransactions
Partition Key: customer_id (String)
Sort Key: txn_id (String or UUID)

4. Querying DynamoDB

Python from boto3.dynamodb.conditions import Key def find_duplicate_txns(customer_id: str, amount: float, merchant: str): response = table.query( KeyConditionExpression=Key('customer_id').eq(customer_id) ) txns = response['Items'] duplicates = [ t for t in txns if t['merchant'] == merchant and abs(t['amount'] - amount) < 0.01 ] return duplicates

5. Retrieve Offers via the Bedrock Knowledge Base

Python import json def query_bedrock_offers(customer_profile: str, query: str): body = { "input": query, "knowledgeBaseId": "kb-1234567890", # your KB ID "retrievalConfiguration": {"type": "VECTOR"}, "modelId": "anthropic.claude-3-sonnet" } response = bedrock.invoke_model(body=json.dumps(body), modelId="anthropic.claude-3-sonnet") answer = response["body"].read().decode("utf-8") return answer
6. Combine Reasoning and Data — Lambda

Python import json def lambda_handler(event, context): body = json.loads(event['body']) query = body.get('query') customer_id = body.get('customer_id') if "offer" in query.lower(): response_text = query_bedrock_offers(customer_id, query) elif "duplicate" in query.lower(): duplicates = find_duplicate_txns(customer_id, amount=200.0, merchant="Amazon") response_text = f"Found {len(duplicates)} potential duplicate transactions." else: response_text = "Sorry, I couldn't recognize or interpret your query." return { "statusCode": 200, "body": json.dumps({"response": response_text}) }

7. Streamlit Frontend

Python import requests import streamlit as st st.title("AI-Powered Banking Customer Service Agent") customer_id = st.text_input("Customer ID", "CUST123") query = st.text_area("Enter query", "Show me available credit offers") if st.button("Ask"): payload = {"customer_id": customer_id, "query": query} resp = requests.post("https://aws-lambda-api-url", json=payload) st.success(resp.json()["response"])

Example:

The agent enters: “Check for duplicate transactions for CUST123.” Streamlit sends the query to Lambda. Lambda detects intent → runs find_duplicate_txns() → returns a summary.
The agent asks: “Now show available credit offers.” Lambda invokes query_bedrock_offers() → Bedrock retrieves context from the KB → returns the offer list.
Streamlit displays the final response in seconds.

Conclusion

Agentic AI marks a new phase of intelligent automation in the financial industry — moving from simple chatbots to autonomous reasoning systems that understand intent, retrieve verified data, and deliver actionable insights. By combining Amazon Bedrock for language reasoning, DynamoDB for persistent agent memory, and AWS Lambda for orchestration, developers can create stateful AI agents capable of handling real-world banking scenarios like offer recommendations, duplicate transaction checks, or contextual financial insights. This architecture is:

Scalable — serverless compute and on-demand storage
Secure — AWS-native IAM and encryption controls
Extensible — ready for multi-agent or event-driven workflows

As cloud-native AI continues to evolve, the integration of reasoning (Bedrock), memory (DynamoDB), and knowledge grounding (S3 + Bedrock KB) will form the backbone of next-generation AI-driven customer service platforms — empowering agents to think, remember, and respond with true contextual intelligence.
Sampling is a fundamental process in machine learning that involves selecting a subset of data from a larger dataset. This technique is used to make training and evaluation more efficient, especially when working with massive datasets where processing every data point is impractical. However, sampling comes with its own challenges. Ensuring that samples are representative is crucial to prevent biases that could lead to poor model generalization and inaccurate evaluation results. The sample size must strike a balance between performance and resource constraints. Additionally, sampling strategies need to account for factors such as class imbalance, temporal dependencies, and other dataset-specific characteristics to maintain data integrity. The most common ways to do this sampling at scale are Spark’s Scala and Python (PySpark) APIs, both part of the Apache Spark project. A common challenge with both is that, depending on how the sampling is implemented, the data being sampled can end up concentrated on a single machine, leading to memory errors. In this article, I will explain how random sampling can be achieved at scale using Scala Spark and how the central limit theorem can be extended to solve this problem.

Problem Statement

One of the most common ways to obtain a sample dataset in Spark is by using the df.sample method on a DataFrame. In Spark, a DataFrame is a distributed object that stores your dataset’s rows and enables large-scale data processing. It leverages the MapReduce paradigm to perform operations efficiently across a cluster (we’ll cover MapReduce in more detail in future articles). While the sample method is convenient, it can become problematic as your data grows. Internally, older versions of Spark could collect the relevant data onto a single machine in the cluster before applying the random sampling logic. If your grouping or sampling operation results in a large dataset, this single machine may run out of memory (an OOM error). To address this, you might consider using larger machines or increasing the available memory. However, both approaches have drawbacks:

Larger instances increase your cloud provider costs. When you use larger instances just to support sampling or a large job with skewed data, much of your cluster can sit idle, which drives costs up further. Ideal usage should be around 80-90% of the cluster for CPU and memory. I will share more details in a future article, specifically on emitting metrics from the Apache Spark cluster and how we can use them to enable alarms for various scenarios.
More memory can slow down your cluster, especially for jobs that don’t require such large resources. If we use more memory just to solve the runtime issue, some jobs in the partition might not need that much memory, and it will be underutilized even though the Spark UI shows full memory usage. You will observe that the number of CPUs used decreases in this use case, but the memory is still allocated. For an efficient job, you need your jobs to be able to work on smaller chunks of data.
It’s important to note that Spark allocates memory for the entire MapReduce job up front, so optimizing your sampling strategy is crucial for both performance and cost efficiency.

While the above two solutions work, they are sub-optimal. To mitigate this, we can borrow an idea from statistics.

Solution

To solve this problem, we can lean on a simple statistical fact closely related to the central limit theorem and the law of large numbers: if every row independently draws a uniform random number between 0 and 1 and we keep only the rows whose draw falls below the target fraction, the realized sample fraction concentrates tightly around that target as the number of rows grows, so no single machine ever needs to hold the whole dataset. This can be applied to our sampling use case here.
If we want to sample, say, 10% of the rows, then for each row we randomly pick a number between 0 and 1 and keep only the rows whose random value falls below 0.1; roughly 10% of the dataset survives, and the selection is random. Here is sample code to achieve this sampling.

Scala

import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, Encoders, Row}

object DistributedSampler {

  def distributedSample1(df: DataFrame, fraction: Double): DataFrame = {
    df.withColumn("rand_val", rand())      // add random number 0–1
      .filter(col("rand_val") < fraction)  // keep rows that fall under the threshold
  }

  // sample each partition independently
  def distributedSample(df: DataFrame, fraction: Double): DataFrame = {
    df.mapPartitions { iter =>
      val rnd = new scala.util.Random()
      iter.filter(_ => rnd.nextDouble() < fraction)
    }(Encoders.kryo[Row]) // or a Row encoder that preserves df.schema
  }
}

Test code to verify the results with Spark sampling vs. the new approach:

Scala

import org.apache.spark.sql.SparkSession

object SamplingTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SamplingTest")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val df = spark.range(0, 1000000000).toDF("id")

    // Spark built-in
    val t0 = System.currentTimeMillis()
    val s1 = df.sample(false, 0.1)
    println(s1.count())
    val t1 = System.currentTimeMillis()
    println(s"Default Spark sample time: ${t1 - t0} ms")

    // Distributed-safe
    val t2 = System.currentTimeMillis()
    val s2 = DistributedSampler.distributedSample1(df, 0.1)
    println(s2.count())
    val t3 = System.currentTimeMillis()
    println(s"Distributed sample time: ${t3 - t2} ms")

    spark.stop()
  }
}

Results

Number of rows    Spark default sampling time (ms)    New approach sampling time (ms)
10000             1071                                186
100000            1183                                224
1000000           1215                                239
10000000          1295                                243
100000000         1310                                321

From the results, we can see that the new approach performed better than Spark's default sampling. I am running this on a single-node machine, so the results might vary slightly, but the sample code shows that the approach works and is efficient.

Limitations

This solution only applies when we want a simple random sample of the entire dataset; for other sampling requirements it will not work. Because of the nature of randomness, I have seen the returned sample exceed the target size by a few records; it's not a big number, but something to be aware of. For small datasets, the relative randomness is high; sampling, say, 1000 out of 10000 records can be done directly with Spark's built-in sampling, but once the number of rows exceeds 100k, the CLT approach also works efficiently.

In the next article, we will discuss emitting metrics from Spark and how it helps in analyzing your jobs.
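Since PySpark is just as common as Scala for this kind of work, here is a minimal PySpark sketch of the same threshold idea. It mirrors distributedSample1 above and is an illustration, not part of the original benchmark.

Python

# Minimal PySpark equivalent of the threshold-based sampling shown above.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def distributed_sample(df, fraction):
    # Assign each row an independent uniform random number and keep the rows
    # that fall under the threshold; roughly `fraction` of the rows survive.
    return (
        df.withColumn("rand_val", F.rand())
          .filter(F.col("rand_val") < fraction)
          .drop("rand_val")
    )


if __name__ == "__main__":
    spark = SparkSession.builder.appName("SamplingSketch").master("local[*]").getOrCreate()
    df = spark.range(0, 1_000_000).toDF("id")
    sample = distributed_sample(df, 0.1)
    print(sample.count())  # expect roughly 100,000 rows
    spark.stop()

Because the random number is generated per row on each executor, no single machine ever has to hold the full dataset.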
In the domain of corporate marketing, “brand consistency” is the golden rule. However, enforcing that rule is often a manual, tedious nightmare. Marketing teams spend countless hours reviewing PDFs and slide decks to ensure the logo has enough padding, the fonts are correct, and the color gradients align with the style guide. For developers, this looks like a solvable problem. With the advent of multimodal large language models (LLMs) capable of processing text and images simultaneously, we can now build pipelines that “see” a document and “read” a brand rulebook to perform automated audits. Based on recent enterprise case studies, this article outlines a technical architecture for building a brand compliance tool. We will explore how to combine document parsing with LLMs to automate the design review process.

The Architecture: A Two-Stage Verification Pipeline

The core challenge in automating brand review is that a document is not just text; it is a spatial arrangement of visual elements. A standard OCR scan isn’t enough — we need to know where the logo is and what it looks like. The successful pattern relies on a Detection → Verification pipeline.

The Tech Stack

Frontend: Streamlit (for rapid internal tool development)
Backend: Python
Model: A multimodal LLM (e.g., Gemini 1.5 Pro or GPT-4o) capable of large context windows and image inputs
Data Format: Strict JSON enforcement for all LLM outputs

Stage 1: Design Element Detection

Before we can check whether a logo is correct, we have to find it. While object detection models (like YOLO) are fast, they require massive training datasets. An LLM with vision capabilities allows for zero-shot or few-shot detection using prompt engineering.

Pre-processing

If the input is a PDF, we don’t just snapshot it. We extract metadata to give the LLM “superpowers.”

Rasterize the page into a high-resolution image (for the LLM’s eyes)
Extract vector metadata (bounding boxes, text, color codes) using libraries like PyMuPDF or pdfplumber

The Detection Prompt

We feed the LLM the image of the page along with a prompt designed to categorize elements into specific buckets (e.g., logo, typography, photography).

Prompt strategy: To prevent the LLM from hallucinating output formats, we must provide a strict schema.

Python

detection_prompt = """
You are a design expert. Analyze the attached image of a marketing document.
Identify all design elements such as Logos, Text Blocks, and Images.

Return a JSON list where each item contains:
- "type": The category of the element (Logo, Typography, Iconography).
- "bounding_box": Approximate coordinates [x, y, w, h].
- "visual_description": A detailed description of the element's appearance.
- "content": (If text) The actual string content.

Constraint: Do not include markdown formatting. Return raw JSON only.
"""

By passing the pre-calculated bounding boxes from the PDF parser into the prompt context, we significantly increase the LLM’s spatial accuracy.

Stage 2: Compliance Verification

Once we have a JSON list of elements, the system allows the user to select specific elements for verification. This is where the retrieval-augmented generation (RAG) concept applies to visual rules. We cannot simply ask, “Is this compliant?” without context. We must inject the specific rules from the brand guideline book.

The Verification Logic

If the system identifies a logo, it retrieves the relevant text from the brand guidelines (e.g., “Must have 20px padding,” “Do not use on red backgrounds”).
Verification prompt pattern:

Python

verification_prompt = f"""
You are a Brand Compliance Officer.
Task: Verify if the following element adheres to the Brand Guidelines.

Input Element: {selected_element_json}
Brand Rules: {relevant_guideline_excerpt}

Instructions:
1. Compare the visual description and properties of the element against the rules.
2. Output a verdict: "Correct", "Warning", or "Error".
3. Provide a reasoning for your verdict.
4. If "Error", suggest a fix.
"""

Handling Ambiguity with Chain-of-Thought Reasoning

LLMs can struggle with subjective design rules. To mitigate this, the prompt should encourage chain-of-thought reasoning. Ask the model to list its observations before giving a verdict.

Example output:

JSON

{
  "element_id": "logo_01",
  "observation": "The logo is placed on a busy photographic background. The logo color is black.",
  "rule_reference": "Logos must use the 'Reversed White' version when placed on dark or busy images.",
  "status": "Error",
  "suggestion": "Replace the black logo with the white transparent version."
}

The Gotchas: Real-World Limitations

Building this system reveals important limitations in current AI capabilities.

1. The “Color Perception” Gap

While LLMs excel at reading text and understanding layout, they struggle with precise color analysis.

The issue: An LLM might describe “navy blue” as “black” or misinterpret a gradient angle. In early testing, accuracy for color and gradient verification often lags behind text verification (e.g., 53% accuracy for color vs. 90%+ for typography).
The fix: Don’t rely on the LLM’s eyes for color. Use Python (Pillow/OpenCV) to sample pixels programmatically and pass the HEX codes to the LLM as text metadata (a short sketch of this appears at the end of the article).

2. PDF vs. Image Inputs

Processing PDFs typically yields higher accuracy (92%+) compared to raw images (88%).

Why: PDFs contain structural data (fonts, vector paths) that provide ground truth. Images rely entirely on the model’s visual inference.
Best practice: Always prefer PDF uploads. If the user uploads a JPG, warn them that accuracy may degrade.

Conclusion

The brand compliance tool pattern demonstrates a shift in how we apply AI. We aren’t just using it to generate content — we’re using it to govern content. By combining the reasoning capabilities of multimodal LLMs with the strict logic of code-based pre-processing, developers can build tools that reduce manual review time by 50% or more. The key is not to trust the model blindly, but to guide it with structured prompts, strict JSON schemas, and programmatic data extraction.

Next Steps for Developers

Experiment with the Gemini 1.5 Pro or GPT-4o APIs for image analysis
Build a simple Streamlit app that accepts a PDF
Write a script to extract text and images from that PDF
Feed the components to the LLM and ask: “Does this look professional?”

You might be surprised by how much design sense an algorithm can have.
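To make the pre-processing step and the color fix concrete, here is a minimal sketch. It assumes PyMuPDF and Pillow are installed; the file name, DPI, and bounding box are placeholders, not values from the original case study.

Python

# Sketch only: rasterize a PDF page with PyMuPDF, then sample pixel colors inside a
# detected element's bounding box with Pillow so the LLM receives exact HEX codes
# as text metadata instead of guessing colors visually.
import fitz  # PyMuPDF
from PIL import Image

PDF_PATH = "brochure.pdf"  # placeholder input
DPI = 200

# Stage 1 pre-processing: rasterize the first page for the LLM's "eyes"
doc = fitz.open(PDF_PATH)
page = doc[0]
pix = page.get_pixmap(dpi=DPI)
pix.save("page.png")


def sample_hex_colors(image_path, bbox, points=5):
    """Sample a few pixels across the box and return their HEX codes."""
    x, y, w, h = bbox
    img = Image.open(image_path).convert("RGB")
    hexes = []
    for i in range(points):
        px = int(x + w * (i + 0.5) / points)
        py = int(y + h * (i + 0.5) / points)
        r, g, b = img.getpixel((px, py))
        hexes.append(f"#{r:02x}{g:02x}{b:02x}")
    return hexes


# The bounding box would come from the detection prompt's JSON output.
logo_bbox = [120, 80, 200, 60]  # hypothetical [x, y, w, h] in pixels
print(sample_hex_colors("page.png", logo_bbox))

The resulting HEX codes can then be appended to the verification prompt as plain text, sidestepping the color-perception gap entirely.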
Most analytics teams in companies are entering a new demand phase where stakeholders don't just want dashboards; they want a conversation that delivers fast answers, follow-ups, and breakdowns. Some examples are:

A Sr. Product Manager asks, 'Why did the churn rate increase last week?'
A Vendor Manager posts, 'Which marketplace is it happening in?'
Another person adds, 'Can we break it down by product category?'

Typically, the process looked like this:

Open the dashboard.
Interpret the data.
Export or drill down into the details.
Build a custom query.
Check whether the results match the dashboard.
Explain the outcome to the leadership if it matches.
Contact the analytics team if it does not match to reconcile the definitions.

This works, but it is slow and time-consuming. Every follow-up question restarts the cycle. Leadership expects quick answers, which is why teams turn to retrieval-augmented generation (RAG) to build chatbots or conversational analytics assistants. Reality hits after using these chatbots:

The chatbot's number doesn't match the one on the dashboard
The number returned is correct, but the metric definition it provided was wrong
It didn't filter the data according to your use case
Unfortunately, it leaks restricted information, because the retrieval wasn't entitlement-aware

These problems don't happen because RAG is bad; they happen because analytics is different. In business intelligence, the source of truth (SOT) is not documents, PDFs, or wiki pages. The SOT is the semantic layer, meaning the governed definitions of metrics, joins, dimensions, access rules, and so on that make dashboards consistent. If you are designing RAG for analytics, the single most important principle is the semantic layer first and documents second. Wondering why? In this article, I will explain why this matters, what consequences you face if you ignore it, and how to design a semantic-layer-first RAG system on Redshift with a BI modelling layer that produces results you can defend in a business review.

Analytics Isn't A Document Problem

Most RAG patterns are built for text-heavy domains such as compliance and policy documents, product manuals, runbooks, wikis, and customer support. In these cases, retrieval means "find the most relevant passages" and then generating an answer grounded in those passages. But analytics is different: it is about what the business means and what the numbers actually represent. It is about interpretation, not mere documentation. A single KPI like 'churn rate' can hide complicated logic:

Are we measuring paid subscriber churn or any customer who didn't return?
Should we exclude trials, refunds, fraud, or internal accounts?
What is the time period? 7 days, 30 days, 12 months?
Are we comparing calendar months or fiscal months?
Are cancellations counted by the effective cancellation date or the billing end date?
Is churn calculated against the beginning-of-month actives or the end-of-month actives?

None of this is captured in documents. This is often stored in:

The semantic model, which means the metric definitions
Curated tables
The Business Intelligence modelling layer
Transformation logic in ETLs/dbt models/curated data marts
Governance rules (certified datasets, access policies)

This is what happens in the semantic layer. If the chatbot/assistant answers metric-related questions from general documents first, it will eventually hallucinate definitions or apply the wrong filters/grain even if the text sounds plausible.
A Real-Life Example

In an early prototype of the assistant, a stakeholder asked for the 'monthly churn rate' and the assistant replied 4.8%, whereas my dashboard showed 3.9%. A big enough mismatch to trigger an immediate escalation. The data wasn't wrong. The mismatch came from definition drift:

The assistant referred to a wiki-style definition, which is "customers who stopped using the product in a month"
It then generated SQL from scratch and counted inactive users (adding paid + trial customers)
My dashboard churn definition was paid subscriber churn counted by effective cancellation date, excluding refunds and internal/test accounts, and normalised to the beginning-of-month active paid subscribers

The assistant's number was plausible but wrong for the business definition that people know and trust. We found the issue by tracing the query plan and noticing it hit an uncertified source and applied a different denominator than our BI model. That incident changed our thinking and design goal from 'answer fast' to 'answer defensibly'.

What Breaks When You Index Docs First

1. KPI Hallucination

User prompt: 'What is the churn for us?'

A doc-first system might respond with a generic definition, such as "customers who stop using the product monthly." But in your use case, churn may exclude trials, use cancellation dates vs. billing end dates, and be defined at the account level. The answer looks right, but it doesn't match the dashboard, and now you have created confusion instead of clarity.

2. Filter Drift

User prompt: 'Revenue last quarter in North America?'

The chatbot returns a number, but it's global because 'NA' wasn't included in the query plan. This is common when the system summarises a document that describes revenue but doesn't reliably bind regional filters to execution.

3. Grain Mismatch

'This week' might mean calendar week or fiscal week. 'Customers' might mean accounts or individuals. A docs-first system may pick a grain that reads nicely rather than the one your reporting uses.

4. Join Errors

Even if the assistant generates SQL, incorrect joins can inflate metrics: many-to-many joins, duplicate dimension keys, wrong snapshot tables, or slowly changing dimension surprises.

5. Security Leakage

If your vector index retrieves chunks the user shouldn't see, the model can leak sensitive context, even indirectly. Analytics assistants must treat access control as a first-class design input and not a post-processing step.

Semantic layers exist specifically to prevent these mistakes.

What Does The Semantic Layer Really Do

The semantic layer is the governed bridge between raw data and business meaning. Different organisations implement it in different ways, like in the BI modelling layer, dbt semantic models, or data marts, but the responsibilities are consistent and remain the same:

Metric definitions: formula, filters, grain, owners, versioning
Dimensions and relationships: group-bys, allowed joins
Certified sources: datasets or views approved for analytics
Policy bindings: row- and column-level security rules

If your assistant answers metric questions without consulting this layer, then it is ungoverned.

The Semantic-Layer-First RAG Pattern

A practical way to implement this is a 3-lane router.

Lane 1: Definition Lane (Semantic Layer)

For questions like 'How do we calculate churn rate?' or 'What is the definition of active user?', return metric cards that include the definition, version, owner, and examples (a sample metric card sketch appears at the end of this article).
Lane 2: Data Lane (Semantic Layer -> Governed Query -> Redshift)

For questions like 'churn rate last month in NA by product group' or 'monthly active user trend over the last 6 months', return the results, an audit bundle, and, optionally, a SQL snippet.

Lane 3: Narrative Lane (Documents)

For questions like 'What changed last week that could explain this?' or 'Any launches/incidents that impacted checkout?', return a doc-grounded summary.

The key idea is that numeric answers must originate from the semantic layer and governed queries, not from random docs.

Code: Routing and Governed Query Rendering

Below is simplified sample code. The important constraint is that the model doesn't invent tables or joins. It selects from certified metric templates.

Python

from dataclasses import dataclass

@dataclass
class QueryPlan:
    metric_id: str
    metric_version: str
    grain: str        # "month"
    month_type: str   # "calendar" or "fiscal" (semantic default)
    filters: dict     # e.g., {"region": "NA", "plan_tier": "Pro"}
    dimensions: list  # e.g., ["plan_tier"]
    sql: str          # rendered SQL

def answer(question: str, user_ctx: dict):
    intent = route_intent(question)              # definition | data | narrative | hybrid
    entities = extract_entities(question)        # metric, dims, time, filters
    metric_card = get_metric_card("churn_rate")  # definition + version + owner
    model_ctx = get_modeling_rules(entities.dimensions)
    policy_ctx = get_access_policy(user_ctx)

    if intent == "definition":
        return render_definition(metric_card)

    plan = build_governed_plan(metric_card, model_ctx, policy_ctx, entities)
    plan.sql = render_sql_template(plan)
    rows = run_redshift(plan.sql, user_ctx)      # RLS enforced by views/roles
    checks = sanity_checks(rows, plan)
    return render_answer(rows, plan, checks)

Sample SQL template:

SQL

-- Metric: churn_rate v7 (certified)
-- Definition: paid subscriber churn = cancels_effective_in_month / begin_month_active_paid
-- Grain: month (semantic default: fiscal_month)
WITH monthly AS (
  SELECT
    month_key,
    plan_tier,
    SUM(begin_month_active_paid)    AS denom_begin_active,
    SUM(cancels_effective_in_month) AS num_cancels
  FROM certified.subscriber_churn_mart
  WHERE month_key BETWEEN :m_start AND :m_end
    AND region = :region
    AND is_internal_account = false
  GROUP BY 1, 2
),
rates AS (
  SELECT
    month_key,
    plan_tier,
    CASE
      WHEN denom_begin_active = 0 THEN NULL
      ELSE num_cancels::decimal(18,6) / denom_begin_active
    END AS churn_rate
  FROM monthly
),
mom AS (
  SELECT
    month_key,
    plan_tier,
    churn_rate,
    churn_rate - LAG(churn_rate) OVER (PARTITION BY plan_tier ORDER BY month_key) AS churn_rate_mom_delta
  FROM rates
)
SELECT * FROM mom
ORDER BY month_key, plan_tier;

The Audit Bundle I Would Ship

When the chatbot returns a number, it should return a 'proof capsule' that stakeholders can trust. For example:

"Paid subscriber churn rate in NA increased to 3.9% in Dec 2025, up +0.6pp month over month. The increase happened in the Pro tier (+0.8pp), especially in Category X (+0.5pp)."

Audit bundle:

Metric: churn_rate v7
Numerator: cancels_effective_in_month
Denominator: begin_month_active_paid
Month type: fiscal_month
Filters: region=NA, is_internal_account = false
Source: certified.subscriber_churn_mart
Freshness: loaded_at=2025-12-30 02:05 UTC
Query: stored link/id

This is what makes the assistant usable in a business review.

Security Constraints You Must Respect And Keep In Mind

Analytics chatbots/assistants are very risky since they combine language with data access.
A few non-negotiables are:

Enforce access at execution time (Redshift roles; RLS via certified views)
Restrict queries to certified marts/explores only
Retrieval must be entitlement-aware
Log the chain: question -> plan -> query id -> dataset

Conclusion

A semantic-layer-first RAG system doesn't replace dashboards. It brings the dashboard's truth into a trustworthy conversation. As the churn example showed, definitions, time semantics, and denominators matter. If your assistant can answer 'What is churn?' and 'Is churn increasing MoM?' with a definition, a governed query, and an audit bundle, you have moved from 'chat about data' to 'trustworthy analytics in dialogue.'
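As referenced in the Definition Lane, here is a minimal sketch of what a metric card might look like. The fields mirror the churn_rate example used throughout this article; the exact structure is illustrative, not a prescribed schema from any specific semantic-layer product.

Python

# Illustrative metric card for the Definition Lane; field names follow the
# churn_rate example in this article.
from dataclasses import dataclass, field

@dataclass
class MetricCard:
    metric_id: str
    version: str
    owner: str
    definition: str
    numerator: str
    denominator: str
    grain: str
    month_type: str
    default_filters: dict = field(default_factory=dict)
    certified_source: str = ""

churn_rate_card = MetricCard(
    metric_id="churn_rate",
    version="v7",
    owner="subscriptions-analytics",
    definition="Paid subscriber churn = cancels_effective_in_month / begin_month_active_paid",
    numerator="cancels_effective_in_month",
    denominator="begin_month_active_paid",
    grain="month",
    month_type="fiscal",
    default_filters={"is_internal_account": False},
    certified_source="certified.subscriber_churn_mart",
)

# get_metric_card("churn_rate") in the routing code above could simply look this
# card up from a registry keyed by metric_id.

Keeping these cards versioned and owned is what lets the assistant quote a definition, a source, and a denominator in the audit bundle.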
You’ve likely heard about the benefits of partitioning data by a single dimension to boost retrieval performance. It’s a common practice in relational databases, NoSQL databases, and, notably, data lakes. For example, a very common dimension to partition data in data lakes is by date or time. However, what if your data querying requirements involve multiple dimensions? Let’s say you wish to query your data by field A and also by field B, or sometimes by field A but other times by field B. In this post, I’ll go over several common options for such a case. For convenience, I’ll give examples of how to implement it in the data lake using standard folder names and Parquet to store the data. You should know, however, that the paradigms are also valid for other areas like DBs, NoSQL DBs, memory storage, and so on.

The Default: Micro-Partitions

Micro-partitions are a technique used to sub-partition data within a dataset. Each micro-partition contains metadata for individual fields, providing valuable information for optimizing data consumption performance. For instance, consider a scenario where data is organized into daily partitions stored in Parquet files.

Plain Text

<dataset_root>/day=20240101/data1.parquet
<dataset_root>/day=20240101/data2.parquet
<dataset_root>/day=20240101/data3.parquet

In this setup, each Parquet file (or even each page within a Parquet file) can be referred to as a micro-partition. Parquet files inherently store metadata per file and per page, which can enhance data consumption performance. Snowflake also employs micro-partitions by default, but it uses richer metadata and superior indexing capabilities compared to simple Parquet files. This enhanced metadata and indexing within Snowflake’s micro-partitions contribute to significant performance gains, making micro-partitions a highly beneficial feature within the platform.

The Obvious Approach: Nested Partitions

Let’s start with nested partitions. In a typical Hive partition structure, it looks like this:

Plain Text

<dataset_root>/<FieldA>=<value>/<FieldB>=<value>/data.parquet

While this works well for consistent queries involving both field A and field B, it falls short when you need the flexibility to query either field separately. For instance:

Scala

// Good for this:
Spark.sql("select * from my_data_set where FieldA=11 and FieldB=22");

// Not so good for this:
Spark.sql("select * from my_data_set where FieldA=11");
Spark.sql("select * from my_data_set where FieldB=22");

The reason is that for the second type of query, all partitions need to be scanned, which defeats the purpose of partitioning.

The Opposite Approach: Data Duplication With Separate Partitions

Another approach involves duplicating the data and partitioning it once by field A and once by field B. The directory structure in a Hive-like layout might look like this:

Plain Text

<dataset_root>/<FieldA>=<value>/data.parquet

and

Plain Text

<dataset_root>/<FieldB>=<value>/data.parquet

It represents the opposite of the previous option, meaning:

Scala

// Good for this:
Spark.sql("select * from my_data_set where FieldA=11");
Spark.sql("select * from my_data_set where FieldB=22");

// Not good for this:
Spark.sql("select * from my_data_set where FieldA=11 and FieldB=22");

Also, maintaining data consistency becomes more challenging in this scenario.

Best of Both Worlds? Partitioning by Field A + Externally Indexing by Field B

This is a widely adopted strategy in databases.
The advantage here is that the index serves as a reference to the data, not a copy of it. In the data lake world, it means partitioning the data by field A, the same as before.

Plain Text

<dataset_root>/<FieldA>=<value>/data.parquet

And in addition, maintaining a slim dataset that references the same data files by field B values. In data lakes, it’s possible to implement this yourself, though it's usually done with an additional data catalog. This is also one of the advantages of using lakehouses (like the Databricks data lakehouse), since you get it out of the box (a PySpark sketch of a do-it-yourself index appears after the conclusion below). It’s ideal for cases where you need to query the data based on specific values for field B.

Scala

Spark.sql("select * from my_data_set where FieldB=22");

However, it’s less suitable for queries involving a range of values for field B.

Scala

Spark.sql("select * from my_data_set where FieldB>22");

The reason it is not as useful is that the indexed keys are not stored in a contiguous manner on the machine, the way partitions usually are.

Often Useful: Partitioning by Field A + Clustering or Sorting by Field B

This is an improvement over the micro-partitions approach. Here, we partition the data by field A as you normally would, but ensure that within each partition, the data is clustered by field B. Here is one example of how to implement it using Spark:

Scala

// repartition so the data is organized by field A and, within that, by field B
val sortedDF = df.repartitionByRange($"fieldA", $"fieldB")

// then write the data in a partitioned manner
sortedDF.write
  .mode(SaveMode.Overwrite)
  .partitionBy("fieldA")
  .parquet("/dataset_root")

In the example above, data will be written partitioned by field A. But within each partition, the data will also be divided into files (micro-partitioned) by field B. The technologies used need to support this, of course. In the case of Parquet, it works well, since Parquet holds metadata for each field, which includes min and max values. Most technologies (like Apache Spark) take this into account, so they can skip files that do not include the required value for field B. This is a solid choice for various use cases, while it is not the best approach for queries like this:

Scala

Spark.sql("select * from my_data_set where FieldB=22");

or

Scala

Spark.sql("select * from my_data_set where FieldB>22");

since it means going over all partitions. However, since the data is grouped by field B within the partitions, at least some of the files may be skipped. This approach is particularly useful when field B contains a wide range of possible values (high cardinality). It can also be a beneficial design when field B’s values are unevenly distributed (skewed). This is why this paradigm is very common across technologies, for example: clustering in BigQuery, sort keys in DynamoDB, clustering within micro-partitions in Snowflake, and so on.

The Secret Weapon: Z-Order

A less common but important option is to index or partition by a Z-order. In this case, the data is also sorted, but instead of being partitioned by field A and sorted by field B, it is sorted by a key that is a composite of both fields A and B:

Plain Text

<dataset_root>/<A combination of FieldA+FieldB>

This method is actually ideal for all of the query types mentioned so far. The secret is in how the two fields are combined: it ensures that keys with similar values are stored in proximity to one another, and this holds true for both fields that make up the partition key.
So, no matter whether you’re retrieving data based on one field or both, and whether you need a precise value or a range of values, this method will help. Also, like the previous method, it handles high cardinality and skew well. The implementation, though, is not very common and is quite complex. Currently, the most common implementations come from hosted environments like the Databricks Lakehouse.

Conclusion

Choosing the right strategy for multi-dimensional data querying depends on your specific use case. Each approach has its strengths and trade-offs. Whether you go for nested partitions, data duplication, external indexing, sorting, or Z-ordering, understanding these strategies equips you to make informed decisions based on your data lake architecture and querying needs.
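As promised above, here is a minimal PySpark sketch of the do-it-yourself external index: partition by field A as usual, then build a slim dataset that maps field B values to the files that contain them. The paths and column names (fieldA, fieldB, /dataset_root, /dataset_index) are placeholders; production setups usually rely on a data catalog or a lakehouse feature instead of hand-rolling this.

Python

# Sketch of the external-index pattern, assuming a Spark environment and placeholder paths.
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.appName("ExternalIndexSketch").getOrCreate()
df = spark.read.parquet("/source_data")  # placeholder source

# 1. Write the main dataset partitioned by field A, as before.
df.write.mode("overwrite").partitionBy("fieldA").parquet("/dataset_root")

# 2. Build a slim index: for each fieldB value, record which files contain it.
indexed = (
    spark.read.parquet("/dataset_root")
    .select("fieldB", input_file_name().alias("file_path"))
    .distinct()
)
indexed.write.mode("overwrite").partitionBy("fieldB").parquet("/dataset_index")

# 3. To answer "FieldB = 22", look up the files first, then read only those files.
#    (Assumes at least one matching file exists.)
files = [
    r.file_path
    for r in spark.read.parquet("/dataset_index").filter("fieldB = 22").collect()
]
result = spark.read.parquet(*files).filter("fieldB = 22")
result.show()

The index stays tiny relative to the data, which is exactly the point: it references files rather than copying rows, at the cost of an extra lookup step and the maintenance of the index itself.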
I've been wrangling React codebases professionally for well over ten years now, and honestly, the story is always the same in 2026: teams inherit these massive, everything-in-one-place apps built back when Create React App felt like the future. All the logic — auth, shopping cart, product lists, user profiles — lives in a handful of giant files. Props get drilled six levels deep, state is scattered, and nobody wants to touch it because one wrong move brings the whole thing down.

Last year, I led a refactor on a five-year-old dashboard exactly like that. We managed to break it into proper feature slices and even laid the groundwork for microfrontends. The thing that made the biggest difference? A multi-agent AI setup that did a lot of the heavy lifting for us. It wasn't magic — it still needed human eyes — but it turned a three-month nightmare into something we wrapped up in five weeks.

In this piece, I'll walk you through how I built that system. We'll take a messy little React monolith (the kind you see everywhere) and let a team of AI agents analyze it, plan the refactor, write the new modular code, add tests, and review everything. We'll use LangGraph to orchestrate the agents and Claude 3.5 Sonnet as the LLM (though GPT-4o works fine too).

What You'll Need

Nothing exotic:

Node 20+ and your package manager of choice.
Python for the agent orchestration (LangChain/LangGraph live there — it's still the most reliable option).
An Anthropic API key (or OpenAI). Just export it as ANTHROPIC_API_KEY.
Git and VS Code. I lean heavily on the Cursor extension these days for quick diff reviews.

Grab the sample app we'll be working with — a tiny e-commerce dashboard where login, product list, and cart are all crammed into src/App.js. It's deliberately ugly, but painfully realistic. Here's the heart of the mess:

JavaScript

import React, { useState } from 'react';
import './App.css';

function App() {
  const [user, setUser] = useState(null);
  const [cart, setCart] = useState([]);
  const [products] = useState([{ id: 1, name: 'Widget', price: 10 }]);

  const login = (username, password) => {
    if (username === 'admin') setUser({ username });
  };

  const addToCart = (product) => {
    setCart([...cart, product]);
  };

  return (
    <div className="App">
      {!user ? (
        <form onSubmit={(e) => {
          e.preventDefault();
          login(e.target.username.value, e.target.password.value);
        }}>
          <input name="username" placeholder="Username" />
          <input name="password" type="password" />
          <button>Login</button>
        </form>
      ) : (
        <>
          <h1>Welcome, {user.username}</h1>
          <div>
            <h2>Products</h2>
            {products.map(p => (
              <div key={p.id}>
                {p.name} - ${p.price}
                <button onClick={() => addToCart(p)}>Add to Cart</button>
              </div>
            ))}
          </div>
          <div>
            <h2>Cart ({cart.length})</h2>
            {/* cart items would go here */}
          </div>
        </>
      )}
    </div>
  );
}

export default App;

You get the idea: everything lives in one component, auth is fake and insecure, no routing, no code splitting.

Why Legacy React Apps Are Such a Pain

Most big companies are still running apps that started life pre-React 18. Giant components, prop drilling everywhere, bundle sizes that make mobile users cry. Adding a new feature means touching half the codebase and praying the tests (if they exist) still pass. Agentic workflows help because they can read the whole thing at once, spot patterns we miss when we're deep in the weeds, and churn out consistent modular code faster than any human could.
The Agent Team

I run five specialized agents that hand work off to each other (a minimal LangGraph wiring sketch for them appears below, where we run the workflow):

Analyzer – reads the code and produces a structured report.
Planner – turns that report into concrete steps.
Coder – writes the actual refactored files.
Tester – generates meaningful tests.
Reviewer – catches anything that slipped through.

The Analyzer we already made pretty thorough in the last version. Let's spend more time on the two that do the real work: Coder and Tester.

Coder Agent

This is the one that actually moves code around. I've learned the hard way that vague prompts lead to broken imports and forgotten lazy loading, so I lock it down pretty tight. Here's the system prompt I use:

Python

coder_prompt = ChatPromptTemplate.from_messages([
    ("system", """You're a senior React engineer whose specialty is cleaning up old monoliths.
Implement the refactor plan exactly—no creative detours.

Rules I always follow:
- Functional components and hooks only.
- Feature-sliced layout: src/features/auth/, src/features/products/, src/features/cart/
- React Router v6+ with proper <Routes> and <Route>
- Every route component wrapped in React.lazy() + Suspense for code splitting
- Shared state lives in dedicated contexts under src/context/
- Forms are fully controlled (no e.target.username nonsense)
- Components stay small and focused
- Relative imports must be correct in the new structure
- Don't add new dependencies unless the plan explicitly says so

Output must be a JSON object: keys are full file paths, values are complete file contents.
Include every new or changed file. Nothing else."""),
    ("user", """Analysis JSON: {analysis_json}

Original files: {original_files}

Plan: {plan}""")
])

Tester Agent

Good tests are what keep me from losing sleep after a refactor. The tester prompt forces realistic RTL/Jest tests:

Python

tester_prompt = ChatPromptTemplate.from_messages([
    ("system", """You're a frontend testing specialist. Write clean, useful tests with React Testing Library and Jest.

For every important new or changed component:
- Test rendering and key interactions
- Use proper roles and accessible queries
- Mock contexts when needed
- Include at least one error/empty state test where it makes sense
- Keep tests focused—aim for meaningful coverage, not 100% theater

Output JSON: keys are test file paths (e.g. src/features/auth/LoginForm.test.jsx), values are full test files."""),
    ("user", "Refactored files: {refactored_files}")
])

What Happens When We Run It

Feed the original App.js into the workflow. The Analyzer spots the usual suspects — high-severity coupling, oversized component, no code splitting, insecure auth — and gives us a nice JSON plan.
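Before looking at the Coder output, here's roughly how the five agents get wired together with LangGraph. This is a minimal sketch with stubbed node bodies (the prompt formatting and LLM calls are elided), not the production graph from the project.

Python

# Minimal LangGraph wiring sketch for the five agents; node bodies are hypothetical stubs.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class RefactorState(TypedDict):
    original_files: dict
    analysis: str
    plan: str
    refactored_files: dict
    tests: dict
    review_notes: str

def analyzer(state: RefactorState) -> dict:
    # In the real workflow: call the LLM with the analyzer prompt and the original files.
    return {"analysis": "...structured report..."}

def planner(state: RefactorState) -> dict:
    return {"plan": "...concrete steps..."}

def coder(state: RefactorState) -> dict:
    # Uses coder_prompt from above; returns a path -> file-contents mapping.
    return {"refactored_files": {"src/AppRoutes.jsx": "..."}}

def tester(state: RefactorState) -> dict:
    # Uses tester_prompt from above.
    return {"tests": {"src/features/auth/LoginForm.test.jsx": "..."}}

def reviewer(state: RefactorState) -> dict:
    return {"review_notes": "APPROVED"}  # or a list of requested tweaks

graph = StateGraph(RefactorState)
for name, fn in [("analyzer", analyzer), ("planner", planner),
                 ("coder", coder), ("tester", tester), ("reviewer", reviewer)]:
    graph.add_node(name, fn)

graph.set_entry_point("analyzer")
graph.add_edge("analyzer", "planner")
graph.add_edge("planner", "coder")
graph.add_edge("coder", "tester")
graph.add_edge("tester", "reviewer")
# Loop back to the Coder when the Reviewer asks for tweaks, otherwise finish.
graph.add_conditional_edges(
    "reviewer",
    lambda s: "done" if s["review_notes"] == "APPROVED" else "fix",
    {"fix": "coder", "done": END},
)

app = graph.compile()
result = app.invoke({"original_files": {"src/App.js": "...monolith source..."}})

The reviewer-to-coder loop is what produces the "one or two small tweaks" iterations described below.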
Coder takes that plan and produces things like:

A proper LoginForm.jsx with controlled inputs
Separate ProductsList.jsx and Cart.jsx
Context providers for auth and cart
An AppRoutes.jsx that looks roughly like this:

JavaScript

import React, { Suspense } from 'react';
import { BrowserRouter, Routes, Route, Navigate } from 'react-router-dom';

const LoginForm = React.lazy(() => import('./features/auth/LoginForm'));
const ProductsList = React.lazy(() => import('./features/products/ProductsList'));
const Cart = React.lazy(() => import('./features/cart/Cart'));

function AppRoutes() {
  return (
    <BrowserRouter>
      <Suspense fallback={<div>Loading...</div>}>
        <Routes>
          <Route path="/login" element={<LoginForm />} />
          <Route path="/products" element={<ProductsList />} />
          <Route path="/cart" element={<Cart />} />
          <Route path="*" element={<Navigate to="/login" />} />
        </Routes>
      </Suspense>
    </BrowserRouter>
  );
}

export default AppRoutes;

Tester then writes solid tests — one of my favorites from a real run:

JavaScript

import { render, screen, fireEvent } from '@testing-library/react';
import LoginForm from './LoginForm';
import { AuthContext } from '../../context/AuthContext';

const renderWithContext = (ui, { user = null, login = jest.fn() } = {}) => {
  return render(
    <AuthContext.Provider value={{ user, login }}>
      {ui}
    </AuthContext.Provider>
  );
};

test('submits credentials correctly', () => {
  const mockLogin = jest.fn();
  renderWithContext(<LoginForm />, { login: mockLogin });

  fireEvent.change(screen.getByPlaceholderText('Username'), { target: { value: 'admin' } });
  fireEvent.change(screen.getByLabelText(/password/i), { target: { value: 'secret' } });
  fireEvent.click(screen.getByRole('button', { name: /login/i }));

  expect(mockLogin).toHaveBeenCalledWith('admin', 'secret');
});

The Reviewer usually asks for one or two small tweaks (like adding a redirect after login), we loop back to Coder, and we're done.

Running the Tests and Shipping

npm test on the generated suite usually passes after the first or second iteration. Bundle size drops noticeably once the lazy loading is in place. I still review every diff in Cursor — AI doesn't get a free pass — but the volume of clean, consistent code it produces is night-and-day compared to doing it all manually.

Lessons From the Trenches

The detailed, structured prompts are what make this actually usable in real projects. Loose instructions = chaos. JSON output with file paths = easy automation. We've used this pattern on much larger apps (10–15k lines) and consistently needed only minor manual fixes afterward.

Important Caveats If You're Thinking of Running This on Your Own Monolith

Look, this setup works great on small-to-medium apps (a few hundred to a couple thousand lines), and it's a fantastic way to prototype a refactor or clean up a prototype. But before you point it at your company's million-line dashboard, here are the realities I've run into:

Token limits are real. Even Claude 3.5's 200k context window fills up fast on anything bigger than a modest app. You'll need to chunk the codebase — feed in one feature or directory at a time — or build smarter retrieval tools (like vector search over your repo). Full-app refactors in one shot just aren't feasible yet.
Hallucinations and subtle bugs happen. The agents are good, but they can invent imports that don't exist, miss edge cases in business logic, or subtly change behavior. Never merge without a thorough human diff review.
In our bigger projects, we treat the AI output as a very smart PR draft, not final code.
Costs add up. Running multiple agents with long contexts on a large codebase can burn through hundreds of dollars in API credits quickly. Start small and monitor usage.
Non-code concerns get ignored. Package.json changes, build config, environment variables, and custom webpack setups — these agents won't touch them unless you explicitly add tools for it.
It's best for mechanical refactors. Extracting components, adding routing, introducing contexts, code splitting — these are where it shines. Complex domain logic migrations or performance optimizations still need heavy human involvement.
Top-tier companies are experimenting, not relying. Places like Meta, Google, and Amazon are piloting agentic workflows internally, but they're wrapping them in heavy guardrails, custom retrieval systems, and mandatory review gates. Full autonomy on critical monoliths isn't happening yet — think 30–50% productivity boost on targeted tasks, not full replacement.

Use this as an accelerator, not a silver bullet. Start with one bounded feature, let the agents propose the changes, review and tweak, then expand. That's how we've gotten real wins without disasters.

Wrapping Up

If you're staring at a legacy monolith right now, give this approach a shot. It's not about replacing engineers — it's about letting us focus on the hard problems instead of endless boilerplate and busywork. I'd love to hear what your biggest React refactor headache is at the moment. Drop it in the comments — maybe we can figure out how to tackle it next.

Happy (and much less painful) refactoring!
Agentic AI is changing how enterprises think about automation and intelligence. Agents are no longer reactive systems. They are goal-driven, context-aware, and capable of autonomous decision-making. But to operate effectively, agents must be connected to the real-time pulse of the business. This is where data streaming with Apache Kafka and Apache Flink becomes essential. Apache Flink is entering a new phase with the proposal of Flink Agents, a sub-project designed to power system-triggered, event-driven AI agents natively within Flink’s streaming runtime. Let’s explore what this means for the future of agentic systems in the enterprise.

The State of Agentic AI

Agentic AI is no longer experimental. It is still in the early stage of the adoption lifecycle, but it is starting to move into production for the first critical use cases. Agents today are expected to:

Make real-time decisions
Maintain memory across interactions
Use tools autonomously
Collaborate with other agents

But these goals face real infrastructure challenges. Existing frameworks like LangChain or LlamaIndex are great for prototyping. But (without the help of other tools) they are not designed for long-running, system-triggered workflows that need high availability, fault tolerance, and deep integration with enterprise data systems. The real problem is integration. Agents must operate on live data, interact with tools and models, and work across systems. This complexity demands a new kind of runtime — one that is real-time, event-driven, and deeply contextual.

Open Standards and Protocols for Agentic AI: MCP and A2A

Standards are emerging to build scalable, interoperable AI agents. Two of the most important are:

Model Context Protocol (MCP) by Anthropic: A standardized interface for agents to access context, use tools, and generate responses. It abstracts how models interact with their environment, enabling plug-and-play workflows.
Agent2Agent (A2A) protocol by Google: A protocol for communication between autonomous agents. It defines how agents discover each other, exchange messages, and collaborate asynchronously.

These standards help define what agents do and how they do it. But protocols alone are not enough. Enterprises need a runtime to execute these agent workflows in production — with consistency, scale, and reliability. This is where Flink fits in. Similar to microservices a decade ago, Agentic AI and protocols like MCP and A2A risk creating tight coupling and point-to-point spaghetti architectures if used in isolation. An event-driven data streaming backbone ensures these standards deliver scalable, resilient, and governed agent ecosystems instead of repeating past mistakes.

The Role of Data Streaming for Agentic AI with Kafka and Flink

Apache Kafka and Flink together form the event-driven backbone for Agentic AI.

Apache Kafka provides durable, replayable, real-time event streams. It decouples producers and consumers, making it ideal for asynchronous agent communication and shared context.
Apache Flink provides low-latency, fault-tolerant stream processing. It enables real-time analytics, contextual enrichment, complex event processing, and now — agent execution.

Agentic AI requires real-time data ingestion to ensure agents can react instantly to changes as they happen. It also depends on stateful processing to maintain memory across interactions and decision points. Coordination between agents is essential so that tasks can be delegated, results can be shared, and workflows can be composed dynamically.
Finally, seamless integration with tools, models, and APIs allows agents to gather context, take action, and extend their capabilities within complex enterprise environments. Apache Flink provides all of these natively. Instead of stitching together multiple tools, Flink can host the entire agentic workflow:

Ingest event streams from Kafka
Enrich and process data with Flink’s Table and DataStream APIs
Trigger LLMs or external tools via UDFs (a minimal UDF sketch appears later in this article)
Maintain agent memory with Flink state
Enable agent-to-agent messaging using Kafka or Flink’s internal mechanisms

A2A and MCP with Apache Kafka and Flink

A2A and MCP define how autonomous agents communicate and access context. With Kafka as the event broker and Flink as the stream processor, enterprises can build scalable, decoupled, and context-aware Agentic AI systems. Agents can still communicate point-to-point with each other via protocols like A2A. But their information must often reach many systems, and multiple point-to-point links create brittle architectures. Kafka solves this by acting as the scalable, decoupled event backbone. More details: Agentic AI with the Agent2Agent Protocol (A2A) and MCP using Apache Kafka as Event Broker.

FLIP-531: Building and Running AI Agents in Flink

While this is already possible to implement today, Flink still needs more built-in support for standard protocols like MCP and A2A, along with native AI and ML capabilities, to fully meet the demands of enterprise-grade agentic systems. FLIP-531: Initiate Flink Agents as a new Sub-Project is an exciting Flink Improvement Proposal led by Xintong Song, Sean Falconer, and Chris Meyers. It introduces a native framework for building and running AI agents within Flink.

Source: Apache

Key Objectives

Provide an execution framework for event-driven, long-running agents
Integrate with LLMs, tools, and context providers via MCP
Support agent-to-agent communication (A2A)
Leverage Flink’s state management as agent memory
Enable replayability for testing and auditing
Offer familiar Java, Python, and SQL APIs for agent development

With FLIP-531, Apache Flink goes beyond orchestration and data preparation in Agentic AI environments. It now provides a native runtime to build, run, and manage autonomous AI agents at scale.

Developer Experience

Flink Agents will extend familiar Flink constructs. Developers can define agents using endpoints coming soon to Flink’s Table API or DataStream API. They can connect to endpoints of large language models (LLMs), register models, call tools, and manage context — all from within Flink. Sample APIs are already available for Java, Python (PyFlink), and SQL. These include support for:

Agent workflows with tools and prompts
UDF-based tool invocation
Integration with MCP and external model providers
Stateful agent logic and multi-step workflows

Roadmap Milestones

The Flink Agents project is moving fast with a clear roadmap focused on rapid delivery and community-driven development:

Q2 2025: MVP design finalized
Q3 2025: MVP with model support, replayability, and tool invocation
Q4 2025: Multi-agent communication and example agents
Late 2025: First formal release and community expansion

The team is prioritizing execution and fast iteration, with a GitHub-based development model and lightweight governance to accelerate innovation.

Event-Driven Flink Agents: The Future of Always-On Intelligence

The most impactful agents in the enterprise aren’t chatbots or assistants waiting for user input.
They are always-on components, embedded in infrastructure, continuously observing and acting on real-time business events. Apache Flink Agents are built for this model. Instead of waiting for a request, these agents run asynchronously as part of an event-driven architecture. They monitor streams of data, maintain memory, and trigger actions automatically — similar to observability agents, but for decision-making and automation.

This always-on approach is critical for modern use cases. Enterprises can’t afford delays in fraud detection, equipment failures, customer engagement, or supply chain response. Agents must act instantly — based on the data flowing through the system — not hours later via batch processing or human intervention. Apache Flink provides the ideal foundation. Its low-latency, stateful stream processing enables agents to:

Observe and react to real-time signals from events, APIs, databases, or external SaaS requests
Maintain state across workflows and business processes
Collaborate asynchronously
Trigger tools, request-response APIs, or downstream actions

These aren’t chatbot wrappers — they’re autonomous services embedded in production. They function as part of your business nervous system, adapting in real time to changing conditions and continuously improving outcomes. This architecture is what enterprises truly need: automation that is fast, reliable, and context-aware. It reduces time to detect and resolve issues, improves SLA adherence, and enables proactive decisions across the organization. Always-on, embedded agents are the future of AI in business. Apache Flink is ready to power them. Let's explore a few excellent use case examples across different industries in the next section.

Use Cases for Agentic AI with Apache Kafka and Flink Across Industries

Agentic AI use cases are emerging across industries. These systems demand real-time responsiveness, contextual intelligence, and full autonomy. Traditional REST APIs, manual orchestration, and batch jobs fall short in these environments. Instead, enterprises need infrastructure that is continuous, stateful, event-driven, and always-on (as discussed explicitly in the section above). Apache Kafka and Apache Flink already power the data backbone of many digital-native organizations. For example:

OpenAI uses Kafka and Flink to build the infrastructure behind its GenAI and Agentic AI offerings.
TikTok relies on Flink and Kafka for its real-time recommendation engine and online model training architecture.
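Even before the Flink Agents APIs land, the "trigger LLMs or external tools via UDFs" pattern referenced earlier can be prototyped with today's PyFlink Table API. The sketch below is illustrative only: the classification logic is a stub standing in for a model or tool call, the table is built from in-memory rows rather than a Kafka connector, and none of this is the FLIP-531 API.

Python

# Illustrative PyFlink sketch: invoking a "tool" from a UDF inside a Table API job.
from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.udf import udf


@udf(result_type=DataTypes.STRING())
def classify_intent(event_text: str) -> str:
    # In a real agent, this is where you would call an LLM or external tool
    # (for example, an HTTP request to a model endpoint). Stubbed for the sketch.
    if "fraud" in event_text.lower():
        return "fraud_alert"
    if "refund" in event_text.lower():
        return "refund_request"
    return "other"


def main():
    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
    t_env.create_temporary_function("classify_intent", classify_intent)

    # In production this table would come from a Kafka connector; in-memory rows
    # keep the sketch self-contained.
    events = t_env.from_elements(
        [(1, "Customer reports possible fraud on card"), (2, "Please process my refund")],
        DataTypes.ROW([DataTypes.FIELD("id", DataTypes.BIGINT()),
                       DataTypes.FIELD("payload", DataTypes.STRING())]),
    )
    t_env.create_temporary_view("events", events)

    t_env.sql_query(
        "SELECT id, classify_intent(payload) AS intent FROM events"
    ).execute().print()


if __name__ == "__main__":
    main()

Flink Agents aims to make this kind of wiring first-class, with agent memory, replayability, and protocol support built in rather than hand-rolled.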
With Flink Agents, developers can build:

Always-on and ReAct-style agents with structured workflows
Retrieval-augmented agents with semantic search
Long-lived, stateful agents with memory
Fully autonomous systems with tool and API access

Example Use Cases

Finance

Fraud detection and risk scoring
Compliance monitoring
Adaptive trading systems with feedback loops

Manufacturing

Predictive maintenance for industrial equipment
Smart factory optimization
Supply chain agents managing demand and inventory

Retail

Real-time product tagging and catalog enrichment
Personalized promotions based on customer behavior
Inventory rebalancing and logistics agents

Healthcare

Patient monitoring and alerting
Claims processing and document triage
Compliance audits

Telecommunications

Self-healing networks
Customer support automation with feedback loops
Dynamic QoS optimization

Gaming

Adaptive AI opponents that respond to player behavior
Dynamic content generation for evolving game environments
Real-time moderation for abuse and cheating detection

Public Sector

Traffic and energy optimization in smart cities
Automated citizen service assistants
Public safety agents for emergency detection and response

The Future of Agentic AI Is Event-Driven

The rise of Agentic AI means a shift in infrastructure priorities. It’s not enough to invest in model quality or prompt engineering. Enterprises must also modernize their data and execution layer. Point-to-point communication between agents is fine for direct interaction, but at scale, the real value comes from an event-driven backbone like Kafka and Flink that ensures information reliably reaches all required systems. Flink Agents offer a production-grade, enterprise-ready foundation for agentic systems. They turn brittle demos into reliable applications by providing:

Consistent real-time data via Kafka
Stateful, fault-tolerant execution via Flink
Standardized protocols via MCP and A2A
Developer productivity via familiar APIs

This combination reduces time to market, increases system resilience, and lowers operational costs. It gives developers and architects the tools to build agents like real software — scalable, testable, and observable. This shift is not a replacement for data lakes, lakehouses, or AI platforms. It complements them by enabling real-time, event-driven execution alongside batch and analytical workloads. The future of Agentic AI is event-driven. Apache Flink is ready to power it.