DZone Spotlight

Wednesday, March 11 View All Articles »

2026 Developer Research Report

By Carisse Dumaua

Hello, our dearest DZone Community! Last year, we asked you for your thoughts on emerging and evolving software development trends, your day-to-day as devs, and workflows that work best — all to shape our 2026 Community Research Report. The goal is simple: to better understand our community and provide the right content and resources developers need to support their career journeys. After crunching some numbers and piecing the puzzle together, alas, it is in (and we have to warn you, it's quite a handful)! This report summarizes the survey responses we collected from December 9, 2025, to January 27 of this year, and includes an overview of the DZone community, the stacks developers are currently using, the rising trend in AI adoption, year-over-year highlights, and so much more. Here are a few takeaways worth mentioning: AI use climbs this year, with 67.3% of readers now adopting it in their workflows.While most use multiple languages in their developer stacks, Python takes the top spot.Readers visit DZone primarily for practical learning and problem-solving. These are just a small glimpse of what's waiting in our report, made possible by you. You can read the rest of it below. 2026 Community Research ReportRead the Free Report We really appreciate you lending your time to help us improve your experience and nourish DZone into a better go-to resource every day. Here's to new learnings and even newer ideas! — Your DZone Content and Community team More

Optimizing Data Loader Jobs in SQL Server: Production Implementation Strategies

By arvind toorpu

CORE

Over the past 15 years working with SQL Server across multiple industries, I’ve seen data loading performance remain one of the most important — and most often underestimated — areas of database administration. Whether it’s nightly loads of millions of transactions, integrating data from multiple source systems, or moving terabytes of data between environments, inefficient load processes quickly lead to broader issues: missed SLAs, longer maintenance windows, outdated reports, and growing frustration from both users and leadership. The reality is that SQL Server provides a robust set of features and tools that can significantly improve data loading performance when used correctly. I’ve applied these techniques in financial services, healthcare, retail, and manufacturing environments, consistently achieving performance gains of three to ten times. In the sections that follow, I’ll walk through practical, production-tested approaches to accelerate data loading and make it a reliable part of your data platform. Choosing the Optimal Loading Method SQL Server provides multiple mechanisms for loading data, each with distinct performance characteristics. Bulk Copy Program (BCP) offers exceptional performance for loading from flat files directly into tables. At a manufacturing client, replacing a cursor-based import process with BCP reduced the daily inventory load from 2 hours to just 12 minutes. The command syntax is straightforward: PowerShell bcp Database.dbo.InventoryTable in "C:\Data\inventory.txt" -c -t, -S ServerName -T BULK INSERT provides similar performance benefits within T-SQL while integrating with stored procedures: SQL BULK INSERT Production.ProductInventory FROM '\\FileServer\Imports\Inventory.csv' WITH ( FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2 ); SQL Server Integration Services (SSIS) delivers the best performance for complex ETL operations requiring transformations. At a healthcare provider, replacing stored procedure-based ETL with SSIS packages reduced claims processing from 4 hours to 45 minutes. Key enhancers included the SSIS data flow engine's optimized buffer management and built-in parallelism. Configuring larger DefaultBufferMaxRows and DefaultBufferSize often improves throughput, particularly on memory-rich servers: XML <package MaxConcurrentExecutables="8"> <dataflow BufferTempStoragePath="E:\SSIS_Temp" DefaultBufferMaxRows="100000" DefaultBufferSize="104857600"> </dataflow> </package> Leveraging Table Partitioning Table partitioning is one of the most powerful strategies for improving load performance while maintaining data availability. At a financial services company processing billions of monthly transactions, implementing partition switching transformed their loading process. Instead of inserting directly into the main table and blocking queries, we loaded an identical staging table and instantly switched it into the partitioned main table: MS SQL -- Create staging table with same structure as target partition SELECT * INTO Sales.Transactions_Staging FROM Sales.Transactions WHERE 1 = 0; -- Load staging table (much faster than loading production table) BULK INSERT Sales.Transactions_Staging FROM '\\FileServer\Imports\NewTransactions.csv' WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n'); -- Switch staging table into partition in an instant operation ALTER TABLE Sales.Transactions SWITCH PARTITION 12 TO Sales.Transactions_Staging; This approach reduced their loading window from 6 hours to under 30 minutes while completely eliminating impact on concurrent queries. The key advantage of partition switching is that it's a metadata operation requiring minimal logging, regardless of the data volume involved. Optimizing Table and Index Design Table design significantly influences loading performance. For frequent bulk operations, consider clustered columnstore indexes, which can improve throughput by 3–5x compared to traditional rowstore tables. At a retail client with billions of transactions, implementing columnstore reduced daily load time from 5 hours to just over 1 hour: MS SQL CREATE CLUSTERED COLUMNSTORE INDEX CCI_SalesTransactions ON Sales.Transactions; Columnstore indexes work particularly well with SQL Server's bulk load operations because they can directly build compressed column segments without creating intermediate rowstore representations. This approach yields both faster loads and dramatically better query performance. For traditional rowstore tables, consider using fill-factor settings to leave space for future inserts, particularly for tables with sequential clustering keys: MS SQL CREATE CLUSTERED INDEX IX_OrderDate ON Sales.Orders(OrderDate) WITH (FILLFACTOR = 85); At an e-commerce company, this simple adjustment reduced page splits during high-volume order processing, decreasing load times by 22% while simultaneously improving overall database performance. Managing Indexes During Loads Index maintenance represents a significant performance overhead during data loading. At a logistics company loading millions of shipment records daily, disabling non-clustered indexes before the load and rebuilding them afterward reduced their processing time from 3 hours to 40 minutes: MS SQL -- Disable indexes before loading ALTER INDEX IX_Shipment_CustomerID ON Logistics.Shipments DISABLE; ALTER INDEX IX_Shipment_DeliveryDate ON Logistics.Shipments DISABLE; -- Load data (much faster without index maintenance) BULK INSERT Logistics.Shipments FROM '\\FileServer\Imports\Shipments.csv' WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n'); -- Rebuild indexes after loading ALTER INDEX IX_Shipment_CustomerID ON Logistics.Shipments REBUILD WITH (ONLINE = ON); ALTER INDEX IX_Shipment_DeliveryDate ON Logistics.Shipments REBUILD WITH (ONLINE = ON); The ONLINE option allows queries to continue accessing the table during index rebuilds, though with some performance impact. For maintenance windows where user access isn't critical, omitting this option further accelerates the rebuild process. For loading into tables with many indexes where disabling isn't practical, consider using the SORT_IN_TEMPDB option when creating or rebuilding indexes. This distributes I/O activity across multiple file groups: MS SQL CREATE INDEX IX_Transaction_CustomerProduct ON Sales.Transactions(CustomerID, ProductID) WITH (SORT_IN_TEMPDB = ON); Implementing Minimal Logging By default, SQL Server fully logs every row inserted, which can generate significant transaction log activity during large data loads. This often becomes a major bottleneck for high-volume inserts. Using minimally logged operations can greatly reduce log usage and improve load performance. In one energy-sector environment handling data from millions of smart meters, switching to minimal logging reduced load times by approximately 65 percent. To take advantage of minimal logging, the database must be in either the BULK_LOGGED or SIMPLE recovery model for the duration of the load operation. MS SQL ALTER DATABASE EnergyMetrics SET RECOVERY BULK_LOGGED; Then, use one of the bulk loading methods with the appropriate options: MS SQL -- Using BULK INSERT with minimal logging options BULK INSERT EnergyData.HourlyReadings FROM '\\FileServer\Imports\MeterReadings.csv' WITH ( DATAFILETYPE = 'char', FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK -- Acquires table lock, enabling minimal logging ); The critical TABLOCK hint enables minimal logging by acquiring an exclusive table lock. While this blocks concurrent modifications, it substantially accelerates the load process. For tables with a clustered index, ensure the data is pre-sorted to match the clustering key for optimal performance. After completing the bulk load, remember to switch the database back to FULL recovery model (if needed) and perform a log backup to allow log truncation: MS SQL ALTER DATABASE EnergyMetrics SET RECOVERY FULL; BACKUP LOG EnergyMetrics TO DISK = 'E:\Backups\EnergyMetrics_Log.bak'; Optimizing Batch Size and Parallelism Finding the optimal batch size for data loading operations can dramatically impact performance. Too small, and you incur excessive overhead; too large, and you risk excessive blocking and potential errors. Through extensive testing at a healthcare analytics company, we determined that batch sizes of 50,000-100,000 rows typically delivered optimal performance for most of their loading operations. When using SSIS, you can control batch size through the BatchSize property of the destination adapter. For T-SQL operations, you can implement batching in your code: MS SQL DECLARE @BatchSize INT = 75000; DECLARE @CurrentRow INT = 0; WHILE @CurrentRow < @TotalRows BEGIN INSERT INTO TargetTable (Column1, Column2, Column3) SELECT Column1, Column2, Column3 FROM StagingTable ORDER BY ID OFFSET @CurrentRow ROWS FETCH NEXT @BatchSize ROWS ONLY; SET @CurrentRow = @CurrentRow + @BatchSize; -- Commit each batch to prevent excessive transaction log growth COMMIT; END SQL Server's degree of parallelism settings also significantly impact loading performance. At a retail analytics client, adjusting the MAXDOP setting specifically for their loading process improved throughput by 45%: MS SQL -- Configure server-wide setting appropriately EXEC sp_configure 'max degree of parallelism', 8; RECONFIGURE; -- Override for specific load operation INSERT INTO Analytics.SalesMetrics SELECT /+ MAXDOP(6) / * FROM StagingDB.RawSalesData; The optimal MAXDOP setting depends on your server's core count and concurrent workload. As a starting point, I typically set MAXDOP to half the number of physical cores for dedicated ETL servers. SQL Server Configuration Optimization Several SQL Server configuration settings directly impact loading performance. At an insurance company with an intensive nightly ETL process, we made the following adjustments: MS SQL -- Increase memory allocated to buffer pool EXEC sp_configure 'max server memory (MB)', 98304; -- Configure tempdb with multiple files ALTER DATABASE tempdb ADD FILE ( NAME = 'tempdev2', FILENAME = 'E:\SQLData\tempdb2.ndf', SIZE = 8GB, FILEGROWTH = 1GB ); -- Repeat for tempdb3, tempdb4, etc. (one per physical core, up to 8) Properly configured tempdb files eliminated allocation contention during sort operations, while increased memory allocation improved buffer cache hit ratios. These changes reduced their overall ETL process from 7 hours to just under 4 hours. For servers dedicated to data loading, consider adjusting the cost threshold for parallelism: EXEC sp_configure 'cost threshold for parallelism', 25; RECONFIGURE; The default value of 5 often leads to excessive parallelism for medium-sized operations. Increasing this threshold ensures that only truly significant operations use parallel plans. For trace flags, I've found 1117 (grow all files in a filegroup equally) and 1118 (reduce SGAM contention) particularly helpful for loading performance in versions prior to SQL Server 2016. In newer versions, these optimizations are enabled by default. Implementing Resource Governor for Controlled Loading In environments where loading must occur concurrently with regular user workloads, SQL Server's Resource Governor can prevent ETL processes from overwhelming the system while still maintaining reasonable loading performance. At a financial services client, implementing Resource Governor enabled us to run their data loading processes during business hours without impacting trader applications: MS SQL -- Create a workload group for ETL processes CREATE WORKLOAD GROUP ETLGroup WITH ( MAX_DOP = 4, -- Limit parallelism REQUEST_MAX_MEMORY_GRANT_PERCENT = 25, -- Limit memory grants REQUEST_MAX_CPU_TIME_SEC = 0, -- No CPU time limit GROUP_MAX_REQUESTS = 0 -- No request limit ); -- Create a classifier function to identify ETL connections CREATE FUNCTION dbo.ClassifyETLLoad() RETURNS sysname WITH SCHEMABINDING AS BEGIN DECLARE @GroupName sysname; IF APP_NAME() LIKE 'ETL%' OR APP_NAME() LIKE 'SSIS%' SET @GroupName = 'ETLGroup'; ELSE SET @GroupName = 'default'; RETURN @GroupName; END; GO -- Apply the classifier function ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.ClassifyETLLoad); ALTER RESOURCE GOVERNOR RECONFIGURE; This configuration allowed their ETL processes to run during business hours with controlled resource usage, eliminating a 6-hour nightly processing window and enabling near-real-time analytics for their trading desk. Monitoring and Continuous Optimization Effective monitoring provides critical insights for ongoing optimization. I typically create custom monitoring scripts to track key loading metrics: MS SQL SELECT r.session_id, r.status, r.command, r.wait_type, r.wait_time, r.last_wait_type, r.cpu_time, r.logical_reads, r.writes, st.text AS batch_text, qp.query_plan FROM sys.dm_exec_requests r CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) st CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) qp WHERE r.session_id > 50 -- Exclude system sessions AND r.program_name LIKE '%ETL%'; At a retail client, this monitoring revealed unexpected tempdb contention during their product catalog loads. By moving their staging tables to a separate filegroup and implementing partitioning, we reduced their loading time by 35%. Wait statistics analysis often reveals hidden bottlenecks. At an e-commerce company, examining wait types during their order processing loads showed excessive WRITELOG waits: MS SQL SELECT wait_type, wait_time_ms, waiting_tasks_count FROM sys.dm_os_wait_stats WHERE wait_type LIKE 'WRITE%' ORDER BY wait_time_ms DESC; The solution involved moving transaction log files to faster storage and implementing multiple log files on separate drives, reducing their processing time by 28%. Conclusion Improving SQL Server data loading performance requires a balanced approach that considers loading techniques, table design, index strategy, and server configuration together. In practice, the best results come from combining multiple methods and adjusting them to fit the specific workload and environment. Based on years of working with SQL Server systems, data loading optimizations consistently deliver some of the highest returns compared to other tuning efforts. The process should start with benchmarking the existing load to establish a clear baseline. From there, changes should be applied methodically, with performance measured after each adjustment. When these improvements are applied together, they can significantly reduce load times, support more frequent data refreshes, tighten service-level agreements, and enable broader analytical use cases. No two environments behave exactly the same. Techniques that work well in one system may have limited impact in another. For that reason, all changes should be validated in a development or test environment, introduced gradually into production, and continuously monitored. With disciplined testing and careful analysis, data loading can be transformed from a performance bottleneck into a reliable and scalable component of the overall data platform. More

Trend Report

Database Systems

Every organization is now in the business of data, but they must keep up as database capabilities and the purposes they serve continue to evolve. Systems once defined by rows and tables now span regions and clouds, requiring a balance between transactional speed and analytical depth, as well as integration of relational, document, and vector models into a single, multi-model design. At the same time, AI has become both a consumer and a partner that embeds meaning into queries while optimizing the very systems that execute them. These transformations blur the lines between transactional and analytical, centralized and distributed, human driven and machine assisted. Amidst all this change, databases must still meet what are now considered baseline expectations: scalability, flexibility, security and compliance, observability, and automation. With the stakes higher than ever, it is clear that for organizations to adapt and grow successfully, databases must be hardened for resilience, performance, and intelligence. In the 2025 Database Systems Trend Report, DZone takes a pulse check on database adoption and innovation, ecosystem trends, tool usage, strategies, and more — all with the goal for practitioners and leaders alike to reorient our collective understanding of how old models and new paradigms are converging to define what’s next for data management and storage.

Refcard #388

Threat Modeling Core Practices

By Apostolos Giannakidis

CORE

Refcard #401

Getting Started With Agentic AI

By Lahiru Fernando

AI in Patient Portals: From Digital Access to Intelligent Healthcare Experiences

Patient portals across mobile, web, and kiosk platforms have become the primary digital touchpoints between healthcare organizations and patients. The inception of these portals began with digitizing paper check-in forms and has evolved into full-fledged mobile and web applications that allow patients to view lab results, schedule appointments, and communicate with providers. As patient expectations rise — along with advances in consumer technology — traditional rule-based portals are no longer sufficient. This is where Artificial Intelligence (AI) is transforming patient portals from static systems into intelligent, adaptive healthcare experiences. In this article, I explore how AI is being applied in modern patient portals, like the ones in our healthcare organization, why it matters, and what engineering leaders should consider when introducing AI into healthcare-grade digital platforms. The Limitations of Traditional Patient Portals Despite widespread digital adoption, many patient portals still suffer from common issues that healthcare organizations must address: Complex navigation that frustrates users, especially elderly patients who are not familiar with technologyContinued dependence on call centers for basic questions and clarificationsFront-desk support still required for scheduling doctor appointmentsReactive engagement instead of proactive care support These challenges are not just UX problems — they directly impact patient satisfaction, clinician workload, and operational costs. AI offers a practical path forward by addressing these limitations without requiring complete platform rewrites. Where AI Fits Naturally in Patient Portals AI is beginning to fit naturally into patient portals, making them more helpful and easier to use while supporting better care delivery. Instead of static screens and long wait times for answers, AI features can respond to patient questions instantly, guide users through tasks, and provide personalized support. Explaining Complex Results For example, if a lab report shows an unfamiliar value like “eGFR: 52,” an AI-enabled portal can explain what that measurement represents and why it is monitored. It can also clarify normal ranges and suggest general next steps a patient might discuss with their provider. Simplifying Medical Terminology The portal can translate complex medical terms into easy-to-understand language. Preparing for Doctor Visits After reviewing lab results, patients might ask: “My glucose level is elevated — could that be related to my recent prescription changes?”“I’m concerned about my blood pressure. What should I ask my doctor about medications or lifestyle changes?” AI can help generate relevant questions so patients arrive better prepared. Scheduling Follow-Up Care AI-enabled portals can present multiple appointment options and alternative suggestions to help patients quickly book convenient times. Intelligent Virtual Assistants Intelligent virtual assistants go beyond traditional chatbots. These AI-powered assistants embedded within patient portals can handle: Appointment scheduling and reschedulingPrescription refill guidanceInsurance and billing-related questionsPre-visit instructions and reminders Personalized Patient Experiences Every patient’s journey is different. AI enables portals to move from static dashboards to context-aware personalization, such as: Highlighting relevant actions based on recent visitsAdjusting content based on chronic conditionsSurfacing reminders aligned with care plansDelivering personalized education materials This level of personalization improves engagement without overwhelming patients with unnecessary information. Predictive Engagement and Proactive Care AI models can analyze historical interaction data to identify patterns such as: Missed appointmentsDelayed follow-upsGaps in preventive care Using these insights, patient portals can proactively nudge patients at the right time and through the right channel, reducing no-shows and improving adherence. Clinical Workflow Support The goal is not to replace clinicians. Instead, AI within patient portals can assist them indirectly by: Structuring symptom inputs before visitsSummarizing patient-submitted messagesFlagging high-priority requestsReducing administrative burden This allows care teams to focus on clinical decision-making while AI handles triage support — without crossing into unsafe automation. Engineering Considerations for AI-Driven Patient Portals Engineering considerations are critical when implementing AI in patient portals to ensure optimized healthcare delivery and patient engagement. A primary focus must be data security and patient trust. Data Privacy and Trust Are Non-Negotiable Healthcare AI must be designed with: HIPAA-compliant data handlingExplicit consent boundariesAuditability and traceabilityClear patient communication about AI usage Architecture Matters More Than Algorithms In real-world patient portals, AI works best when built as decoupled, service-oriented components — often using event-driven or serverless architectures. This approach enables: Independent iteration of AI capabilitiesSafe rollback of featuresControlled exposure to web and mobile clientsBackward compatibility with existing systems Measuring Success The success of AI in patient portals should not be measured by model complexity, but by real-world outcomes such as: Reduced call-center volumeImproved appointment adherenceFaster response timesHigher patient satisfaction scoresLower clinician burnout The Road Ahead AI will not replace patient portals — but it will redefine the patient experience. Future portals will function less like digital filing cabinets and more like intelligent care companions, helping patients navigate healthcare systems that are often fragmented and overwhelming. For healthcare organizations, the challenge is not whether to adopt AI, but how to do so responsibly, securely, and incrementally — without compromising trust or safety. When implemented thoughtfully, AI has the potential to make patient portals not just more efficient, but genuinely more human. Let’s not be afraid — instead, let’s be bold and embrace the evolution of technology to advance our industry and our profession.

By Muhammed Harris Kodavath

The Inner Loop Is Eating The Outer Loop

For as long as most of us have been building software, there has been a clean split in the development lifecycle: the inner loop and the outer loop. The inner loop is where a developer lives day to day. Write code, run it locally, check if it works, iterate. It is fast, tight, and personal. The outer loop is everything after you push. Continuous integration pipelines, integration tests, staging deployments, and code review. It is comprehensive but slow, and for good reason. Running your entire test suite against every keystroke would be insane. So we optimized: fast feedback locally, thorough validation later. This split was not some grand architectural decision. It was a pragmatic response to a real constraint. Comprehensive validation testing against real dependencies in a realistic environment was slow and expensive. So developers made a tradeoff to sacrifice thoroughness for speed in the inner loop and defer the real testing to continuous integration (CI). Write a unit test, mock a dependency or two, and move on. The comprehensive stuff runs later, in a pipeline, and you deal with failures when they show up. Sometimes hours later. Sometimes the next day. That tradeoff only made sense when we had no alternative. Now, the model is evolving into a single loop where validation happens at every stage of the software development lifecycle (SDLC). The Constraint That Created Two Loops Is Breaking The inner and outer loop split was never about two fundamentally different kinds of work. It was about a limitation: you could not perform comprehensive validation fast enough to be part of the development loop. Integration testing meant spinning up services, provisioning databases, and waiting for environments. That was a 15-minute-to-hours proposition, not a seconds proposition. So it got batched into CI. Now, infrastructure has caught up. Ephemeral environments can spin up in seconds, giving you real integration testing against actual dependencies on a branch, pre-merge. There is no wait. The technical barrier to comprehensive but fast validation is gone. Continuous Delivery Becomes Practical for Everyone The idea of pushing smaller units of code to production more frequently is not new, but most teams still struggle to pull it off in distributed, cloud-native architectures. In a microservices architecture, testing a small change properly means validating it against multiple downstream consumers. Historically, this meant slow environment provisioning, waiting in a queue for a staging spot, or relying on mocked dependencies. To cope, teams batched changes, running massive integration suites nightly or weekly. When something broke, debugging spanned days of commits. With access to fast, comprehensive ephemeral environments, continuous delivery becomes highly practical. A developer can make a focused change, spin up a sandbox that routes traffic through the modified service, validate against real dependencies in seconds, and push it forward. The per-change cost of validation drops low enough that batching becomes unnecessary. Debugging is vastly simplified because the blast radius is limited to a single small, well-understood change. Ultimately, the path from code written to running in production shrinks from days to hours. For Agents, Fast Validation Is a Critical Infrastructural Change This merging of the loops is an exciting evolution for software development as a whole. But for teams implementing agentic workflows at scale, it is a structural necessity. Agents are now writing most of our code, and they have a very different relationship with validation than humans do. Fast feedback is not a preference for agents. It is essential. An agent does not get frustrated waiting for tests, but the speed and fidelity of feedback directly impact what an agent can accomplish. An agent that can validate a change against real services in 10 seconds will iterate 30 times in the window, whereas an agent waiting on a five-minute environment spin-up iterates once. Speed is not just a quality of life thing for agents. It is a throughput multiplier. Humans traded thoroughness for speed because those two things were in tension. You could have fast but shallow local mocked tests or slow but thorough CI integration tests. Pick one. In fast, ephemeral environments, agents do not face that trade-off. They get comprehensive validation at inner-loop speed. They can test against real dependencies, real services, and real data flows to validate the behavior of their changes in seconds. What Agents Do With Fast, Comprehensive Environments When an agent picks up a coding task with access to the right environment and tools, the workflow looks nothing like the old inner and outer loop divide. The agent writes code, then validates it. It does not use mocked unit tests, but rather tests against real dependencies in an ephemeral environment that spins up in seconds. It finds a problem, fixes it, and validates again. It might run through this cycle dozens of times before a PR ever exists. Each iteration is both fast and thorough. Then the agent goes further. It reviews its own code, or has another agent review it. It checks edge cases. It verifies that the change works correctly within the broader dependency graph. All of this happens on a branch, pre-merge, in seconds per cycle. By the time anything gets pushed toward main, it has already been through a level of validation that most traditional pipelines would envy. The outer loop has very little left to catch, allowing CI to act as a lightweight, continuous feedback mechanism for the agent rather than a heavy, delayed gatekeeper. Code Review Gets Absorbed Into the Workflow Here is another piece of the outer loop that is collapsing inward: code review. Agentic code review is quickly becoming standard. But the interesting shift is not just that AI can review code. It is that the review becomes part of the agent's own development loop rather than a separate phase. An agent writes code, validates it in a sandbox, reviews the change, addresses issues, and re-validates. Only then does it create a PR. By the time a developer sees a PR, if they need to see it at all, the mechanical quality issues are already resolved. The PR becomes less of a gate to check work and more of a record of what was done, how it was validated, and the evidence that it works. Developer review does not disappear entirely. Architecture decisions, security-sensitive changes, and novel approaches still benefit from human judgment. But the outer-loop review bottleneck, where PRs sit in a queue waiting for an overloaded engineer to context-switch into reviewer mode, largely goes away. The Tooling Ceiling Becomes the Agent Ceiling If this thesis is right, and if the inner loop really is absorbing the outer loop, it creates a very clear bottleneck. The quality of environments and tools available to the agent. An agent with only local unit testing will catch local bugs. Give it access to fast ephemeral environments with real dependency graphs, and it catches integration issues, configuration drift, and behavioral regressions. Give it access to performance benchmarks, security scanners, and observability data, and it catches even more. This shifts where the highest-leverage infrastructure investment is. Instead of building more elaborate post-merge CI pipelines, the winning bet is making comprehensive, realistic validation available pre-merge. It must be fast enough and cheap enough that agents can use it on every iteration, not just on PR submission. Conclusion The organizations that figure this out first and invest in giving agents fast, comprehensive, pre-merge validation will be the ones that actually achieve continuous delivery. With validation happening continuously, the outer loop becomes part of the inner loop. CI becomes more lightweight, serving as one of several layers of validation and feedback in a true continuous delivery flow. The inner loop is merging with the outer loop. The question is not whether this shift is happening. It is whether your validation tooling is ready for it.

By Arjun Iyer

Failure Handling in AI Pipelines: Designing Retries Without Creating Chaos

Retries have become an integral part of the AI tools or systems. In most systems I have seen, teams usually approach failures with blanket retrying. This often yields duplicate work, cost spikes, wasted compute, and operational instability. Every unnecessary retry triggers another inference call, an embedding request, or a downstream write, without improving the outcome. In most early-stage AI tools, the pattern is that if a request fails, a retry is added. If the retry succeeds intermittently, then the logic is considered sufficient. This approach works fine until the application is in the test environment or in low-user-usage mode; as soon as the application sees higher traffic and concurrent execution, retries begin to dominate system behavior. The consequences like these become visible: Increased token usage and costInconsistent latencyRepeated processing of the same JobWorkers look busy, but the queues are not drainingNon-meaningful logs To avoid these consequences, AI tools must treat failures as structured states and respond appropriately to their nature. At a minimum, failures should be categorized into 3 broad categories. 1. Transient (Retryable) Temporary failures should be retried with appropriate backoff. For example, timeouts, HTTP 429 rate limits, 5xx upstream errors, short-lived network interruptions, etc. 2. Permanent (Non-Retryable) For these, retries won't change the outcome and should not be retried. For example, invalid payload, schema mismatch, missing required fields, authentication errors, incorrect model configuration, API key failures, policy violations, etc. 3. Unknown (Quarantine) Any failures that cannot be confidently classified into the two categories above should be marked as unknown. These should not be retried indefinitely. These require controlled handling, often through quarantine or dead-letter routing. For example, inconsistent upstream data, unexpected response structures, edge cases, exceptions, etc. Let's understand this with a real-world AI application. Consider an AI-based data enrichment workflow inside a multi-tenant SaaS platform. A typical job within this workflow is structured as: Step 1: The system receives source dataStep 2: An LLM is invoked to normalize or enrich selected fields.Step 3: The enriched output is written to a database.Step 4: An event is emitted for downstream indexing or analytics. This flow appears to be straightforward. The complexity arises when any of the individual step fails. These complexities can be anything. The ideal response to these complexities should depend on the nature of the failure. A few examples of these complexities are: LLM returns a 429 rate-limit response. In this case, the workflow should retry with bounded backoff.LLM returns a 503 temporary outage. In this case, retrying may also be reasonable.Payload is missing a required field, such as title. In this case, retrying will not resolve the issue; the job should be marked failed with a clear reason.Tenant configuration lacks a required model name. In this case, it is a configuration error rather than a transient failure, so no retry is needed.Database write times out. In this case, retry behavior depends on idempotency guarantees and write semantics. Simple and Powerful Production-Friendly Failure Model We should have failure records that operators can read and understand. For example: JSON { "job_id": "job_84721", "tenant_id": "tenant_A", "stage": "LLM_CALL", "status": "FAILED", "failure_type": "TRANSIENT", "reason": "RATE_LIMIT", "http_status": 429, "attempts": 3, "next_action": "RETRY", "timestamp": "2026-02-12T16:10:00Z" } To understand it better, let’s look at this code defining failure classification and retry policy. Step 1: Defining the Failure Types and Classification Python import random import time from dataclasses import dataclass from typing import Optional, Dict, Any class FailureType: TRANSIENT = "TRANSIENT" # retryable NON_RETRYABLE = "NON_RETRYABLE" UNKNOWN = "UNKNOWN" @dataclass class Failure: failure_type: str reason: str http_status: Optional[int] = None detail: Optional[str] = None def classify_failure(err: Exception, http_status: Optional[int] = None) -> Failure: """ Classify failures into TRANSIENT / NON_RETRYABLE / UNKNOWN. Keep this logic small and explicit. """ # Common transient HTTP statuses if http_status in (408, 429, 500, 502, 503, 504): reason = "RATE_LIMIT" if http_status == 429 else "UPSTREAM_UNAVAILABLE" return Failure(FailureType.TRANSIENT, reason, http_status=http_status) # Auth/config errors are usually permanent until fixed if http_status in (401, 403): return Failure(FailureType.NON_RETRYABLE, "AUTH_OR_PERMISSION", http_status=http_status) # Bad request / schema problems are usually permanent if http_status in (400, 404, 422): return Failure(FailureType.NON_RETRYABLE, "BAD_REQUEST_OR_SCHEMA", http_status=http_status) # Known local validation errors if isinstance(err, ValueError): return Failure(FailureType.NON_RETRYABLE, "INPUT_VALIDATION", detail=str(err)) # Everything else: quarantine unless you have a reason to retry return Failure(FailureType.UNKNOWN, "UNCLASSIFIED_EXCEPTION", detail=str(err)) Step 2: Retry Policy With Exponential Backoff and Jitter Python @dataclass class RetryPolicy: max_attempts: int = 5 base_delay_sec: float = 0.5 # initial delay max_delay_sec: float = 15.0 # cap jitter_ratio: float = 0.2 # +/- 20% randomness def compute_backoff(policy: RetryPolicy, attempt: int) -> float: # Exponential backoff: base * 2^(attempt-1), capped delay = min(policy.base_delay_sec * (2 ** (attempt - 1)), policy.max_delay_sec) # Add jitter to avoid synchronized retries jitter = delay * policy.jitter_ratio return max(0.0, delay + random.uniform(-jitter, jitter)) Step 3: A Wrapper That Applies Classification and Policy Python def run_with_failure_handling( *, job_id: str, tenant_id: str, stage: str, policy: RetryPolicy, fn, fn_kwargs: Dict[str, Any] ) -> Dict[str, Any]: """ Runs a single stage (e.g., LLM call) with: - classification - bounded retries - backoff + jitter """ last_failure: Optional[Failure] = None for attempt in range(1, policy.max_attempts + 1): try: return fn(**fn_kwargs) except Exception as e: # If your fn can provide http_status, pass it in explicitly. http_status = getattr(e, "http_status", None) failure = classify_failure(e, http_status=http_status) last_failure = failure # Decide what to do next if failure.failure_type == FailureType.NON_RETRYABLE: return { "job_id": job_id, "tenant_id": tenant_id, "stage": stage, "status": "FAILED", "failure_type": failure.failure_type, "reason": failure.reason, "http_status": failure.http_status, "attempts": attempt, "next_action": "STOP" } if failure.failure_type == FailureType.UNKNOWN: # Conservative choice: do not retry unknown failures forever. # Quarantine after 1 attempt (or 2 if you prefer). return { "job_id": job_id, "tenant_id": tenant_id, "stage": stage, "status": "FAILED", "failure_type": failure.failure_type, "reason": failure.reason, "http_status": failure.http_status, "attempts": attempt, "next_action": "QUARANTINE" } # Transient: retry if attempts remain if attempt < policy.max_attempts: delay = compute_backoff(policy, attempt) time.sleep(delay) continue # Ran out of attempts return { "job_id": job_id, "tenant_id": tenant_id, "stage": stage, "status": "FAILED", "failure_type": failure.failure_type, "reason": failure.reason, "http_status": failure.http_status, "attempts": attempt, "next_action": "DLQ" } # Should not reach here, but return last known state return { "job_id": job_id, "tenant_id": tenant_id, "stage": stage, "status": "FAILED", "failure_type": (last_failure.failure_type if last_failure else FailureType.UNKNOWN), "reason": (last_failure.reason if last_failure else "UNKNOWN"), "attempts": policy.max_attempts, "next_action": "DLQ" } Failure Handling and Idempotency Failure handling and idempotency are a pair. Idempotency prevents duplicates from retries, whereas failure handling prevents retries from becoming chaotic. If the retry logic is aggressive and jobs are not idempotent, the usage cost will be high, as there will be duplicate inference calls and duplicate DB writes, leading to a confusing state. If the retry logic is disciplined and jobs are idempotent, the system becomes predictable: retries resolve to state checks, only one execution wins, and operators can reprocess failures intentionally. Closing Thoughts In summary, retries are not the enemy for any AI tool; uncontrolled retries are. A production-grade AI tool shouldn’t just retry because of failure; it should understand why the job failed and should retry with discipline when retry proves to be beneficial and stops when it doesn’t.

By Aditya Gupta

Software Testing in LLMs: The Shift Towards Autonomous Testing

I wanted to unpack a simple, clear reality on intelligent testing in the large language models (LLM) era. LLMs redefine software testing principles by accelerating intelligent testing across the entire SDLC, enabling autonomous test generation, self-verifying AI agents, and true shift-left quality across build and deployment pipelines. Why Are LLMs a Testing Game-Changer? The "why" cuts to the heart of testing's oldest challenges: People write tests. People maintain flaky scripts. People explore complex systems. These tasks are deeply rooted in language (specifications, bug reports, code) and reasoning (what to test next, why something failed). LLMs have learned the patterns of code, natural language, and logical discourse from a vast corpus of human knowledge. They can now participate in the intellectual work of testing. If my test suite is a sprawling, fragile beast, an LLM can help me refactor it. If I'm faced with a new, undocumented API, an LLM can help me explore it and hypothesize test scenarios. If a CI pipeline fails at 2 AM with a cryptic error, an LLM can triage it. We're not using them to replace testers, but to augment our cognitive reach. They automate the translation of thought into action, turning a risk idea into a test script, a failure trace into a diagnosis. This frees us to focus on higher-order strategy: designing better test oracles, understanding system risk, and guiding truly autonomous testing agents. That's the game-changer. What Is an LLM in a Testing Context? Let's start with a core notion: In testing, an LLM is a reasoning engine for quality. Forget the chatbot box. Think of it as a new kind of testing tool. It doesn't "know" your application. It doesn't "understand" quality in a human sense. Instead, it has learned a statistical map of how concepts like "login test," "boundary value," "race condition," or "XPath selector" relate to billions of lines of code, bug reports, and testing tutorials. I have to ask: How can it create a valid test if it's never seen my app? This is the shift. It's not recalling a specific test. It's synthesizing a new one by following the patterns of what test code, logical steps, and descriptive language look like. When you prompt, "Write a Playwright test for a login flow that includes an invalid password attempt," it predicts the most probable sequence of code tokens and actions that matches that request, much like a senior tester drawing on a lifetime of experience to draft a new test case. The tester's role evolves from authoring every script to orchestrating and validating the output of this reasoning engine. The LLM becomes a force multiplier. How We "Program" This Testing Engine: The New Art of the Test Prompt Our primary interface is the prompt. This is where testing skill meets AI interaction. My initial model was simple: "Write a test for X." But I learned by doing, just like exploratory testing. For example: Weak prompt:"Test the checkout page." This prompt gets a generic, likely useless script.Context-rich prompt:"Act as a security-focused QA. Given this HTML snippet of our checkout form, identify three key risks for a fraudulent transaction. For the top risk, generate a Puppeteer script that demonstrates it. Assume the card number field uses custom validation." The output from this prompt is targeted, insightful, and actionable. I'm not just asking; I'm setting a testing mission. I provide context (HTML, user stories), assign a testing role ("performance engineer," "accessibility auditor"), specify techniques ("use equivalence partitioning"), and demand a specific output format. This is meta-testing. I compare the LLM's output to my mental model of good testing. I refine, iterate, and guide. The prompt becomes the test charter for an AI co-pilot. From Automation to Autonomy: The Evolving "Models" of Testing LLMs are introducing new layers into our testing architecture: The Script Generator: This is entry-level. Translating natural language descriptions into executable test code (Selenium, Playwright, Cypress). It kills boilerplate.The Intelligent Explorer: Here's where autonomy begins. An LLM-powered agent explores applications via Model Context Protocol (MCP), an open standard connecting AI models to external tools and data for better responses. It clicks, observes, infers state, and decides next steps dynamically. "This looks like a data grid; let's test sorting and filtering", mimicking exploratory testing at machine speed.The Analyst and Diagnostician: This is crucial. When a test fails, the LLM can analyze the stack trace, logs, video, and DOM snapshot. It can hypothesize the root cause: "The element wasn't found because a loading overlay is still present. The script needs an explicit wait for the overlay to disappear." It turns CI/CD failures into actionable insights.The Adaptive Test Manager: The future is systems where LLMs don't just write and run tests, but manage them. They can prioritize tests based on code changes, cluster similar failures, suggest flakiness fixes, and even generate "tests for your tests" to improve coverage. What Does "Testing" Become in This Era? The practice is splitting, much like the shift from manual to automated testing before it: LLM-augmented scripted testing: Enhancing traditional automation. "Maintain this test suite," "Convert these 100 manual test cases into API tests," "Generate performance test data." It's about scale and efficiency.LLM-driven exploratory testing: This is the frontier. Here, the tester defines a mission and constraints, and an LLM-powered agent executes a unique, adaptive exploration path. Each session is different. The tester's job is to analyze the agent's findings, refine the mission, and build new models. It's a collaborative, investigative loop. New Testing Techniques for the LLM Era New skills are emerging: Prompt engineering for testing: This is the new test case design. Being precise about scope, context, risk, and expected output format.Context engineering: Using retrieval-augmented generation (RAG) to ground the LLM in your specific context, your codebase, your bug database, your API docs. This turns a generic LLM into a domain expert on the system.Orchestration and validation: Designing the systems and guardrails that let LLM agents operate safely. Writing the "tests for the AI tester" and validating its outputs is now a critical testing activity. Conclusion This is a high-level map of the changing testing landscape. The key takeaways: LLMs are reasoning allies that translate testing intuition into action at unprecedented scale.The tester's role is shifting from sole executor to strategic orchestrator and validator of AI-assisted processes.The goal is evolving from automated execution (running scripts) to augmented intelligence (LLM-powered exploration) and, ultimately, toward guided autonomy (self-adapting test systems).The core of testing remains: critical thinking, risk assessment, and a relentless curiosity about the system. LLMs provide a powerful new lens through which to apply that thinking. Just as software testing has always been about learning the reality of the system, testing in the LLM era is about learning to partner with a new kind of intelligence. We build a shared model, test its boundaries, and evolve together. And that's what software testing is becoming.

By Kathiresan Jayabalan

Best Practices to Make Your Data AI-Ready

The key problem organizations encounter when implementing AI is not the technology itself, but the data needed to feed AI models. Many companies have plenty of data, but when it comes to quality, it often turns out to be messy, inconsistent, or biased. If you want your AI investments to deliver real value, you must make your data AI‑ready first. Below, I share some best practices for building an AI-ready culture and establishing a data management framework that ensures high-quality data pipelines for AI initiatives. Start with Understanding Which Data You Need AI readiness begins with use cases. You need to understand what type and how much data you require to build an efficient data analytics platform. Start by defining how AI will change a specific process, decision, or metric for your company. A good AI data strategy aligns data usage with business goals. This approach prevents you from investing time and resources in cleaning data you won’t use. Trust me, it can greatly optimize costs for your AI projects. Once you have defined your use cases, you need to specify the exact data requirements, including formats, fields, latency, and more. A common mistake I see is making vague statements instead of focused specifications. For example, “customer data” is too broad; it’s better to divide it into specific fields like “customer ID,” “email address,” and “signup date.” This makes validation concrete and automatable. Build Strong Data Governance and Ownership One thing I know for sure is that AI projects fail fast if no one owns the data quality process. You need someone in your organization accountable for field definitions, data catalogs, access policies, and quality metrics. Without clear ownership, data changes often go unnoticed. Governance should also enforce role‑based access, encryption standards, and lineage tracking so that data is traceable from source to model input. These factors help you comply with policies like GDPR while also reducing risk in AI decision-making. Use Metadata and Catalogs to Make Data Discoverable Metadata helps you quickly understand what each dataset contains, how it was created, and how it changes over time. This makes data easy to find for analysts and AI engineers. Build or use a data catalog that: indexes tables, schemas, and fields;documents ownership and definitions;tracks lineage and usage Metadata catalogs also serve as the basis for trust and reproducibility. When someone knows exactly where a dataset came from and how it has been transformed, they can validate that the model is working with reliable inputs. Maintain a Central Data Platform Data silos are a common problem for most organizations. Implementing data analysis in healthcare, I experienced this firsthand. Data tied up in departmental systems slows discovery and increases fragmentation. I don’t say that you need the “everything goes here” system. This would be risky. But you need a data management layer that allows you to find, query, and monitor data from a single place. Think of it like a shared library. Start by registering your most critical datasets, but not everything at once. Document ownership, field definitions, refresh frequency, and known quality issues. Standardize access through shared query interfaces, whether teams use SQL, APIs, or other tools. Also, build quality checks directly into pipelines, adding validation rules for freshness, completeness, and schema changes at ingestion. Track and Improve Quality Continuously AI models require fresh data to retrain, so ensuring data quality is an ongoing process. Automate checks and set thresholds that trigger alerts. This allows your team to intervene before issues become costly problems. If a pipeline breaks or a critical field starts missing values, you should know before a model retrains on bad data. Once models are live, monitor their outputs and link them back to data quality signals. If a model consistently makes errors tied to certain data fields, trace the issue back and fix it upstream. Test AI Readiness Before Full Deployment Implementing AI iteratively has become best practice. The same applies to testing data for AI readiness. Before committing to full production, run small pilot projects to validate that data quality is sufficient and measure whether the dataset actually supports the business use case. In one project I worked on, we tried to build an employee attrition model using HR system data and moved too quickly toward implementation. We assumed core fields like job level, manager ID, and role history were reliable. During model testing, we realized that role changes were overwritten instead of tracked over time. As a result, the model learned misleading patterns. We had to step back, redesign the data model, and introduce proper history tracking before continuing. Pilot tests like this help catch gaps and adjust quality standards without significant risk. Wrapping Up AI success depends on data that is complete, accurate, and structured. Models trained on partial or inconsistent data will perform poorly and produce misleading results. In this article, I intentionally didn’t focus on cleaning and preparing a specific dataset, but rather on building a framework for effective data management in organizations pursuing AI projects. To see real results from your AI initiatives, ensure a consistent and reliable data flow. This reduces costly errors and transforms data into a strategic asset rather than just a byproduct of operations.

By Mykhailo Kopyl

Square, SumUp, Shopify: Data Streaming for Real-Time Point-of-Sale (POS)

Point-of-Sale (POS) systems are no longer just cash registers. They are becoming real-time, connected platforms that handle payments, manage inventory, personalize customer experiences, and feed business intelligence. Small and medium-sized merchants can now access capabilities once reserved for enterprise retailers. Mobile payment platforms like Square, SumUp, and Shopify make it easy to sell anywhere and integrate sales channels seamlessly. At the same time, data streaming technologies such as Apache Kafka and Apache Flink are transforming retail operations. They enable instant insights and automated actions across every store, website, and supply chain partner. This post explores the current state of mobile payment solutions, the role of data streaming in retail, how Kafka and Flink power POS systems, the SumUp success story, and the future impact of Agentic AI on the checkout experience. Mobile Payment and Business Solutions for Small and Medium-Sized Merchants The payment landscape for small and medium-sized merchants has undergone a rapid transformation. For years, accepting card payments meant expensive contracts, bulky hardware, and complex integration. Today, companies like Square, SumUp, and Shopify have made mobile payments simple, mobile, and affordable. Block (Square) offers a unified platform that combines payment processing, POS systems, inventory management, staff scheduling, and analytics. It is especially popular with small retailers and service providers who value flexibility and ease of use. SumUp started with mobile card readers but has expanded into full POS systems, online stores, invoicing tools, and business accounts. Their solutions target micro-merchants and small businesses, enabling them to operate in markets that previously lacked access to digital payment tools. Shopify integrates its POS offering directly into its e-commerce platform. This allows merchants to sell in physical stores and online with a single inventory system, unified analytics, and centralized customer data. These companies have blurred the lines between payment providers, commerce platforms, and business management systems. The result is a market where even the smallest shop can deliver a payment experience once reserved for large retailers. Data Streaming in the Retail Industry Retail generates more event data every year. Every scan at a POS, every online click, every shipment update, and every loyalty point redemption is a data event. In traditional systems, these events are collected in batches and processed overnight or weekly. The problem is clear: by the time insights are available, the opportunity to act has often passed. Data streaming solves this by making all events available in real time. Retailers can instantly detect low stock in a store, trigger replenishment, or offer dynamic discounts based on current shopping patterns. Fraud detection systems can block suspicious transactions before completion. Customer service teams can see the latest order updates without contacting the warehouse. In previous retail industry examples, data streaming has powered: Omnichannel inventory visibility for accurate stock counts across stores and online channels.Dynamic pricing engines that adjust prices based on demand and competitor activity.Personalized promotions triggered by live purchase behavior.Real-time supply chain monitoring to handle disruptions immediately. Emerging Trend: Unified Commerce The next stage beyond omnichannel is Unified Commerce. Here, all sales channels — physical stores, online shops, mobile apps, marketplaces, and social commerce — operate on a single, real-time data foundation. Instead of integrating separate systems after the fact, every transaction, inventory update, and customer interaction flows through one unified platform. Data streaming technologies like Apache Kafka make Unified Commerce possible by ensuring all touchpoints share the same up-to-date information instantly. This enables consistent pricing, seamless cross-channel returns, accurate loyalty balances, and personalized experiences no matter where the customer shops. Unified Commerce turns fragmented retail technology into a single, connected nervous system. Data Streaming with Apache Kafka and Flink for POS in Retail In an event-driven retail architecture, Apache Kafka acts as the backbone. It ingests payment transactions, inventory updates, and customer interactions from multiple channels. Kafka ensures these events are stored durably, replayable for compliance, and available to downstream systems within milliseconds. Apache Flink adds continuous stream processing capabilities. For POS use cases, this means: Running fraud detection models in real time, with alerts sent instantly to the cashier or payment gateway.Aggregating sales data on the fly to power live dashboards for store managers.Updating loyalty points immediately after a purchase to improve customer satisfaction.Ensuring that both physical stores and e-commerce channels reflect the same stock levels at all times. Together, Kafka and Flink create a foundation for operational excellence. They enable a shift from manual, reactive processes to automated, proactive actions. Using data streaming at the edge for POS systems enables ultra-low latency processing and local resilience, but scaling and managing it across multiple locations can be challenging. Running data streaming in the cloud offers central scalability and simplified governance, though it depends on reliable connectivity and may introduce slightly higher latency. SumUp: Real-Time POS at Global Scale with Data Streaming in the Cloud SumUp processes millions of transactions per day across more than 30 countries. To handle this scale and maintain high availability, they adopted an event-driven architecture powered by Apache Kafka and fully managed Confluent Cloud. In the Confluent customer story, SumUp explains how Kafka has allowed them to: Process every payment event in real time.Maintain a unified data platform across regions, ensuring compliance with local payment regulations.Scale easily to handle seasonal transaction spikes without service interruptions.Speed up developer delivery cycles by providing event data as a service across teams. Implementing Critical Use Cases Across the Business More than 20 teams at SumUp now rely on Confluent Cloud to deliver mission-critical capabilities. Global Bank Tribe: Operates SumUp’s banking and merchant payment services. Real-time data streaming keeps transaction records updated instantly in merchant accounts. Reusable data products improve resilience for high-volume processes such as 24/7 monitoring, fraud detection, and personalized recommendations.CRM Team: Delivers customer and product information to operational teams in real time. Moving away from batch processing creates a smoother customer experience and enables data sharing across the organization.Risk Data and Machine Learning Platform: Feeds standardized, near-real-time data into machine learning models. These models make decisions on the freshest data available, improving outcomes for both teams and merchants. By embedding Confluent Cloud across multiple domains, SumUp has turned event data into a shared asset that drives operational efficiency, customer satisfaction, and innovation at scale. For merchants, this means faster transaction confirmations, improved reliability, and new digital services without downtime. The Future of POS and Impact of Agentic AI The POS of tomorrow will be more than a payment device. It will be a connected intelligence hub. Agentic AI, with autonomous systems capable of proactive decision-making, will play a central role. Future capabilities could include: AI-driven recommendations for upsells, customized to each shopper’s behavior and context.Predictive inventory replenishment that automatically places supplier orders when stock is low.Automated fraud prevention that adapts in real time to emerging threats.Dynamic loyalty program offers tailored at the exact moment of purchase. When Agentic AI is powered by real-time event data from Kafka and Flink, decisions will be both faster and more accurate. This will shift POS systems from passive endpoints to active participants in business growth. For small and medium-sized merchants, this evolution will unlock capabilities previously available only to enterprise retailers. The result will be a competitive, data-driven retail landscape where agility and intelligence are built into every transaction.

By Kai Wähner

CORE

Hands-On With Kubernetes 1.35

Kubernetes 1.35 was released on December 17, 2025, bringing significant improvements for production workloads, particularly in resource management, AI/ML scheduling, and authentication. Rather than just reading the release notes, I decided to test these features hands-on in a real Azure VM environment. This article documents my journey testing four key features in Kubernetes 1.35: In-place pod vertical scaling (GA)Gang scheduling (Alpha)Structured authentication configuration (GA)Node declared features (Alpha) All code, scripts, and configurations are available in my GitHub repository for you to follow along. Test Environment Setup: Cloud: Azure VM (Standard_D2s_v3: 2 vCPU, 8GB RAM)Kubernetes: v1.35.0 via MinikubeContainer runtime: containerdCost: ~$2 for full testing sessionRepository: k8s-135-labs Why Azure VM instead of local? Testing on cloud infrastructure provides production-like conditions and helps identify real-world challenges you might face during deployment. Feature 1: In-Place Pod Vertical Scaling (GA) Theory: The Resource Management Problem Traditional Kubernetes pod resizing has a critical limitation: it requires pod restart. Old Workflow: User requests more CPU for podPod must be deletedNew pod created with updated resourcesApplication downtimeState lost (unless persistent storage) For production workloads, this causes: Service interruptionsLost in-memory stateLonger scaling timesComplex orchestration needed What's New in K8s 1.35 In-place pod vertical scaling (now GA) allows resource changes without pod restart: YAML apiVersion: v1 kind: Pod spec: containers: - name: app resources: requests: cpu: "500m" memory: "256Mi" limits: cpu: "1000m" memory: "512Mi" resizePolicy: - resourceName: cpu restartPolicy: NotRequired # No restart for CPU! - resourceName: memory restartPolicy: RestartContainer # Memory needs restart Key innovation: Different restart policies for different resources. CPU changes typically don't require restart, while memory might. Hands-On Testing Repository: lab1-in-place-resize I created an automated demo script that simulates a real-world scenario: Scenario: Application scaling up to handle increased load Initial (Light Load): 250m CPU, 256Mi memoryTarget (Peak Load): 500m CPU, 1Gi memoryIncrease: 2x CPU, 4x memory Shell # Run the automated demo ./auto-resize-demo.sh Auto-resize script output showing 250m →500m and Memory 256Mi → 1Gi Results: CPU doubled (250m → 500m) without restartMemory quadrupled (256Mi → 1Gi) without restartRestart count: 0Total time: 20 seconds Critical Discovery: QoS Class Constraints During testing, I encountered an important limitation that's not well-documented: The error: Plain Text The Pod "qos-test" is invalid: spec: Invalid value: "Guaranteed": Pod QOS Class may not change as a result of resizing QoS error message when trying to resize only requests What I learned: Kubernetes has three QoS classes: Guaranteed: requests = limitsBurstable: requests < limitsBestEffort: no requests/limits The rule: In-place resize cannot change QoS class. Wrong (fails): YAML # Initial: Guaranteed QoS requests: { cpu: "500m" } limits: { cpu: "500m" } # Resize attempt: Would become Burstable requests: { cpu: "250m" } limits: { cpu: "500m" } # QoS change! Correct (works): YAML # Resize both proportionally requests: { cpu: "250m" } limits: { cpu: "250m" } # Stays Guaranteed Production Impact Before K8s 1.35: Plain Text Monthly cost for 100 Java pods: - Startup: 2 CPUs × 5 minutes = wasted during idle - Scaling event: Pod restart required - Result: Over-provisioned or frequent restarts After K8s 1.35: Plain Text Monthly cost for 100 Java pods: - Dynamic: High CPU during startup, low during steady-state - Scaling: No restarts needed - Result: 30-40% cost savings observed in testing Key Takeaways Production-ready: GA status means stable for critical workloadsReal savings: 30-40% cost reduction for bursty workloadsQoS constraint: Plan resource changes to maintain QoS classFast: Changes apply in seconds, not minutes Best use cases: Java applications (high startup, low steady-state)ML inference (variable load)Batch processing (scale down after processing) Feature 2: Gang Scheduling (Alpha) Theory: The Distributed Workload Problem Modern AI/ML and big data workloads often require multiple pods to work together. Traditional Kubernetes scheduling treats each pod independently, leading to resource deadlocks: The problem: Shell PyTorch Training Job: Needs 8 GPU pods (1 master + 7 workers) Cluster: Only 5 GPUs available What happens: ├─ 5 worker pods scheduled → Consume all GPUs ├─ Master + 2 workers pending ├─ Training cannot start (needs all 8) ├─ 5 GPUs wasted indefinitely └─ Other jobs blocked This is called partial scheduling — some pods run, others wait, nothing works. What Is Gang Scheduling? Gang Scheduling ensures a group of pods (a "gang") schedules together atomically: Shell Training Job: Needs 8 GPU pods Cluster: Only 5 GPUs available With Gang Scheduling: ├─ All 8 pods remain pending ├─ No resources wasted ├─ Smaller jobs can run └─ Once 8 GPUs available → all schedule together Key principle: All or nothing. Implementation Challenge Kubernetes 1.35 introduces a native Workload API for gang scheduling (Alpha), but I discovered it requires feature gates that caused kubelet instability: YAML # Attempted native approach --feature-gates=WorkloadAwareScheduling=true # Result: kubelet failed to start Error: "context deadline exceeded" Solution: Use scheduler-plugins — the mature, production-tested implementation. Hands-On Testing Repository: lab2-gang-scheduling Setup: YAML # Automated installation ./setup-gang-scheduling.sh # What it installs: # 1. scheduler-plugins controller # 2. PodGroup CRD # 3. RBAC permissions Key discovery: Works with default Kubernetes scheduler — no custom scheduler needed! Test 1: Small Gang (Success Case) YAML apiVersion: scheduling.x-k8s.io/v1alpha1 kind: PodGroup metadata: name: training-gang spec: scheduleTimeoutSeconds: 300 minMember: 3 # Requires 3 pods minimum YAML # Create 3 pods with the gang label for i in {1..3}; do kubectl apply -f training-worker-$i.yaml done Result: Plain Text NAME READY STATUS AGE training-worker-1 1/1 Running 6s training-worker-2 1/1 Running 6s training-worker-3 1/1 Running 6s All pods scheduled within 1 second of each other! PodGroup status: YAML status: phase: Running running: 3 Test 2: Large Gang (All-or-Nothing) Now let's prove gang behavior by creating a gang that's too large: YAML apiVersion: scheduling.x-k8s.io/v1alpha1 kind: PodGroup metadata: name: large-training-gang spec: minMember: 5 YAML # Create 5 pods requesting 600m CPU each # Total: 3000m (exceeds our 2 vCPU VM) for i in {1..5}; do kubectl apply -f large-training-$i.yaml done All 5 pods staying pending, proving all-or-nothing behavior Result: Plain Text NAME READY STATUS AGE large-training-1 0/1 Pending 15s large-training-2 0/1 Pending 15s large-training-3 0/1 Pending 15s large-training-4 0/1 Pending 15s large-training-5 0/1 Pending 15s Event: Plain Text Warning FailedScheduling 60s default-scheduler 0/1 nodes are available: 1 Insufficient cpu Perfect gang behavior: All pending, no partial scheduling, no wasted resources! Comparison: With vs Without Gang Scheduling Scenariowithout gangwith gangSmall gang (3 pods, enough resources)Schedule individuallyAll schedule togetherLarge gang (5 pods, insufficient resources)❌ Partial: 2-3 Running, rest PendingAll remain PendingResource efficiencyWasted (partial gang can't work)Efficient (resources available for other jobs)Deadlock preventionNo protectionProtected Production Considerations Alpha feature warning: Not recommended for production yetScheduler-plugins is the mature alternativeNative API will improve in K8s 1.36+ Production alternatives: Volcano SchedulerKAI Scheduler (NVIDIA)Kubeflow with scheduler-plugins Key Takeaways Critical for AI/ML: Distributed training needs gang schedulingPrevents deadlocks: All-or-nothing prevents resource wasteWorks today: scheduler-plugins is production-readyAlpha status: Native API needs maturation Best use cases: PyTorch/TensorFlow distributed trainingApache Spark jobsMPI applicationsAny multi-pod workload Feature 3: Structured Authentication Configuration (GA) Theory: The Authentication Configuration Challenge Traditional Kubernetes authentication uses command-line flags on the API server: Shell kube-apiserver \ --oidc-issuer-url=https://accounts.google.com \ --oidc-client-id=my-client-id \ --oidc-username-claim=email \ --oidc-groups-claim=groups \ --oidc-username-prefix=google: \ --oidc-groups-prefix=google: Problems: Command lines become extremely longDifficult to validate before restartNo schema validationHard to manage multiple auth providersRequires API server restart for changes What's New in K8s 1.35 Structured authentication configuration moves auth config to YAML files: YAML apiVersion: apiserver.config.k8s.io/v1beta1 kind: AuthenticationConfiguration jwt: - issuer: url: https://accounts.google.com audiences: - my-kubernetes-cluster claimMappings: username: claim: email prefix: "google:" groups: claim: groups prefix: "google:" Benefits: Clear, structured formatSchema validationVersion controlledEasy to manage multiple providersBetter error messages Hands-On Testing Repository: lab3-structured-auth ⚠️ Warning: This lab modifies the API server configuration. While safe in minikube, this is risky in production without proper testing. The challenge: Modifying API server configuration requires editing static pod manifests — get it wrong and your cluster breaks. My approach: Create backup firstTest in disposable minikubeVerify thoroughly before production Test: GitHub Actions JWT Authentication I configured the API server to accept JWT tokens from GitHub Actions: YAML apiVersion: apiserver.config.k8s.io/v1beta1 kind: AuthenticationConfiguration jwt: - issuer: url: https://token.actions.githubusercontent.com audiences: - kubernetes-test claimMappings: username: claim: sub prefix: "github:" Implementation steps: Plain Text # 1. Create auth config cat > /tmp/auth-config.yaml <<EOF [config above] EOF # 2. Copy to minikube minikube cp /tmp/auth-config.yaml /tmp/auth-config.yaml # 3. Backup API server manifest minikube ssh sudo cp /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/backup.yaml # 4. Add authentication-config flag sudo vi /etc/kubernetes/manifests/kube-apiserver.yaml # Add: --authentication-config=/tmp/auth-config.yaml API server manifest showing authentication-config flag added API Server Restart: The API server automatically restarts when the manifest changes: Shell kubectl get pods -n kube-system -w | grep kube-apiserver Verification: Shell # Check authentication-config flag is active minikube ssh "sudo ps aux | grep authentication-config" Process showing --authentication-config=/tmp/auth-config.yaml flag API verification: Shell # Check authentication API is available kubectl api-versions | grep authentication Result: Shell authentication.k8s.io/v1 Success! Structured authentication is working. Before/After Comparison Before: YAML spec: containers: - command: - kube-apiserver - --advertise-address=192.168.49.2 - --authorization-mode=Node,RBAC After: YAML spec: containers: - command: - kube-apiserver - --authentication-config=/tmp/auth-config.yaml # NEW! - --advertise-address=192.168.49.2 - --authorization-mode=Node,RBAC Multiple Providers Example The structured format makes multiple auth providers easy: YAML apiVersion: apiserver.config.k8s.io/v1beta1 kind: AuthenticationConfiguration jwt: - issuer: url: https://token.actions.githubusercontent.com audiences: [kubernetes-test] claimMappings: username: {claim: sub, prefix: "github:"} - issuer: url: https://accounts.google.com audiences: [my-cluster] claimMappings: username: {claim: email, prefix: "google:"} - issuer: url: https://login.microsoftonline.com/{tenant-id}/v2.0 audiences: [{client-id}] claimMappings: username: {claim: preferred_username, prefix: "azuread:"} Key Takeaways Production-ready: GA status, safe for critical clustersBetter management: Clear structure beats command-line flagsMulti-provider: Easy to configure multiple identity providersRequires restart: API server must restart to load config Best use cases: Organizations with multiple identity providersComplex authentication requirementsDynamic team structuresCompliance requirements Feature 4: Node Declared Features (Alpha) Theory: The Mixed-Version Cluster Problem During Kubernetes cluster upgrades, you typically have a rolling update: Plain Text Cluster During Upgrade: ├─ node-1 (K8s 1.34) → Old features ├─ node-2 (K8s 1.34) → Old features ├─ node-3 (K8s 1.35) → New features ✅ └─ node-4 (K8s 1.35) → New features ✅ The challenge: Scheduler doesn't know which nodes support which featuresPods using K8s 1.35 features might land on 1.34 nodes → FailManual node labeling requiredHigh operational overhead What Is Node Declared Features? Nodes automatically advertise their supported Kubernetes features: Plain Text status: declaredFeatures: - GuaranteedQoSPodCPUResize - SidecarContainers - PodReadyToStartContainersCondition Benefits: Automatic capability discoverySafe rolling upgradesIntelligent schedulingZero manual configuration Hands-On Testing Repository: lab4-node-features This Alpha feature requires enabling a feature gate in kubelet configuration. Initial state: Shell kubectl get --raw /metrics | grep NodeDeclaredFeatures Result: Shell kubernetes_feature_enabled{name="NodeDeclaredFeatures",stage="ALPHA"} 0 Feature disabled by default. Enabling the Feature Shell minikube ssh # Backup kubelet config sudo cp /var/lib/kubelet/config.yaml /tmp/backup.yaml # Edit kubelet config Add feature gate: YAML apiVersion: kubelet.config.k8s.io/v1beta1 featureGates: NodeDeclaredFeatures: true # ADD THIS authentication: anonymous: enabled: false Kubelet config after (featureGates added)] Restart kubelet: Shell sudo systemctl restart kubelet sudo systemctl status kubelet Verification Shell # Check node now declares features kubectl get node minikube -o jsonpath='{.status.declaredFeatures}' | jq Result: JSON [ "GuaranteedQoSPodCPUResize" ] Success! The node is advertising its capabilities! The Connection to Lab 1 Notice something interesting? The declared feature is GuaranteedQoSPodCPUResize - the exact capability we tested in Lab 1! What this means: Node running K8s 1.35 knows it supports in-place pod resizingAdvertises this capability automaticallyScheduler can route pods requiring this feature hereOlder nodes (K8s 1.34) wouldn't declare this feature Testing Feature-Aware Scheduling YAML # Create a pod kubectl apply -f feature-aware-pod.yaml YAML # Check scheduling kubectl get pod feature-aware-pod Result: Plain Text NAME READY STATUS RESTARTS AGE feature-aware-pod 1/1 Running 0 7s Complete test flow showing feature declared, pod created, and successfully scheduled] Pod successfully scheduled on feature-capable node! Future: Smart Scheduling In future Kubernetes versions (when this reaches Beta/GA), you'll be able to: YAML apiVersion: v1 kind: Pod metadata: name: resize-requiring-app spec: affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: node.kubernetes.io/declared-feature-InPlacePodVerticalScaling operator: Exists # Only schedule on nodes with this feature containers: - name: app image: myapp:latest Key Takeaways Automatic discovery: Nodes advertise capabilities without manual configSafe upgrades: Mixed-version clusters handled intelligentlyFeature connection: Links to Lab 1 in-place resize capabilityAlpha status: Requires feature gate, not production-ready Best use cases: Rolling cluster upgradesMixed-version environmentsFeature-dependent workloadsTesting new capabilities Lessons Learned: What Worked and What Didn't Challenges Encountered Alpha features are tricky Native Workload API caused kubelet failuresSolution: Used mature scheduler-plugins insteadLesson: Alpha doesn't mean "almost ready" QoS constraints not well-documented Spent time debugging resize failuresDiscovered QoS class immutability requirementLesson: Test thoroughly, document findings API server modifications are risky Required careful backup strategyMinikube made recovery easyLesson: Always test in disposable environments first What Worked Well GA features are solid In-place resize: FlawlessStructured auth: No issuesBoth ready for production Scheduler-plugins maturity More reliable than native Alpha APIsProduction-tested by many organizationsLesson: Mature external projects > Alpha native features Azure VM testing environment Realistic conditionsEasy to resetCost-effective (~$2 total)Lesson: Cloud VMs ideal for feature testing Production Readiness Assessment Ready for Production 1. In-place pod vertical scaling (GA) Stable, tested, documentedReal cost savings (30-40%)Clear constraints (QoS preservation)Recommendation: Deploy to production now 2. Structured authentication configuration (GA) Mature, well-designedBetter than command-line flagsRequires API server restartRecommendation: Use for new clusters, migrate existing ones carefully Use With Caution ⚠️ 3. Gang scheduling (Alpha) Native API unstableUse scheduler-plugins instead (production-ready)Essential for AI/ML workloadsRecommendation: Use scheduler-plugins, not native API 4. Node Declared Features (Alpha) Requires feature gateLimited current valueWill be critical when GARecommendation: Wait for Beta/GA unless testing upgrades Cost and Time Investment Testing Environment Costs Azure VM: Standard_D2s_v3Duration: 8 hours of testingCompute cost: ~$0.77 (VM stopped between sessions)Storage cost: ~$0.10Total: Less than $1 for comprehensive testing Time Investment activitytimeEnvironment setup30 minLab 1 (In-place resize)1.5 hoursLab 2 (Gang scheduling)2 hoursLab 3 (Structured auth)1 hourLab 4 (Node features)1.5 hoursDocumentation1.5 hoursTotal8 hours ROI: Knowledge gained far exceeds time invested. Testing prevented production issues. Recommendations for Your Kubernetes Journey If You're Running K8s 1.34 or Earlier Upgrade path: 1.34 → 1.35 is straightforwardFocus on GA features first: In-place resize, structured authTest in dev/staging: Use my repository as starting pointMeasure impact: Track cost savings from in-place resize If You're Running AI/ML Workloads Implement gang scheduling immediately: Use scheduler-pluginsTest distributed training: Prevent resource deadlocksMonitor scheduling: Ensure all-or-nothing behavior workingPlan for native API: Will mature in K8s 1.36+ If You're Managing Large Clusters Structured auth: Migrate now for better managementRolling upgrades: Plan for node feature declaration (future)Cost optimization: In-place resize reduces over-provisioningMulti-tenancy: Gang scheduling prevents noisy neighbor issues Complete Repository All code, scripts, and detailed instructions are available: GitHub: https://github.com/opscart/k8s-135-labs Each lab includes: Detailed theory and backgroundStep-by-step instructionsAutomated scripts where possibleTroubleshooting guidesProduction recommendationsRollback procedures Conclusion Kubernetes 1.35 brings meaningful improvements to production workloads: For cost optimization: In-place pod resize delivers real savings (30-40% in my tests)Eliminates over-provisioning for bursty workloadsNo application changes required For AI/ML workloads: Gang scheduling prevents resource deadlocksEssential for distributed trainingScheduler-plugins provides production-ready solution For operations: Structured authentication simplifies managementNode declared features will improve rolling upgradesBetter observability and debugging The bottom line: K8s 1.35 GA features are production-ready and deliver immediate value. Alpha features show promising future directions but need more maturation. Connect: Blog: https://opscart.comGitHub: https://github.com/opscartLinkedIn: linkedin.com/in/shamsherkhan Other projects: Kubectl-health-snapshot – Kubernetes Optimization Security Validatork8s-ai-diagnostics – Kubernetes AI Diagnostics References Kubernetes 1.35 Release NotesKEP-1287: In-Place Pod Vertical ScalingScheduler-Plugins DocumentationKEP-3331: Structured Authentication ConfigurationKEP-4568: Node Declared Features

By Shamsher Khan

CORE

Modern State Management: Signals, Observables, and Server Components

State management is a critical aspect of modern web applications. In the Angular ecosystem, reactivity has long been powered by observables (RxJS), a powerful but sometimes complex paradigm. Angular’s recent introduction of signals provides a new, intuitive reactivity model to simplify UI state handling. Meanwhile, frameworks like React are exploring server components that push some state management to the server side. This article compares these approaches: observables, signals, and server components, and when to use each in modern development. Observables and RxJS in Angular An observable represents a stream of values over time rather than a single static value. Angular makes heavy use of observables, for example, HttpClient methods return observables and NgRx uses observables for its global store. Instead of storing one value, an observable can emit a sequence of values asynchronously, and components subscribe to react to each emission. Observables follow a push model. When a new value is available, it’s pushed to all subscribers (listeners). For example: TypeScript const count$ = new BehaviorSubject(0); const doubleCount$ = count$.pipe(map(x => x * 2)); doubleCount$.subscribe(val => console.log('Double:', val)); count$.next(5); // Console: "Double: 10" In the snippet above, count$ is an observable holding a number (a BehaviorSubject with an initial value) and doubleCount$ is a derived observable that always emits twice the value of count$. When we call count$.next(5), subscribers of doubleCount$ receive the new value 10. When to use observables: Observables excel at asynchronous and event-driven scenarios. They are ideal for cases where data arrives over time, or you have multiple events to coordinate. Use observables for things like user input streams, live data updates from a server, or complex workflows that involve timing (debouncing, buffering, etc.). RxJS provides many operators to transform and combine streams, which is powerful for managing complex sequences. Trade-offs: The flexibility of observables comes at the cost of added complexity. You must subscribe to an observable to get its values, which introduces boilerplate. Debugging a chain of observable transformations can be challenging, and the paradigm of thinking in streams has a learning curve. For a simple UI state that doesn’t truly require asynchronous streams, using observables can feel like overkill. Signals: Fine-Grained Reactivity in Angular Signals are a newer reactive primitive that holds a single value and notifies dependents when that value changes. A signal is like a state variable that is reactive by default. Signals use a pull-based model, where you read the signal’s value directly, and Angular tracks this. When you update the signal, any code (or template) that reads it is automatically updated on the next cycle. For example, using signals in Angular: TypeScript import { signal, computed, effect } from '@angular/core'; const count = signal(0); const doubleCount = computed(() => count() * 2); effect(() => console.log('Double:', doubleCount())); count.set(5); // Console: "Double: 10" Here count is a writable signal initialized to 0. We define doubleCount as a computed signal that always equals count() * 2. The effect acts similarly to an observable subscription, running whenever doubleCount (and thus count) changes. When count.set(5) is called, the effect logs the new doubled value (10). All of this happens with no manual subscription or unsubscription. Angular handles dependency tracking and updates automatically. When to use signals: Signals are great for synchronous, local state in your Angular components or services. They shine in cases like form field states, toggles, counters, selection indicators, or any scenario where you have a value that the UI directly reflects. Signals make these cases simpler by removing RxJS boilerplate. You set a value, and the UI reacts. They also enable fine-grained updates; only parts of the UI, depending on a particular signal, will update when it changes, which can improve performance. Signals can replace many uses of BehaviorSubject or Observable for holding simple state. However, signals are not suited for sequences of events or asynchronous streams. If you need to handle an HTTP response, a timer, or a stream of user events, it’s better to use an observable for that and then update a signal with the result or use Angular’s interop helpers to bridge between them. In summary, use signals for the state, which is a single source of truth, and observables for data that evolves over time. Server Components and Server-Side State Server Components represent an architecture where some components run on the server instead of the client. React’s server components (RSC) are a recent example: for the first time, React components can execute entirely on the server and deliver pre-rendered HTML to the browser. The browser receives UI output (HTML/string data), not the component code, so it has less JavaScript to download and execute. The key benefit of this approach is performance. By rendering on the server, you can keep large libraries or heavy computations out of the client bundle; only static HTML is sent down, with no need to ship those libraries to the browser. Server components are also closer to your data sources, making data fetching more efficient and keeping sensitive data or keys safely on the server. Another benefit is that the server can cache and reuse rendered results for multiple users. However, server-side state is fundamentally different from client state. It’s ephemeral once the HTML is generated and sent to the browser; the user cannot directly trigger changes in that server-rendered UI without a round-trip to the server. In essence, server-side state is immutable during a render cycle; changing it won’t trigger re-renders in the browser. Therefore, purely server-rendered components work best for static or read-only parts of the UI. In practice, an application will mix server and client rendering. With React RSC, you might render the shell of a page and some data-heavy list via server components, then use client components for interactive pieces on that page. Angular’s analogue is traditional server-side rendering (SSR) using Angular Universal. SSR pre-renders the initial HTML on the server for a fast first paint, but after that, Angular takes over on the client with the full app. Unlike React’s RSC, Angular’s SSR still needs to send the entire app bundle to the browser for hydration, so the performance gain is mostly on the first load. React’s Server Components push the boundary further by never sending certain component code to the client at all. Both approaches aim to improve performance by leveraging the server, but they require thinking carefully about what can be rendered ahead of time versus what needs to be interactive. Choosing the Right Approach Each state management strategy has strengths, and they often complement each other Observables (RxJS): Use for asynchronous data streams and complex event handling. If you have values that change over time or multiple event sources, observables are the go-to solution. They come with many operators for filtering and combining data streams. Be mindful to manage subscriptions to avoid leaks and keep code maintainable.Signals: Use for a local state that represents a single value or a snapshot of data at a time. Signals simplify cases like toggling UI elements, tracking form input values, or deriving one piece of state from another without the overhead of RxJS. They make reactive code more straightforward in these cases. In general, use signals when you don’t need the full power of observables; you’ll write less code for the same result. If an asynchronous operation is involved, you might use an observable for the async part and then update a signal with the outcome.Server components/SSR: Use server-driven rendering to optimize initial load and offload heavy computation from the client. In Angular, use Universal to render pages on the server for a quick first paint. The result is faster performance and less JavaScript to download. Just balance it with client-side needs; interactive parts must still use client-side state (signals or observables), so server rendering is best for content that can be largely static on load. Conclusion Modern applications can benefit from all three approaches. In Angular, signals and observables often work together: “Signals for local, synchronous UI state and computed values observables for asynchronous workflows and complex streams,” and Angular’s SSR can handle the initial rendering on the server. Knowing when to use each approach will help you create applications that are efficient, maintainable, and highly responsive to the user.

By Bhanu Sekhar Guttikonda

Consensus in Distributed Systems: Understanding the Raft Algorithm

Consider a group of friends planning a weekend outing. To make the trip successful, they need consensus on the location, schedule, and budget. Typically, one person is chosen as the leader — responsible for decisions, tracking expenses, and keeping everyone informed, including any new members who join later. If the leader steps down, the group elects another to maintain continuity. In distributed computing, clusters of servers face a similar challenge — they must agree on shared state and decisions. This is achieved through Consensus Protocols. Among the most well-known are Viewstamped Replication (VSR), Zookeeper Atomic Broadcast (ZAB), Paxos, and Raft. In this article, we will explore Raft — designed to be more understandable while ensuring reliability in distributed systems. Consensus in Distributed Computing Consensus in its simplest form refers to a general agreement. In the weekend outing analogy, it refers to all friends agreeing on a location. It's quite likely that several options are considered before the group eventually agrees on a particular location. In distributed computing, too, one or more nodes may propose values. Of all these values, one of them needs to be agreed upon by all the nodes. It's up to the consensus algorithm to decide upon one of these values and propagate the decision to all the nodes. Formally, a consensus algorithm must satisfy the following properties: Uniform agreement – All the nodes agree upon the same value — even if the node itself has proposed a different value initially.Integrity – Once a value is agreed upon by the node, it shouldn’t change.Validity – If a node agrees to a value, it must have been proposed at least by one other node too.Termination – Eventually, every participating node agrees upon a value. The uniform agreement and integrity form the core idea of consensus — everyone agrees on the same value, and once decided, it's final. The validity property ensures the elimination of trivial behavior wherein a node agrees to a value irrespective of what has been proposed. The termination property ensures fault tolerance. If one or more nodes fails the cluster should progress and eventually agree upon a value. This also eliminates the possibility of a dictator node that takes all decisions and jeopardizes the whole cluster in case it fails. Of course, if all the nodes fails the algorithm can’t proceed. There is a limit to the number of failures an algorithm can tolerate. An algorithm that can correctly guarantee consensus amongst n nodes of which at most t fail is said to be t-resilient. In essence, the termination property can be termed as a liveness guarantee, while the remaining three are safety guarantees. Raft Raft stands for Reliable, Replicated, Redundant, and Fault-Tolerant, reflecting its design principles in distributed systems. It ensures reliability by maintaining consistent logs, replication across nodes for durability, redundancy to avoid single points of failure, and fault tolerance to continue operating despite crashes or network issues. Together, these qualities make Raft a robust consensus algorithm for distributed computing. Explanation Raft utilizes a leader approach to achieve consensus. In a Raft cluster, a node is either a leader or a follower. A node could also be a candidate for a brief duration when a leader is unavailable, i.e., leader election is underway. The cluster has one and only one elected leader, which is fully responsible for managing log replication on the other nodes of the cluster. It means that the leader can decide between them and the other nodes without consulting the other nodes. A leader leads until it fails or disconnects, in which case remaining nodes elect a new leader. Fundamentally, the consensus problem is broken into two independent sub-problems in Raft: Leader Election and Log Replication. Leader Election Leader election in Raft occurs when the current leader fails or during initialization. Each election begins a new term, a time period in which a leader must be chosen. A node becomes a candidate if it doesn’t receive heartbeats from a leader within the election timeout. It then increments the term, votes for itself, and requests votes from others. Nodes vote once per term, on a first-come, first-served basis. A candidate wins if it secures a majority; otherwise, initiating a new term and election. Randomized timeouts reduce split votes by staggering candidate starts, ensuring quicker resolution and stable leadership through heartbeat messages. Raft is not Byzantine fault tolerant; the nodes trust the elected leader, and the algorithm assumes all participants are trustworthy. Log Replication The leader manages client requests and ensures consistency across the cluster. Each request is appended to the leader’s log and sent to followers. If followers are unavailable, the leader retries until replication succeeds. Once a majority of followers confirm replication, the entry is committed, applied to the leader’s own state, and considered durable. This also commits prior entries, which followers then apply to their own state, maintaining log consistency across cluster. In case a leader crashes, inconsistencies may arise if some entries were not fully replicated. A new leader resolves this by reconciling logs. It identifies the last matching entry with each follower, deletes conflicting entries in their logs, and replaces them with its own. Thus ensuring consistency even after failures. Additional Considerations Raft algorithm has below additional consideration for a robust consensus algorithm for distributed computing. Safety Guarantee Raft ensure below safety guarantees: Election safety – at most one leader can be elected in a given term.Leader append-only – a leader can only append new entries to its logs (it can neither overwrite nor delete entries).Log matching – if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.Leader completeness – if a log entry is committed in a given term then it will be present in the logs of the leaders since this term.State safety – if a node has applied a particular log entry to its state , then no other node may apply a different command for the same log. Cluster Membership Changes Raft handles cluster membership changes using joint consensus, a transitional phase where both old and new configurations overlap. During this phase, log entries must be committed to both sets, leaders can come from either, and elections require majorities from both. Once new configuration is replicated to a majority of its nodes, the system fully transitions. Raft also addresses the three challenges: New nodes without logs are excluded from majorities until caught up.Leaders not in new configuration step down to followers.Nodes still with old configuration that still recognize a leader ignore disruptive vote requests. Log Compaction Log compaction in Raft works by nodes taking snapshots of committed log entries, storing them with the last index and term. Leaders send these snapshots to lagging nodes, which then discard their log entirely or truncate it up to the snapshot’s latest entry. This also ensures durability in Raft. Limitations of Raft Raft has its own limitations with trades off scalability and flexibility as compared to other consensus algorithms. Leader bottleneck – Raft relies heavily on a single leader to coordinate log replication. If the leader fails, the system pauses until a new leader is elected, which can slow progress.Scaling – Raft doesn’t scale well to very large clusters — leader elections and log replication become slower and riskier as the number of nodes grows.Network partitions – this can cause temporary unavailability, since Raft prioritizes consistency over availability. An edge case exist where the elected leader is forced to resign and leadership switched between nodes continuously. Thus, forcing whole cluster to halt. Real World Production Usage of Raft Etcd uses Raft to manage a highly-available replicated log — utilized primarily in Kubernetes cluster for configuration management.Neo4j uses Raft to ensure consistency and safety.Apache Kafka Raft (KRaft) uses Raft for metadata management. In the recent version KRaft replaced Apache Zookeeper in Kafka.Camunda uses the Raft consensus algorithm for data replication. Raft vs. Paxos Raft was introduced to make consensus easier to understand and implement compared to Paxos. While Paxos is theoretically robust, it’s notoriously complex, making it hard for engineers to build reliable systems from it. Raft simplifies the process by breaking consensus into clear steps — leader election, log replication, and safety — without sacrificing correctness. This clarity makes Raft more approachable for real-world distributed systems. When to Choose Raft – Useful when building new distributed systems where clarity, maintainability, and developer adoption matter (e.g., databases, coordination services).Paxos – Useful in academic or highly specialized systems where theoretical rigor is prioritized over ease of implementation. In practice, Raft is usually the better choice for modern engineering teams because it balances correctness with simplicity. Future Trends in Consensus Future consensus algorithms are moving beyond leader-based models like Raft and Paxos. A key trend is leaderless consensus, where no single node coordinates decisions. Instead, all nodes collaborate equally, reducing the risk of a single point of failure. This makes systems more resilient and fair, especially in global networks where reliability is critical. For example, in blockchain or distributed databases, leaderless designs help ensure trust and consistency without relying on one “boss” node. Another trend is scalability-focused consensus, which aims to cut down communication overhead. As systems grow to thousands of nodes, traditional methods struggle with efficiency. New protocols are exploring ways to minimize message exchanges while still guaranteeing agreement. Also hybrid approaches are explored combining leaderless designs with probabilistic or quorum-based methods. These balance speed and fault tolerance, making them suitable for high-performance applications. Finally, energy-efficient consensus is gaining attention, especially in blockchain, where proof-of-work is costly. Future algorithms will likely emphasize greener, lightweight mechanisms. Consensus is evolving toward fairness, scalability, and sustainability — ensuring distributed systems can handle global scale without sacrificing reliability. Conclusion Raft simplifies the complex world of distributed consensus by breaking it into clear steps — leader election, log replication, and safety guarantees. While engineers may not encounter Raft every day, understanding it is essential when making architectural or design decisions for systems that demand reliability and consistency. Raft ensures that clusters agree on shared state even in the face of failures, though it comes with trade‑offs like leader bottlenecks and limited scalability. Its adoption in tools such as etcd, Kafka, and Neo4j shows its practical importance. Compared to Paxos, Raft is easier to grasp and implement, making it a strong foundation for modern distributed systems. As consensus evolves toward leaderless and scalable designs, Raft remains a critical concept every architect should be aware of when shaping resilient, fault‑tolerant solutions. References and Further Reading ConsensusRaft AlgorithmRaft (GitHub)Designing Data-Intensive ApplicationsPatterns of Distributed Systems

By Ammar Husain

CORE

Implementing Sharding in PostgreSQL: A Comprehensive Guide

As applications scale and data volumes increase, efficiently managing large datasets becomes a core requirement. Sharding is a common approach used to achieve horizontal scalability by splitting a database into smaller, independent units known as shards. Each shard holds a portion of the overall data, making it easier to scale storage and workload across multiple servers. PostgreSQL, as a mature and feature-rich relational database, offers several ways to implement sharding. These approaches allow systems to handle high data volumes while maintaining performance, reliability, and operational stability. This guide explains how sharding can be implemented in PostgreSQL using practical examples and clear, step-by-step instructions. In a sharded setup, table data is distributed across multiple nodes based on a chosen sharding key. For instance, a customer table may be split by region or customer_id, with each shard storing a specific subset of records. The primary challenge lies in routing queries and transactions to the correct shard while preserving data consistency and application transparency. PostgreSQL supports sharding through built-in features such as postgres_fdw and table partitioning, as well as extensions like Citus for more advanced and large-scale deployments. Setting Up Sharding in PostgreSQL To demonstrate the approach, consider a scenario in which sharding is implemented for a Sales table. In this example, sales data is distributed across multiple regions using region_id as the sharding key. Each region is assigned its own shard, allowing the data to be spread across multiple databases while keeping it logically organized. The configuration involves creating individual shards, setting up PostgreSQL to handle data distribution, and ensuring that queries are routed to the correct shard. The process begins with the base PostgreSQL setup. PostgreSQL should be installed on all required systems. A primary database is then created, which the application connects to directly. This database acts as the coordinator node, responsible for directing queries to the appropriate regional shards based on the sharding logic. SQL -- Step 1: Create the main database CREATE DATABASE sales_db; -- Step 2: Connect to the main database \c sales_db Once connected, create a schema that defines the structure of the sales table. Instead of creating a single monolithic table, define the schema without immediately populating it with data. Instead, shards will be created as partitions, with data distributed across them based on regions. SQL -- Step 3: Define the Sales table schema CREATE TABLE sales ( sale_id SERIAL PRIMARY KEY, region_id INT NOT NULL, sale_amount DECIMAL(10, 2), sale_date DATE NOT NULL ) PARTITION BY LIST (region_id); The PARTITION BY LIST clause specifies how region_id determines data placement. For each region, a partition (a shard) will be created. For example, if you have three regions, you might create separate shards as follows: SQL -- Step 4: Create individual shards for each region CREATE TABLE sales_region_1 PARTITION OF sales FOR VALUES IN (1); CREATE TABLE sales_region_2 PARTITION OF sales FOR VALUES IN (2); CREATE TABLE sales_region_3 PARTITION OF sales FOR VALUES IN (3); In this example, the sales_region_1 table will store all records where region_id = 1, while sales_region_2 will store data for region_id = 2, and so on. Each shard can be hosted on a different PostgreSQL server to provide scalability. Configuring Foreign Data Wrappers for Distributed Shards To enable distributed sharding, use PostgreSQL’s postgres_fdw extension. This extension allows you to connect to remote PostgreSQL instances and treat them as part of the database, enabling efficient queries across shards. Install the extension and configure it as follows: SQL -- Step 5: Enable the postgres_fdw extension CREATE EXTENSION IF NOT EXISTS postgres_fdw; -- Step 6: Create a foreign server for each shard CREATE SERVER shard_1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'shard1_host', dbname 'shard1_db', port '5432'); CREATE SERVER shard_2 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'shard2_host', dbname 'shard2_db', port '5432'); CREATE SERVER shard_3 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'shard3_host', dbname 'shard3_db', port '5432'); -- Step 7: Create user mappings for each server CREATE USER MAPPING FOR CURRENT_USER SERVER shard_1 OPTIONS (user 'postgres', password 'password'); CREATE USER MAPPING FOR CURRENT_USER SERVER shard_2 OPTIONS (user 'postgres', password 'password'); CREATE USER MAPPING FOR CURRENT_USER SERVER shard_3 OPTIONS (user 'postgres', password 'password'); Now associate each shard (partition) with its corresponding remote server using foreign tables. This allows PostgreSQL to route queries to the appropriate server. SQL -- Step 8: Import foreign schemas for each shard CREATE FOREIGN TABLE sales_region_1 ( sale_id SERIAL, region_id INT, sale_amount DECIMAL(10, 2), sale_date DATE ) SERVER shard_1 OPTIONS (schema_name 'public', table_name 'sales_region_1'); CREATE FOREIGN TABLE sales_region_2 ( sale_id SERIAL, region_id INT, sale_amount DECIMAL(10, 2), sale_date DATE ) SERVER shard_2 OPTIONS (schema_name 'public', table_name 'sales_region_2'); CREATE FOREIGN TABLE sales_region_3 ( sale_id SERIAL, region_id INT, sale_amount DECIMAL(10, 2), sale_date DATE ) SERVER shard_3 OPTIONS (schema_name 'public', table_name 'sales_region_3'); Testing the Sharding Setup After setting up the shards, test the configuration by inserting data into the sales table and verifying that it is correctly routed to the appropriate shard. SQL -- Insert data into the main sales table INSERT INTO sales (region_id, sale_amount, sale_date) VALUES (1, 100.50, '2023-10-01'); INSERT INTO sales (region_id, sale_amount, sale_date) VALUES (2, 200.75, '2023-10-02'); INSERT INTO sales (region_id, sale_amount, sale_date) VALUES (3, 300.25, '2023-10-03'); -- Verify that data is stored in respective shards SELECT FROM sales_region_1; SELECT FROM sales_region_2; SELECT * FROM sales_region_3; Each query above should retrieve the respective rows routed to the appropriate shard. This confirms that the sharding setup is functioning correctly. Querying and Maintaining Sharded Data PostgreSQL ensures that queries to the sales table are automatically redirected to the appropriate shard based on the region_id value. Complex queries, such as aggregations across all regions, are also supported, as PostgreSQL can parallelize query execution across shards using postgres_fdw. SQL -- Example: Aggregated sales across all shards SELECT SUM(sale_amount) AS total_sales FROM sales WHERE sale_date >= '2023-10-01'; Maintenance tasks, such as adding a new shard for additional regions, can be managed seamlessly by creating new partitions and foreign table mappings as required. For example, a new region (region_id = 4) can be supported by adding a new shard: SQL -- Add a new shard CREATE TABLE sales_region_4 PARTITION OF sales FOR VALUES IN (4); CREATE SERVER shard_4 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'shard4_host', dbname 'shard4_db', port '5432'); CREATE USER MAPPING FOR CURRENT_USER SERVER shard_4 OPTIONS (user 'postgres', password 'password'); CREATE FOREIGN TABLE sales_region_4 ( sale_id SERIAL, region_id INT, sale_amount DECIMAL(10, 2), sale_date DATE ) SERVER shard_4 OPTIONS (schema_name 'public', table_name 'sales_region_4'); Conclusion Sharding in PostgreSQL provides a practical way to achieve horizontal scalability, particularly for large and growing datasets in distributed environments. By using built-in features such as postgres_fdw and partitioning, PostgreSQL can execute queries across shards transparently, without requiring complex logic in the application layer. This guide has walked through a step-by-step approach to implementing sharding for a table with uneven data distribution, using practical examples to demonstrate how PostgreSQL can be scaled to support high-performance, data-intensive applications.

By arvind toorpu

CORE