DZone Spotlight

Tuesday, February 3 View All Articles »

Ralph Wiggum Ships Code While You Sleep. Agile Asks: Should It?

By Stefan Wolpers

CORE

TL; DR: When Code Is Cheap, Discipline Must Come from Somewhere Else Generative AI removes the natural constraint that expensive engineers imposed on software development. When building costs almost nothing, the question shifts from “can we build it?” to “should we build it?” The Agile Manifesto’s principles provide the discipline that these costs are used to enforce. Ignore them at your peril when Ralph Wiggum meets Agile. The Nonsense About AI and Agile Your LinkedIn feed is full of confident nonsense about Scrum and AI. One camp sprinkles "AI-powered" onto Scrum practices like seasoning. They promise that AI will make your Daily Scrum more efficient, your Sprint Planning more accurate, and your Retrospectives more insightful. They have no idea what Scrum is actually for, and AI amplifies their confusion, now more confidently presented. (Dunning-Kruger as a service, so to speak.) The other camp declares Scrum obsolete. AI agents and vibe coding/engineering will render iterative frameworks unnecessary, they claim, because software creation will happen while you sleep at zero marginal cost. Scrum, in their telling, is rigid dogma unfit for a world of autonomous code generation; a relic in the new world of Ralph Wiggum-style AI development. Both camps miss the point entirely. The Expense Gate Ralph Wiggum Eliminates For decades, software development had a natural constraint: engineers were expensive. A team of five developers costs $750,000 or more annually, fully loaded. That expense imposed discipline. You could not afford to build the wrong thing. Every feature required justification. Every iteration demanded focus. The cost was a gate. It forced product decisions. Generative AI removes that gate. Code generation approaches zero marginal cost. Tools like Cursor, Claude, and Codex produce working code in minutes. Vibe coding turns product ideas into functioning prototypes before lunch. The trend is accelerating. Consider the "Ralph Wiggum" technique now circulating on tech Twitter and LinkedIn: an autonomous loop that keeps AI coding agents working for hours without human intervention. You define a task, walk away, and return to find completed features, passing tests, and committed code. The promise is seductive: continuous, autonomous development in which AI iterates on its own work until completion. Geoffrey Huntley, the technique's creator, ran such a loop for three consecutive months to produce a functioning programming language compiler. [1] Unsurprisingly, the marketing writes itself: "Ship code while you sleep." But notice what disappears in this model: Human judgment about what is worth building. Review cycles that catch architectural mistakes. The friction that forces teams to ask whether a feature deserves to exist. As one practitioner observed about these autonomous loops: "A human might commit once or twice a day. Ralph can pile dozens of commits into a repo in hours. If those commits are low quality, entropy compounds fast." [2] The expense gate is gone. The abundance feels liberating. It is also dangerous. Without the expense gate, what prevents teams from running in the wrong direction faster than ever? What stops organizations from generating mountains of features that nobody wants? What enforces the discipline that cost used to provide? The Principles Provide the Discipline The answer is exactly what the Agile Manifesto was designed to provide. Start with the first value: "Working software over comprehensive documentation." In an AI world, generating documentation is trivial. Generating working software is trivial. But generating working software that solves actual customer problems remains hard. The emphasis on "working" was never about the code compiling. It was about the software doing something useful. That distinction matters more now, not less. Then there is simplicity: "the art of maximizing the amount of work not done." When engineers cost $150K annually, leaving out features of questionable value saved money. Now that building costs almost nothing, leaving features out requires discipline rather than economics. The product person who asks "should we build this?" instead of "can we build this?" becomes more valuable, not less. "Working software is the primary measure of progress." AI can generate a thousand lines of code per hour. None of those represents progress itself. Instead, progress is measured by working software in users' hands who find it useful. Customer collaboration and feedback loops provide that measurement. Output velocity without validation is a waste at unprecedented scale. And then technical excellence: "Continuous attention to technical excellence and good design enhances agility." This principle now separates survival from failure. The Technical Debt Trap Autonomous AI development produces code that works well enough to ship. The AI generates plausible implementations that pass tests and satisfy immediate requirements. Six months later, the same team discovers the horror beneath the surface. You build it, you ship it, you run it. And now you maintain it. This is "artificial" technical debt compounding at unprecedented rates. The Agile Manifesto called for "sustainable development" and for teams to maintain "a constant pace indefinitely." These were not bureaucratic overhead invented by process enthusiasts. They were survival requirements learned through painful experience. Organizations that abandon these principles because AI makes coding cheap will discover a familiar pattern: initial velocity followed by grinding slowdown. The code that was so easy to generate becomes impossible to maintain. The features that shipped so quickly become liabilities that cannot be safely modified. Technical excellence is not optional in an AI world. It is the difference between a product and a pile of unmaintainable code. The "Should We Build It" Reframe The fundamental question of product development has always been: are we building the right thing? When building was expensive, the expense itself forced that question. Teams could not afford to build everything, so they had to choose. Product people had to prioritize ruthlessly. Stakeholders had to make tradeoffs. Now that building is cheap, the forcing function is gone. Organizations can build everything. Or at least they think they can. The pressure compounds from above. Management and stakeholders are increasingly factoring in faster product delivery enabled by AI capabilities. Late changes that once required difficult conversations now seem costless. Prototypes that once took weeks can appear in hours. The expectation becomes: if AI can build it faster, why are we not shipping more? This pressure makes disciplined product thinking harder precisely when it matters most. The Agile Manifesto's emphasis on "customer collaboration" and "responding to change" exists precisely because requirements emerge through discovery, not specification. Feedback loops with real users matter more when teams can produce working software faster. Without those loops, teams generate features in a vacuum, disconnected from the people who must find them valuable. The product person who masters this discipline becomes irreplaceable. The product person who treats the backlog as a parking lot for every idea becomes a liability at scale, approving AI-generated waste faster than ever before. What Stays, What Changes in the Age of Ralph Wiggum & Agile The core feedback loops remain essential: build something small, show it to users, learn from the response, adapt. That rhythm predates any framework. It will outlast whatever comes next. Iteration cycles may compress. If teams can produce meaningful working software in days rather than weeks, shorter cycles make sense. The principle remains: deliver working software frequently. The specific cadence adapts to capability. The challenge function becomes more critical, not less. In effective teams, Developers have always pushed back on product suggestions: "Is this really the most valuable thing we can build to solve our customers' problems?" This tension is healthy. Life is a negotiation, and so is Agile. When AI can generate implementation options in minutes, this challenge function becomes the primary source of discipline. The question shifts from "how long will this take?" to "should we build this at all?" and "how will we know it works?" Customer feedback loops matter more when velocity increases. These loops have always been about closing the gap between what teams build and what customers need, inspecting progress toward meaningful outcomes, and adapting the path when reality contradicts assumptions. When teams can produce more working software faster, these checkpoints become sharper. The question shifts from "look what we built" to "based on what we learned, what should we build next?" Daily coordination adapts in form, not purpose. The goal remains: inspect progress and adapt the plan. Standing in a circle reciting yesterday's tasks has always been useless compared to answering: are we still on track, and what is blocking us? Now, it becomes critical: faster implementation cycles make frequent synchronization more important, not less. Technical discipline becomes survival, not overhead. The harder problem is helping teams maintain quality standards when shipping is frictionless. Practitioners who can spot AI-generated code smell, who insist on meaningful review, who protect quality definitions from erosion under delivery pressure: these people become more valuable. Those who focus primarily on the "process," delivered dogmatically, become redundant. Product accountability becomes the constraint, and that is correct. When implementation is cheap, product decisions become the bottleneck. The person who can rapidly validate assumptions, say no to plausible but valueless features, and maintain focus becomes the team's most critical asset. These are adaptations, not abandonment. The principles survive because they address a permanent problem: building software that solves customer problems in complex environments. AI changes the cost structure. It does not change the problem. We Are Not Paid to Practice Scrum I have said this before, and it applies directly here: we are not paid to practice Scrum. We are paid to solve our customers' problems within the given constraints while contributing to the organization's sustainability. Full disclosure: I earn part of my living training people in Scrum. I have skin in this game. But the game only matters if Scrum actually helps teams deliver value. If Scrum helps accomplish your goals, use Scrum. If parts of Scrum no longer serve that goal in your context, adapt. The Scrum Guide itself says Scrum is a framework, not a methodology. It is intentionally incomplete. The "Scrum is obsolete" camp attacks a caricature: rigid ceremonies enforced dogmatically without regard for outcomes. That caricature exists in some organizations. It is not Scrum. It is a bad implementation that the Agile Manifesto warned against in its first value: "Individuals and interactions over processes and tools." The question is not whether to practice Agile by the book. The question is whether your team has the feedback loops, the discipline, and the customer focus to avoid building the wrong thing at AI speed. If you have those things without calling them Agile, fine. Call it whatever you want. The labels do not matter. The outcomes do. If you lack those things, AI will not save you. It will accelerate your failure. Conclusion: Do Not Outsource Your Thinking The tools have changed. The fundamental challenge has not. Building software that customers find valuable, in complex environments where requirements emerge through discovery rather than specification, remains hard. The expense gate is gone, but the need for discipline remains. The Agile Manifesto's principles provide that discipline. They are not relics of a pre-AI world. They are the antidote to AI-accelerated waste. Do not outsource your thinking to AI. The ability to generate code instantly does not answer the question that matters. Just because you could build it, should you? What discipline has replaced the expense gate in your organization? Or has nothing replaced it yet? I am curious. Ralph Wiggum and Agile: The Sources Ralph Wiggum: Autonomous Loops for Claude Code11 Tips For AI Coding With Ralph Wiggum More

ToolOrchestra vs Mixture of Experts: Routing Intelligence at Scale

By Vidyasagar (Sarath Chandra) Machupalli FBCS

CORE

Last year, I came across Mixture of Experts (MoE) through this research paper published in Nature. Later in 2025, Nvidia published a research paper on ToolOrchestra. While reading the paper, I kept thinking about MoE and how ToolOrchestra is similar to or different from it. In this article, you will learn about two fundamental architectural patterns reshaping how we build intelligent systems. We'll explore ToolOrchestra and Mixture of Experts (MoE), understand their inner workings, compare them with other routing-based architectures, and discover how they can work together. What Is Mixture of Experts? Simply put, Mixture of Experts is an architectural pattern that splits a large model into multiple specialized sub-networks called experts. Instead of one monolithic model handling every input, you activate only the experts needed for each specific task. The concept dates back to 1991 with the paper "Adaptive Mixture of Local Experts." The core idea is straightforward: route each input to the most suitable expert, activate only what you need, and keep the rest idle. Mixture of Experts How MoE Works In transformer models, MoE layers typically replace the feedforward layers. These feedforward layers consume most of the compute as models scale. Replace them with MoE, and you get massive efficiency gains. Key components: Gating network – Decides which experts process which tokensExperts – Specialized sub-networks (typically feedforward networks)Load balancing – Ensures no single expert gets overwhelmedSparse activation – Only activates selected experts per token Routing strategies: StrategyDescriptionExample ModelTop-1Each token goes to one expertSwitch TransformerTop-2Each token goes to two expertsGShard, Mixtral 8x7BExpert ChoiceExperts select tokensExpert Choice RoutingSoft RoutingWeighted combination of all expertsSoft MoE What Is ToolOrchestra? ToolOrchestra, introduced by NVIDIA researchers in November 2025, takes a different approach. Instead of splitting one model into parts, it uses a small 8-billion-parameter model to coordinate multiple complete models and tools. Think of it as a conductor leading an orchestra. The orchestrator model analyzes a problem, breaks it down, and calls different "instruments" to solve each piece. ToolOrchestra flow How ToolOrchestra Works The breakthrough is in how it learns to orchestrate. ToolOrchestra uses reinforcement learning with three reward types: Reward structure: Reward TypePurposeFocusOutcomeGetting the right answerCorrectnessEfficiencyUsing cheaper tools when possibleCost optimizationPreferenceRespecting user tool preferencesUser control The training uses a synthetic data pipeline called ToolScale. It automatically generates databases, API schemas, and complex tasks with verified solutions. This gives the orchestrator thousands of examples to learn from through trial and error. Core Differences Let me break down the fundamental differences between these two approaches: AspectMixture of ExpertsToolOrchestraGranularityToken-level routingTask-level routingScopeWithin a single modelAcross multiple systemsComponentsSub-networks (experts)Complete models and toolsTrainingJoint training of all expertsOnly orchestrator trainsActivationSparse parameter activationSelective system invocationMemoryAll experts in memoryTools loaded on demandExternal AccessNo external toolsWeb, APIs, databases The fundamental difference is in what gets split up. MoE splits a single model's parameters into specialized sub-networks. All experts live inside one model architecture, sharing the same input and working on the same task at the parameter level. ToolOrchestra splits tasks across different complete systems. The orchestrator is a small, standalone model that coordinates other models and tools. Each tool or model it calls is fully independent, potentially running on different hardware, using different architectures, and even created by different companies. ToolOrchestra vs. MoEs Commonalities and Shared Principles Both architectures attack the same problem: inefficiency. Running a massive model for every task wastes compute and money. MoE and ToolOrchestra both use sparsity and specialization to avoid this waste. Shared design principles: Routing as a core mechanism – Both use learned routing to direct inputs to the right specialistsModularity – Break down monolithic systems into specialized componentsSparsity – Activate only what you need for each inputAutomatic learning – Routing policies are learned, not hardcodedSpecialization over generalization – Focused experts outperform generalists on specific tasks Related Architectural Methods Several other patterns fit into this landscape of modular, routing-based intelligence. Let me walk you through the key ones. Before diving into the specific architectures, I want to mention that I've written extensively about AI infrastructure and optimization techniques. If you're interested in understanding how these architectural patterns work in production environments, check out my article on NVIDIA MIG with GPU Optimization in Kubernetes, which covers how GPU partitioning works similarly to expert routing in MoE systems. Ensemble Methods Ensemble learning combines predictions from multiple models. Unlike MoE, where routing is learned, ensembles often use simpler combination strategies. Common ensemble techniques: TechniqueHow It WorksBest ForBaggingTrain on different data subsetsReducing varianceVotingMajority vote or averagingClassification tasksStackingMeta-learner combines predictionsComplex problemsWeighted AverageLearned weights for each modelRegression tasks Ensemble method architecture The key difference from MoE is that ensemble methods typically run all models for every input, then combine results. MoE only activates selected experts. Ensemble methods are simpler to implement but less efficient. Capsule Networks With Dynamic Routing Capsule Networks, introduced by Geoffrey Hinton in 2017, use a routing mechanism called "routing-by-agreement." While different from MoE and orchestration, capsules share the idea of learned routing. How capsule routing works: Instead of routing tokens to experts, capsules route outputs to higher-level capsules based on agreement. Lower-level capsules send their output vectors to higher-level capsules that "agree" with their predictions. Capsule routing Key concepts: Capsules as vectors: Unlike neurons that output scalars, capsules output vectors. The length represents probability, the direction represents properties.Dynamic routing: Iteratively updates routing coefficients based on agreement between predictionsSpatial relationships: Better at understanding part-whole relationships in images FeatureTraditional CNNCapsule NetworkOutputScalar activationsVector capsulesRoutingFixed poolingDynamic routingSpatial InfoLost through poolingPreserved in vectorsIterationsNone3-5 routing iterations Multi-Agent Neural Systems These architectures organize intelligence as multiple cooperating agents. Each agent is a specialized neural network that communicates with others. Agent-based architectures: Modular Graph Neural Networks (ModGNN) – Agents communicate through graph structures for multi-agent coordinationNeural Agent Networks (NAN) – Distributed systems where agents act like neuronsAgentic Neural Networks – Self-evolving systems that optimize both structure and prompts Multi-agent systems Comparison with MoE and Orchestration: ArchitectureCommunicationIndependenceTrainingMoEThrough gatingSub-networksJointToolOrchestraThrough orchestratorFully independentSeparateMulti-AgentPeer-to-peerSemi-independentCan be joint or separate Hierarchical Mixture of Experts HMoE adds layers to the basic MoE pattern. First-level routing decides broad categories, then second-level routing picks specific experts. HMoE routing This pattern is similar to ToolOrchestra's hierarchical potential. Both can build multi-level routing systems. Retrieval-Augmented Generation (RAG) RAG combines language models with retrieval systems. Before generating, the system searches a database for relevant information. RAG architecture: ComponentPurposeSimilar ToQuery EncoderTransform inputMoE gatingRetrieverFind relevant docsTool selectionReader/GeneratorProduce answerExpert activation RAG is closer to orchestration than MoE. The retriever acts like a tool, and the generator coordinates between input and retrieved knowledge. Compound AI Systems Compound AI Systems, as defined by Berkeley AI Research, tackle tasks using multiple interacting components. This is the broad category that includes both orchestration and some ensemble methods. I recently wrote about the Model Context Protocol (MCP), which is Anthropic's approach to standardizing how AI systems connect with external data sources. MCP represents a compound AI pattern where models orchestrate access to various data sources through a universal protocol. The principles align closely with ToolOrchestra's approach to coordinating multiple tools. Characteristics: Multiple model calls in sequence or parallelExternal tools (databases, APIs, code execution)Retrieval and generation combinedMulti-step reasoning chains Examples: Chain-of-Thought systems – Break problems into reasoning stepsReAct (Reasoning + Acting) – Combine reasoning with tool useAutoGPT-style agents – Autonomous task decomposition and execution Neural-Symbolic Architectures These systems integrate neural networks with symbolic reasoning. The neural part handles pattern recognition, the symbolic part handles logical reasoning. Layered architecture: LayerTypeFunctionPerceptionNeuralProcess sensory inputReasoningSymbolicApply logical rulesPlanningHybridCombine both approaches This is similar to orchestration, where different tools have different capabilities. The routing decides whether to use neural or symbolic processing. Comparison Table: All Architectures ArchitectureRouting TypeComponentsTrainingBest Use CaseMoEToken-level, learnedSub-networksJointParameter efficiencyToolOrchestraTask-level, learnedIndependent systemsOrchestrator onlyFlexible compositionEnsembleNo routing / simpleIndependent modelsSeparateReducing varianceCapsule NetworksAgreement-basedVector capsulesJointSpatial relationshipsMulti-AgentPeer communicationAutonomous agentsJoint or separateComplex coordinationHMoEMulti-level, learnedHierarchical expertsJointNested specializationRAGQuery-basedRetriever + GeneratorSeparateKnowledge groundingCompound AIMulti-stepChains of componentsMixedComplex workflows Implementation Considerations When building systems with these architectures, keep these points in mind: When to use MoE: Training massive models from scratchNeed parameter efficiency at inferenceSingle model deployment preferredHave expertise in distributed training When to use ToolOrchestra: Building applications with existing modelsNeed to swap components frequentlyWant to use external tools (APIs, databases)Prefer faster iteration and easier maintenance When to use ensembles: Have multiple trained modelsWant simple implementationCan afford running all modelsNeed variance reduction When to use multi-agent: Complex coordination neededAgents should learn from each otherReal-time communication requiredDistributed decision making For those interested in the infrastructure side of deploying these architectures, I've written several articles that might help. My piece on Multizone Kubernetes and VPC Load Balancer Setup shows how to deploy distributed systems across zones, which is similar to how you'd deploy multiple experts or orchestrated models. I've also published guides on DZone about cloud infrastructure and automation that apply directly to deploying these kinds of systems. Conclusion The trend is clear: break things into specialized components, learn to route intelligently, activate only what you need. Whether that specialization happens inside a model through MoE, across models through orchestration, or through multi-agent coordination, the principle holds. Emerging patterns: Multi-level routing – Orchestration at application level, MoE at model level, capsule-like routing for spatial featuresDynamic expert creation – Models that spawn new experts as needed based on task distributionCross-architecture routing – Systems that route between fundamentally different architectures (transformers, RNNs, symbolic systems)Learned cost functions – Systems that optimize for user-specific cost/quality tradeoffsFederated orchestration – Orchestrators coordinating models across different organizations Research directions: Better routing algorithms that generalize across domainsAutomatic architecture search for routing patternsEfficient training methods for sparse systemsTheoretical understanding of when routing helpsCombining symbolic and neural routing Key Takeaways If you've made it this far, here's what you should remember: MoE splits parameters, ToolOrchestra splits tasks, Ensembles split predictions, and multi-agents split responsibilities. All use routing, but at different levels and for different purposes.They complement each other. An orchestrator can coordinate MoE models. Capsule networks can use MoE-style experts. Multi-agent systems can use orchestration for high-level coordination. The combinations are endless.The future is modular. Neither approach alone is the answer. The next generation of systems will use multiple levels of routing and specialization working together.Start small, scale up. You don't need to build everything at once. Start with simple routing logic. Add specialized components. Layer on complexity as you learn what works. This isn't just about saving compute. It's about making intelligence more accessible, more controllable, and more aligned with how we actually want to use it. Breaking monolithic systems into specialized, coordinated components is how we'll build the next generation of AI. For those just starting their AI journey, my Awesome-AI guide provides a comprehensive roadmap for mastering machine learning and deep learning, which forms the foundation for understanding these advanced architectures. Further reading: ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration (Su et al., NVIDIA, November 2025) — The foundational paper introducing orchestration with RLSwitch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Fedus et al., Google Research, 2021) — Pioneering work on MoE scalingDynamic Routing Between Capsules (Sabour et al., Hinton, 2017) — Introduction to capsule networks and routing-by-agreementThe Shift from Models to Compound AI Systems (Zaharia et al., Berkeley AI Research, 2024) — Defining compound AI systemsCompound AI Systems Optimization: A Survey (Lee et al., 2025) — Comprehensive survey of optimization methodsOptimizing Model Selection for Compound AI Systems (Chen et al., Stanford/Berkeley, 2025) — LLMSelector frameworkTowards Resource-Efficient Compound AI Systems (Chaudhry et al., 2025) — Resource optimization approaches More

Trend Report

Database Systems

Every organization is now in the business of data, but they must keep up as database capabilities and the purposes they serve continue to evolve. Systems once defined by rows and tables now span regions and clouds, requiring a balance between transactional speed and analytical depth, as well as integration of relational, document, and vector models into a single, multi-model design. At the same time, AI has become both a consumer and a partner that embeds meaning into queries while optimizing the very systems that execute them. These transformations blur the lines between transactional and analytical, centralized and distributed, human driven and machine assisted. Amidst all this change, databases must still meet what are now considered baseline expectations: scalability, flexibility, security and compliance, observability, and automation. With the stakes higher than ever, it is clear that for organizations to adapt and grow successfully, databases must be hardened for resilience, performance, and intelligence. In the 2025 Database Systems Trend Report, DZone takes a pulse check on database adoption and innovation, ecosystem trends, tool usage, strategies, and more — all with the goal for practitioners and leaders alike to reorient our collective understanding of how old models and new paradigms are converging to define what’s next for data management and storage.

Refcard #397

Secrets Management Core Practices

By Apostolos Giannakidis

CORE

Refcard #375

Cloud-Native Application Security Patterns and Anti-Patterns

By Samir Behara

Essential Techniques for Production Vector Search Systems, Part 3: Filterable HNSW

After implementing vector search systems at multiple companies, I wanted to document efficient techniques that can be very helpful for successful production deployments of vector search systems. I want to present these techniques by showcasing when to apply each one, how they complement each other, and the trade-offs they introduce. This will be a multi-part series that introduces all of the techniques one by one in each article. I have also included code snippets to quickly test each technique. Before we get into the real details, let us look at the prerequisites and setup. For ease of understanding and use, I am using the free cloud tier from Qdrant for all of the demonstrations below. Steps to Set Up Qdrant Cloud Step 1: Get a Free Qdrant Cloud Cluster Sign up at https://cloud.qdrant.io.Create a free cluster Click "Create Cluster."Select Free Tier.Choose a region closest to you.Wait for the cluster to be provisioned.Capture your credentials. Cluster URL: https://xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.us-east.aws.cloud.qdrant.io:6333.API Key: Click "API Keys" → "Generate" → Copy the key. Step 2: Install Python Dependencies PowerShell pip install qdrant-client fastembed numpy Recommended versions: qdrant-client >= 1.7.0fastembed >= 0.2.0numpy >= 1.24.0python-dotenv >= 1.0.0 Step 3: Set Environment Variables or Create a .env File PowerShell # Add to your ~/.bashrc or ~/.zshrc export QDRANT_URL="https://your-cluster-url.cloud.qdrant.io:6333" export QDRANT_API_KEY="your-api-key-here" Create a .env file in the project directory with the following content. Remember to add .env to your .gitignore to avoid committing credentials. PowerShell # .env file QDRANT_URL=https://your-cluster-url.cloud.qdrant.io:6333 QDRANT_API_KEY=your-api-key-here Step 4: Verify Connection We can verify the connection to the Qdrant collection with the following script. From this point onward, I am assuming the .env setup is complete. Python from qdrant_client import QdrantClient from dotenv import load_dotenv import os # Load environment variables from .env file load_dotenv() # Initialize client client = QdrantClient( url=os.getenv("QDRANT_URL"), api_key=os.getenv("QDRANT_API_KEY"), ) # Test connection try: collections = client.get_collections() print(f" Connected successfully!") print(f" Current collections: {len(collections.collections)}") except Exception as e: print(f" Connection failed: {e}") print(" Check your .env file has QDRANT_URL and QDRANT_API_KEY") Expected output: Plain Text python verify-connection.py Connected successfully! Current collections: 2 Now that we have the setup out of the way, we can get into the meat of the article. Before the deep dive into filterable HNSW, let us look at a high-level overview of the techniques we are about to cover in this multi-part series. Techniqueproblems solvedperformance impactcomplexityHybrid SearchWe will miss exact matches if we employ semantic search purely.Huge increase in the accuracy, closer to 16%MediumBinary QuantizationMemory costs scale linearly with data.40X memory reduction, 15% fasterLowFilterable HNSWNot a good practice to apply post-filtering as it wastes computation.5X faster filtered queriesMediumMulti Vector SearchA single embedding will not be able to capture the importance of various fields.Handles queries from multiple fields, such as title vs description, and requires two times more storage.MediumRerankingOptimized vector search for speed over precision.Deeper semantic understanding, 15-20% ranking improvementHigh Keep in mind that production systems typically combine two to four of these techniques. For example, a typical e-commerce website might use hybrid search, binary quantization, and filterable HNSW. We covered Hybrid Search in the first part of the series and Binary Quantization in the second part. In this part, we will dive into filterable HNSW. Filterable HNSW To understand how filterable HNSW is advantageous, let us look at how traditional filtering approaches, whether pre- or post-filter, waste computation. Post-filtering discards 90% of retrieved results, whereas pre-filtering reduces the search space so much that vector similarity becomes less significant. That is where filterable HNSW comes in handy, as it applies filters during the HNSW graph traversal. In other words, the algorithm navigates only through graph nodes that satisfy filter conditions. With components such as payload indexes (fast lookup structures for filterable fields), filter-aware traversal (HNSW navigation skips non-matching nodes), and dynamic candidate expansion (automatically fetch more candidates when filters are restrictive), the filterable HNSW is the way to go. Let us take a look at it in more detail with the code below. Python """ Example usage of the filterable_hnsw module. This demonstrates how to use Filterable HNSW with your own Qdrant collection. """ from filterable_hnsw import ( filterable_search, compare_filtered_unfiltered, display_filtered_results, get_qdrant_client ) from dotenv import load_dotenv import os load_dotenv() # Initialize client client = get_qdrant_client() # Your collection name COLLECTION_NAME = "automotive_parts" # Change this to your collection name # Example 1: Filtered search print("=" * 80) print("EXAMPLE 1: Filtered Search (Filterable HNSW)") print("=" * 80) print("Searching: 'engine sensor' with category filter") print("Expected: Finds semantically similar parts within the specified category\n") query1 = "engine sensor" # First get unfiltered results to see what categories exist unfiltered_test1 = filterable_search( collection_name=COLLECTION_NAME, query=query1, filter_conditions=None, client=client, limit=1 ) # Extract category from first result if available if unfiltered_test1 and 'category' in unfiltered_test1[0]['payload']: actual_category1 = unfiltered_test1[0]['payload']['category'] filter1 = {"category": actual_category1} print(f"Using category from data: '{actual_category1}'\n") else: filter1 = {"category": "Engine Components"} # Fallback filtered_results = filterable_search( collection_name=COLLECTION_NAME, query=query1, filter_conditions=filter1, client=client, limit=5 ) display_filtered_results( filtered_results, query1, show_fields=['part_name', 'part_id', 'category', 'description'] ) print("\n\n") # Example 2: Comparison between Filterable HNSW and Post-Filtering print("=" * 80) print("EXAMPLE 2: Filterable HNSW vs Post-Filtering Comparison") print("=" * 80) print("Comparing filtering DURING traversal vs filtering AFTER retrieval") print("Expected: Shows Filterable HNSW is more efficient (no wasted computation)\n") query2 = "brake system" # First get unfiltered results to see what categories exist unfiltered_test2 = filterable_search( collection_name=COLLECTION_NAME, query=query2, filter_conditions=None, client=client, limit=1 ) # Extract category from first result if available if unfiltered_test2 and 'category' in unfiltered_test2[0]['payload']: actual_category2 = unfiltered_test2[0]['payload']['category'] filter2 = {"category": actual_category2} print(f"Using category from data: '{actual_category2}'\n") else: filter2 = {"category": "Braking System"} # Fallback comparison = compare_filtered_unfiltered( collection_name=COLLECTION_NAME, query=query2, filter_conditions=filter2, client=client, limit=5 ) print("\n\n") # Example 3: Display detailed comparison print("=" * 80) print("EXAMPLE 3: Detailed Result Comparison") print("=" * 80) print("Top results from both methods:\n") print("Post-Filtered Results (Top 3):") print("-" * 80) for i, result in enumerate(comparison["post_filtered"]["results"][:3], 1): payload = result["payload"] name = payload.get('part_name', payload.get('name', 'Unknown')) category = payload.get('category', 'N/A') print(f"{i}. {name}") print(f" Category: {category}") print(f" Score: {result['score']:.4f}") print(f" ID: {result['id']}") print("\nFilterable HNSW Results (Top 3):") print("-" * 80) for i, result in enumerate(comparison["filtered"]["results"][:3], 1): payload = result["payload"] name = payload.get('part_name', payload.get('name', 'Unknown')) category = payload.get('category', 'N/A') print(f"{i}. {name}") print(f" Category: {category}") print(f" Score: {result['score']:.4f}") print(f" ID: {result['id']}") print("\n" + "=" * 80) print("SUMMARY:") print("=" * 80) print("Filterable HNSW:") print(" - Filters DURING graph traversal (not before or after)") print(" - Only navigates through nodes that satisfy filter conditions") print(" - No wasted computation - doesn't retrieve then discard results") print(" - More efficient than post-filtering which wastes >90% computation") print(f" - In this example: {comparison['overlap_ratio']*100:.1f}% result overlap") Let us now look at the Filterable HNSW in action with the implementation output Plain Text ================================================================================ EXAMPLE 1: Filtered Search (Filterable HNSW) ================================================================================ Searching: 'engine sensor' with category filter Expected: Finds semantically similar parts within the specified category Using category from data: 'Safety Systems' Filtered Search Results for: 'engine sensor' ================================================================================ Found 5 results 1. Safety Sensor Module 237 Part_name: Safety Sensor Module 237 Part_id: DEL-0000237 Category: Safety Systems Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea... Score: 0.4092 -------------------------------------------------------------------------------- 2. Safety Sensor Module 240 Part_name: Safety Sensor Module 240 Part_id: BOS-0000240 Category: Safety Systems Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea... Score: 0.4052 -------------------------------------------------------------------------------- 3. Safety Sensor Module 242 Part_name: Safety Sensor Module 242 Part_id: VAL-0000242 Category: Safety Systems Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea... Score: 0.4004 -------------------------------------------------------------------------------- 4. Safety Sensor Module 246 Part_name: Safety Sensor Module 246 Part_id: CON-0000246 Category: Safety Systems Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea... Score: 0.3983 -------------------------------------------------------------------------------- 5. Safety Sensor Module 234 Part_name: Safety Sensor Module 234 Part_id: ZF-0000234 Category: Safety Systems Description: Advanced safety sensor for ADAS applications including collision avoidance and driver assistance fea... Score: 0.3978 -------------------------------------------------------------------------------- ================================================================================ EXAMPLE 2: Filterable HNSW vs Post-Filtering Comparison ================================================================================ Comparing filtering DURING traversal vs filtering AFTER retrieval Expected: Shows Filterable HNSW is more efficient (no wasted computation) Using category from data: 'Braking System' Comparing Filterable HNSW vs Post-Filtering for: 'brake system' Filters: {'category': 'Braking System'} ================================================================================ 1. Post-Filtering (Inefficient) Retrieves many results, then filters AFTER retrieval -------------------------------------------------------------------------------- 2. Filterable HNSW (Efficient) Filters DURING graph traversal - only navigates matching nodes -------------------------------------------------------------------------------- ================================================================================ COMPARISON SUMMARY ================================================================================ Post-Filtering (Traditional Approach): Time: 126.94 ms Results: 5 Approach: Retrieves 50 candidates, discards 45 Top Score: 0.6419 Filterable HNSW: Time: 79.26 ms Results: 5 Approach: Only navigates through nodes matching filter conditions Top Score: 0.6419 Overlap: Common Results: 5 / 5 (100.0%) Filterable HNSW is 1.60x faster Key Difference: Post-Filtering: Wastes computation by retrieving and discarding results Filterable HNSW: Filters during graph traversal - no wasted computation ================================================================================ ================================================================================ EXAMPLE 3: Detailed Result Comparison ================================================================================ Top results from both methods: Post-Filtered Results (Top 3): -------------------------------------------------------------------------------- 1. Brake Control Component 168 Category: Braking System Score: 0.6419 ID: 1794233379 2. Brake Control Component 154 Category: Braking System Score: 0.6396 ID: 3151300734 3. Brake Control Component 176 Category: Braking System Score: 0.6394 ID: 1517692434 Filterable HNSW Results (Top 3): -------------------------------------------------------------------------------- 1. Brake Control Component 168 Category: Braking System Score: 0.6419 ID: 1794233379 2. Brake Control Component 154 Category: Braking System Score: 0.6396 ID: 3151300734 3. Brake Control Component 176 Category: Braking System Score: 0.6394 ID: 1517692434 ================================================================================ SUMMARY: ================================================================================ Filterable HNSW: - Filters DURING graph traversal (not before or after) - Only navigates through nodes that satisfy filter conditions - No wasted computation - doesn't retrieve then discard results - More efficient than post-filtering which wastes >90% computation - In this example: 100.0% result overlap Benefits As you can clearly see from the results, filterable HNSW offers computational efficiency, achieving 1.6 times faster performance. There is also no wasted computation, as you can see from the results, post filtering retrieved 50 items and discarded 45 of them, whereas filterable HNSW only navigated nodes matching the "breaking system" category. The results are also guaranteed for good quality, as you can see from the overlap (all 5 results are identical between methods). Costs For us to be able to execute filterable HNSW, we have a payload index overhead in creating an index for the category, supplier, and in_stock field. For a million parts, we are looking at a minimum of 6% overhead. Also, we need to consider the maintenance aspect of it, as every new part indexed must update the payload indexes. Also to keep in mind is the fact that complex OR conditions may degrade performance on the filtering. Also, payload indexes are kept in RAM for faster access, so there is no need to account for this in capacity planning. When to Use When the results are frequently filteredWhen the filters are selective (reduce results by more than 50%)When the data has categorical/structured metadata When Not to Use When filters are rarely usedFilters are not selective (remove less than 20% of results)Very small datasets (less than 10,000 items Efficiency Comparison Approachcandidates retrievedresults returnedwasted workcpu efficiencyPost Filtering50545 (90%)10% EfficientFilterable HNSW550 (0%)100 % efficient Performance Characteristics Based on the results, let us now look at the performance characteristics Metricpost filteringfilterable hnswevidence from the dataQuery Latency126.94ms79.26ms1.6 times faster Wasted Computation90%0%No wasted computation by filterable HNSWResult Quality0.6419 (top score)0.6419 (top score)100% overlapMemory OverheadBaseline+5-10%Payload indexes for the categories and other fieldsScalabilityDegrades with SelectivityConstant PerformanceMore selective filter, bigger speedup for filterable HNSW Conclusion We have looked at the concept and also the results for filterable HNSW and concluded that the more selective the filters are, the better the output for the results. The bottom line is that if more than 30% of your queries use filterable HNSW, unlike the previous two techniques discussed in the series, filterable HNSW just gives pure gain and no overheads. In the next part of the series, we will look at multi-vector search and its advantages and disadvantages.

By Pavan Vemuri

CORE

5 Technical Strategies for Scaling SaaS Applications

Growing a business is every owner’s dream — until it comes to technical scaling. This is where challenges come to the surface. They can be related to technical debt, poor architecture, or infrastructure that can’t handle the load. In this article, I want to take a closer look at the pitfalls of popular SaaS scaling strategies, drawing from my personal experience. I’ll share lessons learned and suggest practices that can help you navigate these challenges more effectively. 1. Horizontal Application Scaling Horizontal scaling is usually the default strategy once an app reaches moderate traffic. Most SaaS apps run on cloud infrastructure, so spinning up extra instances via auto-scaling is easy. But in many cases, horizontal scaling alone is not enough. I worked on a SaaS project that provided real-time analytics dashboards for e-commerce stores. As we started scaling, the system ran into performance issues. The dashboards were making a lot of requests to the sales data, and the underlying database was reaching its CPU and I/O limits. Adding more app instances only generated more read requests, worsening the problem. To solve this, we combined horizontal scaling of the app servers with read replicas for the database and caching for frequently accessed dashboard data. This way, the app could serve more concurrent users, and the database wasn’t overwhelmed. At the same time, we still took advantage of horizontal scaling to handle traffic spikes. So even if you use proven approaches, remember that scaling a SaaS app requires more than simply adding servers. You must also coordinate strategies across databases, background jobs, and caching layers. 2. Tenant-Based Resource Isolation Multi-tenant resource isolation is another critical strategy for SaaS scaling. While it may seem obvious that all customers share the same system resources, problems often arise when usage patterns vary significantly across tenants. Designing a multi-tenant architecture is challenging on its own, and it becomes even harder when clients have different needs. For example, in one project, I encountered a situation where a few large customers ran campaigns that triggered hundreds of background jobs simultaneously. Even with auto-scaling and sufficient app servers, these tenants consumed most of the queue and CPU resources. We implemented per-tenant limits on concurrent jobs and queue partitions, with dedicated worker pools for heavy tenants. This ensured that high-usage customers could run their campaigns without affecting others’ experience. I also recommend setting up continuous monitoring of tenant behavior and adjusting limits as needed, so no single customer can impact the experience of others. 3. Independent Scaling of Components The main challenge of this approach is maintaining a clear separation of components and responsibilities. Independent component scaling is most effective when workloads have very different characteristics. For smaller systems with uniform traffic, the added operational complexity may not be worth it. The best way to implement independent scaling is to decouple each part of your system so that changes in one component don’t force changes in others. Give each component its own deployment pipeline and implement fault isolation so failures don’t cascade across the system. I often see teams rely solely on CPU or memory usage to decide what to scale. In my experience, it’s far more effective to consider workload-specific metrics such as queue length, requests per second, or processing rate. These metrics directly reflect real demand. 4. API Integrations SaaS apps typically rely on external APIs for payments, notifications, analytics, or third-party integrations. Scaling requires making these integrations reliable, non-blocking, and resilient under load. If you didn’t adopt an API-first design early on, this can be challenging. Here are several best practices. First, move third-party API calls to background jobs. External services are often slow or unpredictable, so offloading these calls keeps user-facing requests fast and allows retries and error handling to happen asynchronously. Next, implement retries with exponential backoff and circuit breakers. This prevents temporary failures from cascading through your system and overwhelming queues or downstream services. It’s also important to cache responses when appropriate. If an API returns relatively static data, caching reduces unnecessary calls and conserves API quotas. 5. Introducing AI When discussing modern scaling strategies, we can’t ignore AI. AI tools can help scale engineering capacity and improve system quality at the same time. Many businesses now use AI-assisted workflows to improve code quality, testing, and deployment. In my experience, AI can be a major help. As systems grow, codebases become more complex. AI can analyze code, identify overly complex functions or duplicated logic, and suggest refactoring before technical debt accumulates. I’ve found AI particularly useful for testing, which is often a bottleneck when scaling. My team uses GitHub Copilot to generate tests for recent code changes, helping us maintain coverage without writing every test manually. That said, it’s important to remember AI’s limitations. Always combine AI-generated tests with human review for edge cases, and regularly check coverage to ensure nothing is missed. Final Thoughts It’s important not to fall into the trap of default solutions. Every SaaS application presents unique scaling challenges, and success depends on adapting well-known practices with techniques and technologies that fit your context. Some applications struggle with database scaling, others with API performance, and still others with operational complexity or team coordination. The key is to identify your system’s real bottlenecks and build strategies that address them directly.

By Mykhailo Kopyl

Modernizing Applications with the 7 Rs Strategy – A CTO's Guide

Think about the time CTOs spent most of their time fixing old systems. Updates were slow, servers were expensive, and adding new features took time. Now, things have changed. Cloud technology applications can grow fast, collaborate, and meet business demands quickly. CTOs now focus on enhanced systems and helping businesses grow. Using the right app modernization strategy and modernization frameworks, old systems can become faster, safer, and more useful. In this blog, we will explain the 7 Rs of application modernization. You will learn how to update legacy applications, reduce risk, leverage application modernization services, and follow simple steps for success. What Are the 7 Rs of App Modernization? Every organization has a mix of old and new applications. Some still work well, while others hold back progress. The 7 Rs of application modernization help CTOs decide what to do with each system: keep, improve, move, or replace. Let’s look at each of the 7 Rs in simple terms:- Retire What Does Retiring an App Mean? Retiring means removing applications that are old-fashioned, unused, or no longer support business goals. It helps simplify your IT environment and reduce unnecessary costs. When Should You Consider Retiring a Legacy System? When a legacy system becomes costly to maintain, has security risks, or is replaced by modern tools in your app modernization strategy, it's time to retire it.. Business Benefits of Letting Go of Outdated Tools Retiring legacy systems enables organizations to save money, reduce risk, and focus resources on modern technology as part of a strong application modernization roadmap. Example Case Study A retail company retired its legacy inventory-tracking app after migrating to a cloud-native modernization platform. This reduced server costs and improved real-time stock visibility. Retain What Is Retaining in App Modernization? Retaining means keeping an existing system that performs well and supports business operations. How to Decide if an App Is Still Useful As Is If an app is stable, secure, and fits your modernization framework, it may not require immediate changes. Risks and Rewards of Keeping Legacy Systems Unchanged The reward is cost savings and stability. The risk is falling behind in performance or missing out on new features of enterprise app modernization. Example Case Study A logistics company decided to retain its ERP system because it met all enterprise needs and instead focused on modernizing its legacy applications for customer service. Rehost What Is “Lift and Shift” Cloud Integration? Rehosting means migrating your application to the cloud while preserving its code and features, without refactoring. It is usually the early stage in a cloud integration strategy. Why Rehosting Is a Quick Win for CTOs Rehosting allows faster cloud adoption without risk and helps teams realize the benefits of app modernization quickly. When Rehosting Makes Sense for Your Business When your system works well but needs better speed, scalability, and security through cloud-native modernization. Example case study A healthcare company has migrated its payment system to Azure, improving productivity and streamlining management. Replatform What is the Difference Between Replatforming and Rehosting? Replatforming is the process of migrating an app to the cloud and making various changes to improve performance. Unlike rehosting, some adjustments are made to improve compatibility and performance. How to Upgrade Without Rewriting Your App By updating certain components, such as the database or web server, while keeping the core logic the same. Business Value of Moving to a Better Platform Replatforming can make the app faster, more efficient, and easier to update without reformation. This makes it an important step in modernizing outdated systems. Example Case Study A retail company moved its online store to a new system. This improved website speed and reduced maintenance costs. Refactor What Is Refactoring? Refactoring means improving the structure of your code without introducing additional changes. Why Improving Code Matters for Performance When the code is clear, the app runs faster, is easier to fix, and remains free of problems. When Refactoring Helps Reduce Tech Liability When outdated code slows down development or affects the quality of modernizing legacy applications. Example Case Study A software firm reorganized its sales management system, making it easier to maintain and improve customer response times. Rearchitect What Is Re-Architecting? Rearchitecting means reorganizing an application to meet future desires. It often involves moving from a monolithic to an adaptable cloud architecture. How Re-Architecting Your App Supports Scalability It allows the system to handle more users, integrate easily, and support digital transformation for CTOs. When to Consider Re-Architecting for Long-Term Growth When the current system cannot support expansion, high performance, or security goals. Example Case Study A banking company rearchitected its transaction system using Azure app services. This helped process more transactions faster and more reliably. Replace What Does Replacing an App Involve? Replacing means completely removing an outdated system and selecting a new one that better meets business needs. Why Starting Fresh Can Be the Smartest Move Replacement gives organizations access to modern features, reduces maintenance, and aligns with modernization best practices. How to Choose the Right Replacement Solution Select a cloud-based system that supports your enterprise app modernization goals and can grow your business. Example Case Study An insurance company replaced its outdated claims system. The new system improved data precision and customer satisfaction while simplifying management. Modernizing Applications with the 7 Rs Strategy – Summing Up The 7 Rs strategy of app modernization helps CTOs build a clear modernization framework. From retiring old systems to replacing them with modern cloud solutions, each step supports faster, safer, and smarter business operations. CTOs can simplify legacy system transformation, improve performance, and reduce long-term costs. It is not just about moving to the cloud but about creating a flexible system that grows with your business. FAQs How Do CTOs Plan Application Modernization? CTOs start by assessing current systems, identifying performance gaps, and choosing the right 7 R approach based on business goals and technical needs. 2. Which Strategy is Best for Legacy App Modernization? The best strategy depends on your goals. Some systems only require rehosting, while others may require rearchitecting or replacement to improve performance. 3. What is the Difference Between Rehosting and Re-Architecting? Rehosting moves applications to the cloud with minimal changes, while re-architecting redesigns the system to make it more scalable and efficient. 4. Why is App Modernization Important for CTOs? App modernization enables CTOs to reduce maintenance costs, enhance system performance, and prepare their business for future digital transformation needs.

By Pranay Parmar

Cognitive Load-Aware DevOps: Improving SRE Reliability

The site reliability engineering (SRE) community has tended to view reliability as a mechanical problem. So we have been meticulously counting "nines," working on the failover groups, and making sure our autoscalers have all the least settings they need. But something appears to be metamorphosing threateningly: people are becoming increasingly lost in high-availability metrics like 99.99%, which seemingly mask an infrastructure that would melt like butter if not for humans stepping in manually. We have reached the maximum level of complexity. Modern cloud-native ecosystems, including microservices, temporary Kubernetes pods, and distributed service meshes, are experiencing an exponential growth in the amount of traffic they handle. While the infrastructure continues to scale up and down at will, our human cognitive bandwidth, as defined by Miller's Law, simply cannot keep up. We are trying to manage state spaces that approach infinity with something as minimalist as biological bandwidth. That is why service-level indicators (SLIs) and service-level objectives (SLOs) in their standard guise fail. They evaluate the system's results without considering the mental processes required to resolve incidents. If an engineer is cognitively overloaded, their mental model of the system will break down, leading to heuristic errors and "action paralysis." If our mean time to understand (MTTU) keeps rising as we achieve our SLOs, we are creating a Reliability Paradox in which a system becomes so sophisticated that the instruments proposed to provide observability actually increase the noise-to-signal ratio, thereby diverting cognitive capacity from fixing the technical problem. If we are to improve reliability, we must not rely on telemetry alone; we must shift our focus to protecting the engineer's cognitive throughput. The Framework: Cognitive Load Theory in a DevOps Context The use of Cognitive Load Theory (CLT) could be a potential approach to address the Reliability Paradox. When applied to site reliability engineering (SRE), this psychological framework divides mental effort into three types: Intrinsic load: The task involves processing a significant amount of information, such as grasping how a distributed consensus algorithm works. It is contained within and part of the task itself.Extraneous load: The opposite end of the unnecessary cognitive burden spectrum. This is only friction, borne of the painful parsing of fragmented logs, complex boilerplate manifest configurations, or turbulent infrastructure.Germane load: The good load. This alludes to what is actively built in cognitive space to construct and maintain a highly detailed mental model or perform high-quality root-cause analysis (RCA). Visualizing the Mental Capacity The following chart explains how irrelevant stresses interfere with meaningful and efficient work. Fig 1: Visual representation of the SRE Cognitive Theory: Categorizing an engineer’s mental capacity into Extraneous Load, Intrinsic Load, and Germane Load. The Technical Cost of Friction Moving engineers to interface with low-level abstractions instead of operating at higher layers increases Extraneous Load. This exposes them to the 'Leaking Abstraction' issue, where the brain must manually manage the infrastructure's state, thereby depleting mental resources that could be used for critical problem-solving from the start of an incident. High Extraneous Load (Manual and Error-Prone) The task involves managing various cluster contexts, complicated JSON formatting, and internal container labels that do not significantly aid in the primary goal of service patching. High friction: Mental overhead spent on syntax and environment state: Python spec = { "template": { "spec": { "containers": [ { "name": "web", "image": "v2.1.0" } ] } } } print(spec) Low Extraneous Load (The "Paved Road") We can utilize a simple code or a layer of high-level abstraction to ease the load through focusing on the "what," hence the "how," we would foster the engineer to stay in a state of uninterrupted concentration. It is important for the tooling to manage the finer details to prevent syntax issues and reduce manual errors. Low friction: Direct expression of intent with automated safety checks: Python spec = { "specification": { "template_details": { "specification": { "container_list": [ { "container_name": "web_app", "version": "v2.1.0" } ] } } } } print(spec) Our strategic objective is to become known. As SREs move forward with the ruthless extermination of Arbitrary Load to add slack, they keep the germane entities intact, reducing distractions so SREs can conduct deep-dive learning and system hardening. Identifying Cognitive "Hotspots" in the SRE Lifecycle When establishing a reliability strategy that respects natural constraints, it is essential to locate where cognitive load collects. In high-velocity site reliability engineering (SRE) organizations, these "hotspots" usually reach a fever pitch in these three high-friction domains: The Tooling Tax: The Cost of Context Switching When an engineer flips between a time-series database for metrics and another cluster for stack traces, the "Tooling Tax" is levied. Each of these flips signifies a complex path towards the normalization of totally different query languages. The mental effort involved in translating thoughts into code involves continuously switching between conceptual understanding and syntactic structures, rather than experiencing a unified view. Notice the radical shift in syntax for the same logical intent: Python import requests class PaymentAPIErrorInvestigation: def __init__(self, prometheus_url, kibana_url, jaeger_url): self.prometheus_url = prometheus_url self.kibana_url = kibana_url self.jaeger_url = jaeger_url def check_metrics(self): promql_query = ( "sum(rate(http_requests_total{service=\"payment-api\", status=~\"5..\"}[5m])) by (endpoint)" ) print(f"Executing PromQL query for 500-series errors: {promql_query}") response = requests.get(f"{self.prometheus_url}/api/v1/query", params={"query": promql_query}) return response.json() def correlate_logs(self): lucene_query = "service: \"payment-api\" AND status: [500 TO 599] NOT level: \"INFO\"" print(f"Executing Lucene/KQL query for 500-series errors: {lucene_query}") response = requests.get(f"{self.kibana_url}/logs/_search", json={"query": lucene_query}) return response.json() def check_traces(self): spanql_query = "tags='{\"error\": \"true\", \"http.status_code\": \"500\", \"component\": \"proxy\"}'" print(f"Executing SpanQL query for 500 errors in proxy component: {spanql_query}") response = requests.get(f"{self.jaeger_url}/api/traces", params={"query": spanql_query}) return response.json() def investigate(self): print("Starting Payment API 500-series error investigation...\n") # Checking metrics metrics_data = self.check_metrics() print(f"Metrics Data: {metrics_data}\n") # Correlating logs logs_data = self.correlate_logs() print(f"Logs Data: {logs_data}\n") # Checking traces traces_data = self.check_traces() print(f"Traces Data: {traces_data}\n") print("Investigation completed successfully.") # Initialize with appropriate server URLs prometheus_url = "http://<prometheus-server>" kibana_url = "http://<kibana-server>" jaeger_url = "http://<jaeger-server>" # Create an instance of the investigation class investigation = PaymentAPIErrorInvestigation(prometheus_url, kibana_url, jaeger_url) # Run the investigation investigation.investigate() Alert Fatigue: Signal-to-Noise Exhaustion The constant high-volume alerting occupies the prefrontal cortex, the brain area responsible for advanced problem-solving, continuously. When on-duty personnel are bombarded with frequent, distracting alerts, their brains instinctively rely on simple, quick mental shortcuts. Figure 2: Visualizing the "Cognitive Battery," the depletion of mental bandwidth from peak analytical capacity to critical heuristic survival mode. The Color Spectrum of Cognitive Load Green (Peak Capacity): High Germane load availability. The engineer is capable of conducting a thorough investigation into the root cause through deep-dive analysis and also strengthening the systems proactively through hardening.Yellow/Orange (Alert Noise Drain): High extraneous load. Mental activity is consumed by the activity of "Toil," which begins with alerts emanating from points of little context and then culminates in unique downtime to give a run for no practical purpose.Red (Critical Exhaustion): Cognitive bankruptcy. The brain, set to "survival mode" as its default, draws on muscle memory and takes the easy way out. It is at this point that high-risk manual mistakes and "Action Paralysis" are most frequent during outages. Documentation Debt: The Paradox of Information At its core, reliability is merely a search-and-retrieval problem. SREs during critical outages have to deal with the Information Paradox: they have access to vast amounts of documentation but have no actionable answers. A key cause of this issue is 'Documentation Debt,' which indirectly increases the mean time to understand (MTTU), the primary obstacle in the entire incident response procedure. This can be fixed by applying the "30-Second Rule”: If a runbook or wiki requires more than 30 seconds to read for a clear, technical recovery step, it has failed as a reliability tool. In this instance, the documentation is no longer a support tool but a source of Extraneous Load, compelling the engineer to manually "garbage collect" on obsolete data while the system is down. Strategy 1: Platform Engineering and the "Paved Road" To scale reliability, it is necessary to go beyond manual intervention and embrace Platform Engineering as a cognitive shield. The decoupling of SRE from the system through the provision of Abstraction as a Service reduces the "surface area" that SRE must master. The implementation of "sensible defaults" reduces the mental effort required to enter a new service context; when every service follows the same deployment pattern and naming convention, the brain relies on pattern recognition rather than manual discovery. The switch from "Ticket-Ops" to self-service infrastructure is indispensable for the flow state. The substitution of human-in-the-loop requests with API-driven workflows removes the context switching associated with high latency. This "Paved Road" strategy guarantees that the most reliable way to deploy is also the easiest, thus transforming the platform into a reliability engine that automatically handles Extraneous Load. Strategic Shift: From Low-Level Primitives to High-Level Intent The transition from infrastructure drudgery management to self-service and high-level abstraction is expounded upon in the comparisons below: Python import subprocess class RDSInstanceCreator: def __init__(self, db_identifier, db_class, engine, storage, security_group_id, publicly_accessible): self.db_identifier = db_identifier self.db_class = db_class self.engine = engine self.storage = storage self.security_group_id = security_group_id self.publicly_accessible = publicly_accessible def create_rds_instance(self): print(f"Creating RDS instance for {self.db_identifier}...") # AWS CLI command to create RDS instance aws_command = [ "aws", "rds", "create-db-instance", "--db-instance-identifier", self.db_identifier, "--db-instance-class", self.db_class, "--engine", self.engine, "--allocated-storage", str(self.storage), "--vpc-security-group-ids", self.security_group_id, "--publicly-accessible", str(self.publicly_accessible).lower() ] # Execute the command subprocess.run(aws_command, check=True) print(f"RDS instance {self.db_identifier} created successfully.") class PavedRoadRDSCreator: def __init__(self, service_name, environment, tier): self.service_name = service_name self.environment = environment self.tier = tier def create_rds_instance(self): print(f"Creating RDS instance for service: {self.service_name} ({self.environment} environment, {self.tier} tier)...") # Simulate the self-service command platform_command = f"platform create rds --service {self.service_name} --env {self.environment} --tier {self.tier}" # Print the simulated command print(f"Simulated command: {platform_command}") # Here, you can call a method to execute the real service if needed. # For example: subprocess.run(platform_command, shell=True) # Old: Manual Infrastructure Setup rds_creator_old = RDSInstanceCreator( db_identifier="payments-db-prod", db_class="db.t3.large", engine="postgres", storage=20, security_group_id="sg-0a1b2c3d", publicly_accessible=False ) rds_creator_old.create_rds_instance() # New: Self-Service Infrastructure Setup rds_creator_new = PavedRoadRDSCreator(service_name="payments", environment="prod", tier="critical") rds_creator_new.create_rds_instance() The move to platform-escalated technical execution frees SRE minds to focus on high-level system behavior and architectural hardening. Strategy 2: Modern Observability and AIOps The journey from "Monitoring" to "Modern Observability" signifies a shift from tracking certain outputs to the introspection of hidden processes. To preserve the cognitive energy of the SREs, the move must be from raw data dumps to intent-based insights. 1. AIOps as a Noise Filter As a crucial step in templating the Extraneous Load, AIOps is not an "installation that automatically solves problems." Thousands of flapping events can coalesce into a single high-signal incident by reasoning with an extreme machine-learning capability for anomaly correlation and deduplication. This ensures that, in the event of an outage, the first interaction an engineer has is not log-reviewing, which can be a needless hassle. 2. Service Maps: Exploiting Spatial Memory The human brain's processing speed for visual data exceeds that for text by a factor of 60,000. This neatly sidesteps any need for an SRE to mentally reconstruct a microservices architecture from log lines, so long as the spatial memory retained through the Dynamic Service Maps can immediately show where a failure occurred. 3. Curated, Incident-Specific Dashboards Universal dashboards often generate cognitive resistance during use. Modern observability platforms should provide contextual views that, whenever an SLO is breached, will self-trigger. For instance, if the "Payment Latency" SLO is frowning, the SRE should not be shown CPU metrics for "Marketing Blog"; rather, they would see just the crucial telemetry data for the payment path. By curating visual insights from vast, high-cardinality data, we can reduce MTTU (Mean Time to Understand) and prevent cognitive brownout during critical escalations. Figure 3: Transforming high-cardinality telemetry into high-signal insights through automated correlation and visual topology. Strategy 3: Human-Centric SRE Practices We need to shift our thinking from managing machines to managing people to maintain reliability. The human-centric SRE approach considers mental capacity the most vital resource in the system. Bounded-context teams, by providing roles with essential leadership, bring the right tools and designate the right gaps such that SRE remains focused on the existing problem landscape.Toil Budgets: Human mental energy must be viewed as a limited resource, like an Error Budget. If "Toil"—doing repetitive, manual, low-value work — crosses a defined threshold, the team must stop feature work and devote itself to reducing "cognitive debt."The "30-Second Rule": Any critical technical step for recovery must be deduced and acted upon within the first 30 seconds after an engineer is paged. If an SRE is taking minutes to find a runbook, documentation has failed, contributing to Extraneous Load. Measuring Success: Metrics That Matter Typical SLIs/SLOs are lagging indicators of the system's health; we need indicators of Engineer Health. Developer experience (DevEx) and MTTL: The mean time to learn (MTTL) measures the time taken for a cross-trained engineer to become incident-ready on the new service. High MTTL means the system is too complex to operate reliably.Context-switching frequency: To preserve the SRE flow state, the "Interruption Rate" is tracked frequently. A fast-paced context-switching between non-critical tasks and deep work is the primary driver of cognitive bankruptcy. Implementation: Tracking Context-Switching via CLI The script below, as an example, shows how companies could systematically track "interruptions" by recording non-essential manual interventions during working sessions, rather than during deep work. Python # SRE Context Tracker - Measuring Cognitive Interruptions(PYTHON) import time def log_event(event_type, duration_minutes): """ Tracks whether work is 'Flow State' (Germane) or 'Interruption' (Extraneous). Aim for a Flow: Interruption ratio of > 4:1. """ print(f"Logging {event_type} for {duration_minutes}m...") # In a real scenario, this would push to an internal DevEx dashboard # to calculate the 'Interruption Rate' metric. # Example: An SRE shift log_event("DEEP_WORK_SYSTEM_HARDENING", 120) # Germane Load log_event("JIRA_TICKET_INTERRUPTION", 15) # Extraneous Load (Context Switch) log_event("SLO_DASHBOARD_CURATION", 45) # Germane Load Looking into human-centric metrics allows us to invest in High-Availability Cognition so that, whenever systems fail, humans are prepared to make sense of these failures and resolve them. Conclusion: The Future of SRE Is Socio-Technical The development of site reliability engineering has reached a crucial point where machine scalability is itself insufficient. One must argue that reliability is not a simple feature of our code or our cloud infrastructure, but a sociotechnical property of the whole system, where the human mind lies at its foundation. Working to build a reliable system challenges users to consider the human brain as a primary component with fixed biological constraints, and to make direct claims that our technological decisions are not allowed to go beyond human reasoning. Throwing millions of dollars at scaling the clusters, which are forever optimized with autoscalers and database shards for near-infinite traffic, is all for naught if the humans behind these systems operate under cognitive bankruptcy. We need to turn our full focus on high-availability cognition to scale engineers' ability to think clearly, recognize patterns, and maintain and quickly evolve their mental models of the systems they operate. Progress begins with a new outlook during key learning opportunities. Consider conducting a 'Cognitive Load Audit' as part of our upcoming post-incident review or after-action report. Do not merely ask what broke; ask what made the problem difficult to understand. By identifying and ruthlessly automating the extraneous friction found during these audits, we can ensure that our systems remain not only up and running but also truly understandable and resilient in the face of ever-increasing complexity.

By Oreoluwa Omoike

Automating AWS Glue Infra and Code Reviews With RAG and Amazon Bedrock

In many enterprises, the transition from a "working" pipeline to a "production-ready" pipeline is gated by a manual checklist. In most enterprises, a “simple” Glue review involves answering questions like: Is the Glue job deployed?Was it provisioned via CloudFormation?Does the expected crawler exist?Is the code production-grade?Does it follow internal best practices? Traditionally, a senior engineer would spend 4–6 hours per use case and manually: Cross-references CloudFormationOpens job scriptsReviews against a checklist stored somewhereWrites feedback by hand In a fast-scaling organization, this creates a bottleneck. To solve this, I developed a utility that implements shift-left governance — moving the review process from the final "pre-production" stage directly into the development cycle. So instead of waiting for a Senior Engineer to find a missing Crawler or a hardcoded S3 path near production deployment, the developer runs an automated script during development. Proactive vs. reactive: Issues are fixed while in the development phase.Cost reduction: It is significantly cheaper that 4-6hrs a senior engineer would spend.Empowerment: Developers receive instant, expert-level feedback without waiting on a human reviewer. The Solution Architecture: RAG-Powered Governance The core of this implementation is a retrieval-augmented generation (RAG) pattern. While LLMs are smart, they don't natively know your specific "Enterprise Internal Standards." The RAG Approach: Grounding the Review Instead of hardcoding rules into a prompt, we store the Enterprise Best Practices Checklist as a document in Amazon S3. Design Principle: Deterministic Infrastructure Validation Before Probabilistic Inference The key insight was simple: A Glue job review is meaningless without infrastructure context. So instead of building “another code reviewer,” let's build a system that: Starts with real deployed infrastructureVerifies it against enterprise standardsReviews code only if infra checks passUses GenAI with strict grounding The High-Level Architecture At a glance, the system works like this: The reviewer runs a local CLI command with a use case name.The system discovers all Glue jobs for that use case.It validates: Glue job existenceDeployment statusCloudFormation provenanceExpected crawlersIt retrieves enterprise review standards from Amazon S3, along with the glue job script to be reviewed against this checklist.It runs a GenAI-powered code review.It generates Markdown reports per job. All fully automated. Step 1: Discover Glue Jobs Automatically Instead of hardcoding job names, I used the AWS Glue Data Catalog to discover jobs associated with a use case. This allows the review process to: Stay decoupled from naming conventionsWork across environmentsScale as pipelines grow Step 2: Infrastructure Validation (The Missing Piece) Before touching the code, the system validates the infrastructure. For each Glue job, it checks: Does the job exist?Is it deployed?Was it created via CloudFormation?Does required crawler exist? Example output: Plain Text Glue Job Exists - Resource: Glue Job (<gluejob name>) Status: Deployed Source: IaC (CloudFormation: <stack name>) Compliance: Project standards met If something is missing: Crawler Doesn't Exist <crawler name> Expected - Compliance: Project standards not met Step 3: The RAG Approach (Grounding the review) Once infra checks pass, the system sends the script and the RAG-retrieved checklist to a Claude Sonnet 3.5 LLM hosted on Amazon Bedrock. Here’s how: The enterprise checklist is stored in Amazon S3Relevant checklist is retrieved at runtimeGlue job script is retrieved from job metadata at runtime.They are injected directly into the LLM promptThe LLM evaluates code only against retrieved standards The model does not invent rules. It reasons strictly within enterprise-defined constraints. This allows the governance rules to evolve (e.g., adding a new requirement) just by updating an S3 file, without changing a single line of the review code. Why Not Free-Form GenAI? Unconstrained GenAI: Hallucinates standardsProduces inconsistent feedbackIs impossible to audit That’s unacceptable in an enterprise. What the Code Review Output Looks Like Each Glue job gets a Markdown report like this: Markdown ## Code Structure & Readability Is the code modular and function-based? - Status: NEEDS IMPROVEMENT - Explanation: Logic is implemented in a single main function. But there is repetitive code for listing objects in S3 buckets. - Recommendation: Create a separate function for listing S3 objects to reduce repetition. Every checklist item includes: Pass/fail statusClear explanationActionable recommendation Exactly how a senior engineer would review — but consistently. Performance and Productivity Gains By shifting governance left and using RAG, we can turn a slow, expert-driven, manual process into a fast, deterministic, self-service workflow: metricmanual processrag-automated processimprovementReview Time~4 hrs~3-4 Minutes98% Faster Rule Updates Manual Re-trainingUpdate S3 FileInstantReviewer dependencySenior engineerSelf-serviceReallocation of senior expertise to high-impact architectureConsistencyHuman-dependent100% Policy-alignedNo subjective biasReview qualityVariableStandardisedInstitutionalisation of "Best Practices.CostHigh-Cost Engineering HoursNominal API/Compute Cost. (<$0.10 per PR via Amazon Bedrock)>99% Reduction in Direct Review ExpenseAudit trailManually 1 to 1 shared review commentsMarkdown artifactsHigh-fidelity visibility for stakeholders Net result: ~95% reduction in review time. Conclusion: Governance as a Service This implementation proves that "Enterprise Standards" don't have to be a manual burden. By utilizing a RAG approach with Amazon Bedrock, a living governance engine can be created that executes in minutes and is active from the development phase itself. This allows senior engineers to focus on architecture and innovation, while the "bottleneck" of checklist-checking is eliminated. The code for this implementation is available in my GitHub repository. Clone URL: https://github.com/chhabrapooja/infra-aware-code-review-automation.git I will make the repository public following the publication of this article. Why This Works (and Scales) Some reasons are: Infra comes first. Most review tools ignore infrastructure. This one starts there.Grounded GenAI implementation. RAG ensures correctness, consistency, and auditability.Standards are decoupled from code. Updating the checklist requires no redeployment.Decoupling infrastructure by lifecycle and/or use case. Instead of maintaining a single stack, we implemented granular stack segmentation based on the specific pipeline stage to locate specific resources instantly. Future Roadmap CI/CD Integration: Trigger reviews automatically on GitHub Pull Requests.

By pooja chhabra

From Monolith to Modular Monolith: A Smarter Alternative to Microservices

Somewhere around 2015, microservices became gospel. Not a pattern — gospel. You decomposed or you died, architecturally speaking. The pitch was seductive: independent scaling, polyglot persistence, team autonomy that meant engineers could ship without waiting on Gary from the payments team to merge his pull request. Entire conference tracks emerged. Consultants got rich. And a lot of systems got worse. Not all of them. Some genuinely needed the distributed model — genuine scale pressures, organizational boundaries that mapped cleanly to service boundaries, teams mature enough to eat the operational cost without choking. But most? Most were mid-sized SaaS platforms or internal tools that adopted microservices because the narrative was so ubiquitous it felt like technical malpractice not to. Now we're seeing the retrenchment. Not a full retreat — nothing so dramatic — but a recalibration. Teams are building modular monoliths, or migrating back to them, with the kind of quiet determination that suggests they've seen some things. They've debugged distributed traces at 3 a.m. They've watched a deployment pipeline that used to take eleven minutes explode into forty because now there are nineteen services and the dependency graph looks like a neural network designed by someone on hallucinogens. This isn't nostalgia. It's arithmetic. The Microservices Tax Nobody Warned You About Here's what they don't put in the Medium posts: microservices are a distributed systems problem. Full stop. And distributed systems are where simplicity goes to die. You get network latency — suddenly every function call traverses a wire, hits a load balancer, maybe retries if the downstream pod is restarting. Partial failures become your daily weather. Service A succeeds, Service B times out, Service C returns a 503 because someone deployed during lunch, and now you're in this purgatorial state where the order is half-committed and the user is refreshing the page wondering if their credit card got charged twice. Data consistency? Gone. You wanted strong consistency? Should've stayed in-process. Now you're running Saga patterns or two-phase commits or — God help you — eventual consistency with compensating transactions that fail compensatingly. Your deployment pipeline fractures into this Rube Goldberg contraption. Helm charts. Kubernetes manifests. Service mesh configurations. Fifteen repositories each with their own CI/CD config, their own versioning strategy, their own flake rate on integration tests. Teams ship independently, sure, but they coordinate constantly because Service Q depends on a behavior in Service R that changed in version 2.1.3 and nobody documented it because who has time to document every internal contract when you're moving fast and breaking things? Observability becomes its own discipline. Distributed tracing. Log aggregation across clusters. Metrics that make sense only if you squint and know which Grafana dashboard Karen built last quarter. Debugging requires archaeology — tracing a request ID through six services, three message queues, and a Redis cache that may or may not have stale data depending on whether the TTL expired. Infrastructure costs balloon. You're running a service mesh, a distributed tracing backend, centralized logging, a secrets manager, and probably a service catalog because nobody remembers what customer-prefs-svc actually does anymore. Each service needs its own database — right? Right. Except now you have seventeen PostgreSQL instances and your AWS bill looks like a phone number. The CNCF surveys and ThoughtWorks Radar reports all say the same thing, in polite consulting-speak: teams systematically underestimate these costs. Especially when the system doesn't justify them. Especially when you have eight engineers total and you just split your application into microservices because that's what the architecture diagram at that conference looked like. What you get — and I've seen this enough to know the pattern — is: Tight coupling anyway. Services share database schemas. Or they call each other synchronously in ways that make the network boundaries feel like a formality.Chatty communication. Because you carved the boundaries wrong—turns out "users" and "preferences" really needed to be colocated—so now every request involves four round-trips.Shared databases. The anti-pattern everyone warns about, but which happens because splitting the data model cleanly is harder than splitting the codebase.Coordination overhead. Standup takes longer. Planning takes longer. Deploys require Slack threads to make sure nobody else is deploying. This isn't agility. It's accidental complexity with better PR. The Modular Monolith: What It Actually Is A modular monolith is a single deployable artifact — one process, one runtime — that is ruthlessly, structurally divided into well-defined modules with enforced boundaries. Not a "big ball of mud." That's the lazy monolith, the one where everything references everything and the dependency graph is a directed cyclic catastrophe. The modular version applies domain-driven design with teeth. Bounded contexts aren't aspirational; they're enforced. Modules own their data. They expose explicit interfaces. Internal implementation stays internal, protected by language visibility rules or architectural tests that fail your build if someone tries to reach across the fence. In practice: Clear domain boundaries. Each module represents a cohesive business capability: billing, inventory, notifications. Not "utils" or "helpers" or "shared."No ambient state sharing. Modules don't reach into each other's databases or internal classes. Communication happens through defined contracts — interfaces, events, explicitly published APIs.High cohesion, low coupling. The stuff that changes together lives together. The stuff that's independent stays independent, even though it's compiled into the same artifact.Single deployment unit. One JAR. One container. One thing to version, one thing to deploy, one rollback target. You enforce this through: Package visibility rules. Java's module system. C#'s internal access modifiers. Whatever your language gives you to hide things.Architectural fitness functions. Tools like ArchUnit that fail your build if someone adds an illegal dependency.Dependency inversion. Modules depend on abstractions, not implementations.Clear ownership. Each module has a team or individual responsible for its contracts. It sounds simple because it is. But simple isn't easy — it requires discipline. Why This Works Better Than You'd Expect In-Process Calls Are Absurdly Fast No network. No serialization. No retry logic. When one module calls another, it's a function call. Nanoseconds. The error handling is try-catch, not circuit breakers and bulkheads. You don't need distributed tracing to figure out why something failed — you have a stack trace. One stack trace. In one process. This isn't trivial. Eliminating the network boundary removes an entire class of failure modes. Systems become legible again. Deployment Gets Boring (In a Good Way) One artifact means one CI/CD pipeline. Build, test, package, deploy. Rollback is a single operation — redeploy the previous version. No orchestration across twelve repositories. No "Service F is on v2.3 but Service G needs v2.4 so we're in this weird compatibility purgatory." Versioning becomes sane. You version the whole thing. Breaking changes are internal refactorings, not cross-service API migrations requiring coordinated deploys and backward-compatibility shims. Lead time for changes — the DORA metric everyone cares about — improves because you're not waiting on three other teams to merge their changes before yours can go live. Domain Modeling Gets Real When everything's in one process, premature extraction is harder. You're forced to think through the bounded contexts before you draw lines. This is good. Most teams carve boundaries too early, based on guesses about scale or team structure, and then spend the next two years dealing with the consequences. A modular monolith lets the domain model stabilize. You discover where the natural seams are — not where you thought they'd be, but where they actually are, revealed through usage, change patterns, performance profiles. When you finally do extract a service, it's because you have evidence: this module has different scaling characteristics, or this team has genuinely divergent release cadences, or this data needs regulatory isolation. The abstractions are cleaner. Stable. Less churn. Microservices Become a Choice, Not a Default Here's the strategic part: a well-built modular monolith is microservices-ready. Each module is already isolated. It has its own data contracts, its own domain logic, its own interface. When you need to extract it — genuinely need to, with evidence — you: Move the module to its own repositoryGive it a databaseWrap it in an HTTP or gRPC APIUpdate the monolith to call it remotely This is the Strangler Fig pattern, done right. You're not rewriting the world. You're selectively extracting components under pressure, with clear motivations: this module needs to scale independently, or this team needs deployment autonomy, or this functionality has genuine latency-sensitive requirements. The risk drops precipitously because you're not guessing. You're reacting to measured need. What the Data Actually Shows This isn't theory. Shopify has talked openly about the cost of their microservices sprawl — hundreds of services, coordination overhead eating velocity, performance degraded by inter-service chatter. GitHub's engineers have written about similar challenges. These aren't small companies. They're platforms operating at legitimate scale, and even they've found that not every problem needs a distributed solution. ThoughtWorks' Technology Radar — one of the more sober assessments in the industry — has repeatedly flagged modular monoliths as a sensible default for new systems. Not a fallback. A default. The baseline from which you deviate only with justification. Internal platform teams, the ones running hundreds of services in production, increasingly report that stability and developer productivity improve when service extraction happens reactively, not proactively. You split when the pain of not splitting exceeds the pain of splitting. Before that threshold, you're just pre-optimizing for scale you don't have and organizational boundaries that haven't ossified yet. The pattern is consistent: architecture should follow evidence, not fashion. When You Actually Need Microservices Modular monoliths aren't universal. There are scenarios where distribution is justified from the start: Large, independent teams. If you have fifty engineers and clear product boundaries, independent deployability might be worth the coordination cost.Extreme scale requirements. Genuine traffic spikes that require horizontal scaling of specific components, not the whole app.Regulatory isolation. PCI compliance boundaries, multi-tenancy requirements that demand physical separation.Polyglot necessity. Rare, but real: sometimes you genuinely need Python for ML inference and Go for a low-latency API and neither can compromise. The difference is timing and intent. You're choosing distribution because of a concrete pressure, not because it's 2025 and microservices are still cool. The Question Worth Asking The shift to modular monoliths represents something subtler than a pendulum swing. It's a maturation of architectural thinking — a recognition that complexity has a cost, that distribution is a tool not a destination, that the best architecture is the one that lets you defer the hardest decisions until you have data. The old question was: "How fast can we move to microservices?" The better question, the one seasoned builders ask on Monday morning when they're staring at a greenfield project or a legacy system that needs refactoring: "How long can we stay simple while remaining adaptable?" That's where modular monoliths thrive. In that liminal space between chaos and premature optimization, where you're building for the system you have, not the system you imagine you'll need in three years when you're Netflix-scale and you're definitely not going to be Netflix-scale. Build the monolith. Make it modular. Extract services when you must, not when you can. The rest is just fashion.

By David Iyanu Jonathan

The Serverless Ceiling: Designing Write-Heavy Backends With Aurora Limitless

For years, serverless architectures have solved one half of the scalability problem. Compute is no longer the bottleneck. Platforms like AWS Lambda can absorb sudden traffic spikes without advance provisioning. But the moment the compute layer needs to persist data in a relational database, the model starts to strain. Thousands of concurrent functions quickly converge on a single write endpoint, and what looked like elastic scale turns into contention. This gap has traditionally forced difficult trade-offs. Teams either moved to key-value stores and redesigned their access patterns, or they implemented application-level sharding — pushing database routing into business code and accepting operational complexity. Amazon Aurora Limitless Database introduces a third option: managed horizontal sharding for PostgreSQL. It removes the need for application-managed shards while preserving SQL semantics. But it does not remove the need for architectural discipline. Aurora Limitless behaves very differently from a single-node database. Treating it as “just PostgreSQL, but bigger” leads to higher latency, higher cost, and harder debugging. This article explains how to design for those differences using a pattern that works well with serverless compute: Lambda → RDS Proxy → Aurora Limitless Understanding the New Topology In a standard Aurora cluster, applications connect to a primary writer instance. Aurora Limitless removes that concept. Instead, applications connect to a Transaction Router. The router is a stateless fleet that parses incoming SQL and determines where it should execute. Every query takes one of two paths. Fast Path: Single-Shard Execution If the query includes a shard key predicate, for example: SQL WHERE account_id = '123' The router can deterministically route the request to a single shard. Execution is local, predictable, and scales linearly. Slow Path: Scatter-Gather If the query does not include the shard key, the router must broadcast it to all shards, wait for responses, merge results, and return them to the client. The architectural objective with Aurora Limitless is straightforward: design schemas and queries so that most requests take the fast path. Why RDS Proxy Is Not Optional Serverless compute introduces bursty connection behavior. A sudden traffic surge can create thousands of concurrent Lambda invocations in seconds. Without a connection governor, those invocations attempt to establish thousands of TLS connections directly to the Transaction Router. This is where systems fail — not because queries are slow, but because connection management overwhelms the router. RDS Proxy addresses this by multiplexing many logical client connections onto a smaller pool of persistent backend connections. Twenty thousand Lambda invocations can be reduced to dozens of active database connections. Without RDS Proxy, Aurora Limitless becomes vulnerable to connection storms. With it, the router can focus on query routing rather than socket management. The Pinning Trap RDS Proxy relies on connection reuse. That reuse breaks if the application modifies session-level state. For example: JavaScript // Avoid this inside request handlers await client.query("SET search_path TO my_schema"); await client.query("SET timezone TO 'UTC'"); When session state changes, RDS Proxy must pin that client to a dedicated backend connection. Pin enough clients, and multiplexing disappears. At scale, this results in connection exhaustion and instability. Rule: All session configuration must be defined in RDS Proxy initialization queries or database parameter groups. Never issue SET commands inside Lambda handlers. Schema Design for Shard Locality Aurora Limitless introduces explicit table modes. Choosing the right one determines whether queries stay local or fan out. Sharded Tables High-volume tables, such as transactions, events, or logs, should be sharded. The shard key must be part of the primary key. SQL SET rds_aurora.limitless_create_table_mode = 'sharded'; SET rds_aurora.limitless_create_table_shard_key = '{"account_id"}'; CREATE TABLE transactions ( transaction_id BIGSERIAL, account_id UUID NOT NULL, amount DECIMAL(19,4), created_at TIMESTAMP DEFAULT NOW(), PRIMARY KEY (account_id, transaction_id) ); This guarantees that all rows for a given account reside on the same shard. Reference Tables Small, relatively static datasets such as currency codes or country lists should be defined as reference tables. These are replicated to every shard, allowing joins to remain local. Without reference tables, even simple joins introduce cross-shard network traffic. Enforcing Local Queries in Application Code Application code must respect shard boundaries. Every query should include the shard key whenever possible. Below is a Node.js Lambda example aligned with that constraint: JavaScript const { Pool } = require('pg'); const pool = new Pool({ host: process.env.DB_ENDPOINT, ssl: { rejectUnauthorized: true } }); exports.handler = async (event) => { const { accountId, amount, currency } = JSON.parse(event.body); const client = await pool.connect(); try { await client.query( `INSERT INTO transactions (account_id, amount, currency_code) VALUES ($1, $2, $3)`, [accountId, amount, currency] ); // Join remains local because currencies is a reference table const result = await client.query( `SELECT t.amount, c.exchange_rate FROM transactions t JOIN currencies c ON t.currency_code = c.currency_code WHERE t.account_id = $1 ORDER BY t.created_at DESC LIMIT 5`, [accountId] ); return { statusCode: 200, body: JSON.stringify(result.rows) }; } finally { client.release(); } }; Observability: Stop Trusting Averages Once the code is deployed, the challenge shifts to operations. In a single-node database, CPUUtilization is often a reliable signal. In Aurora Limitless, it is not. The system introduces two independent compute layers, and each fails for different reasons. Both must be observed separately. Transaction Router Metrics: High router CPU with low shard CPU usually indicates: Connection storms (missing or misconfigured RDS Proxy)TLS handshake pressureSession pinning preventing connection reuse In this case, scaling shards will not help. The bottleneck is routing and connection management. Shard Group Metrics: High CPU on one shard with low utilization on others indicates a hot shard. This almost always points to a poor shard key choice (for example, timestamp-based keys or low-cardinality values). Actionable Rule: Do not monitor cluster-wide averages. Build dashboards that explicitly separate router CPU and per-shard CPU. Averages hide the exact failure modes Limitless introduces. The Cost Model Trap: Scatter-Gather Multiplies Your Bill Performance is not the only operational risk — billing is the other. Aurora Limitless uses a serverless pricing model based on Aurora Capacity Units (ACUs). What’s easy to miss is where those ACUs are consumed. A scatter-gather query does not just run slower; it consumes compute on every shard involved. For example: SQL SELECT * FROM orders WHERE status = 'FAILED'; In a monolithic database, this is a single index scan. In Aurora Limitless, the router must broadcast the query to all shards, execute it on each shard, and aggregate results centrally. If your cluster has N shards, that query costs roughly N times more compute than a shard-local query. Actionable Rule: Audit query logs for statements that do not include the shard key. In Aurora Limitless, these are not just performance smells — they are billing risks. Distributed Sequences: IDs Are Unique, Not Ordered In a single PostgreSQL instance, BIGSERIAL values increase monotonically. Aurora Limitless intentionally breaks this assumption to avoid global coordination. Each shard is allocated ranges of sequence values independently. This means a later insert can receive a lower ID than an earlier insert. Ordering by ID no longer represents time. Safe Alternative: Always use a timestamp column (for example, created_at) for ordering, pagination, and recency queries. Rule: Treat sequence-generated IDs as identifiers only — never as a proxy for insertion order. Conclusion Aurora Limitless closes a long-standing gap between elastic compute and relational persistence. It allows SQL-based systems to scale beyond the constraints of a single writer without forcing application-managed sharding. That benefit comes with responsibility. Schema design, connection management, and query patterns directly determine whether the system scales efficiently or becomes an expensive bottleneck. If you respect shard locality, govern connections, and design with the router in mind, Aurora Limitless enables relational backends that scale with serverless workloads. If not, it simply distributes the bottleneck across more machines.

By Nabin Debnath

Claude Cowork: AI Agents’ Email Moment for Non-Coders

TL; DR: Claude Cowork AI agents have long promised productivity gains, but until now, they demanded coding skills that most agile practitioners lack or are uncomfortable with. In this article, I share my first impressions on how Claude Cowork removes that barrier, why it is a watershed moment, and how you could integrate AI Agents into your work as an agile practitioner. Why Claude Cowork Changes How Knowledge Work Will Be Done There are rarely stop-the-press moments in technology. Most “announcements” are incremental improvements dressed up in marketing language. Claude Cowork is different. Anthropic released it on January 12, 2026, and it marks a turning point for how non-developers can work with AI. Let me explain why: The PC Parallel I remember the early PC era. Everyone asked the same question: beyond typing documents, what else can I do with this thing? The answer for most people was to program a database to catalog their CD collection. Neither useful nor revolutionary. Then connectivity arrived, and with it email. Suddenly, the PC was no longer a glorified typewriter. It became a communication hub. The use case was obvious, practical, and immediately applicable to daily work. Claude Cowork feels like that moment for AI agents. Why Claude Code Felt Overwhelming Over the Christmas holidays, Claude Code dominated the AI conversation. Developers shared impressive demonstrations of autonomous coding, file management, and workflow automation. The tool runs in your terminal with full access to your local files. It can read, create, manipulate, and organize anything on your machine. The problem: I stopped using command-line interfaces after quitting DOS 6.23. I am not a developer. The terminal is intimidating, and “just learn it” is not a realistic suggestion for practitioners who have other priorities. Claude Code’s power was real. Its accessibility for non-coders was not. What Changed With Claude Cowork Claude Cowork removes the terminal barrier. It runs in the Claude Desktop app with a proper graphical interface. You point it at a folder on your Mac, describe what you want done, and it works. The underlying technology is the same as Claude Code. The experience is entirely different. Instead of typing commands into a black screen, you see a sidebar showing progress, artifacts being created, and context being tracked. You can review Claude’s plan before it is executed. You can check in during execution or let it run to completion. You come back to the finished work. (Well, at least that is the intent, putting the bumpiness of an early prototype aside.) This is what “AI agent” should have meant all along: an assistant that takes a task, works on it independently, and delivers results. Not a chatbot that waits for you after every response. My Claude Cowork Test Drive I ran a simple experiment yesterday. I have a messy folder containing 142 prompt files I collected for version 2 of the AI4Agile online course. The files are macOS .textClipping format, inconsistently named, and scattered without structure. I asked Claude Cowork to organize and tidy up this folder. I also pointed it to the AI4Agile v2 curriculum document so it could align the prompts with the course modules. You can download the Claude Cowork: AI Agents’ Email Moment for Non-Coders PDF here. Here is what happened: Claude first explored the folder and read the curriculum document. It asked clarifying questions: What output format do I want? Reorganized folders, an index document, or both? I chose the hybrid option. It then proposed a plan: convert all 142 .textClipping files to Markdown, categorize by use case, create a master index with tags for role (Scrum Master, Product Owner, Coach) and the A3 classification from my course (The A3 Assist, Automate, Avoid Decision Framework when to use AI to what extent), and map each prompt to the relevant AI4Agile module. I gave it the GO signal. The execution was not perfect. At one point, I hit an API error (529: Overloaded). Right now, Claude Cowork is in research preview, and the infrastructure is still struggling to keep up with demand. I waited and tried again. Claude resumed where it left off. The result: 128 prompts converted and organized into 10 folders, each aligned with the 8-module curriculum plus a bonus folder. A master index file with descriptions, tags, and module mappings. The original folder was left untouched so I could verify before deleting. Total time from my side: writing the initial prompt, answering three clarifying questions, and giving the GO. Claude did the rest. The Challenging Questions This experiment surfaced something I had not considered. Claude Cowork works better when your files are organized. It reads folder structures, file names, and document contents to understand context. If your digital workspace is chaos, Claude inherits that chaos. The irony is striking. People who can structure their work, separate topics at the document level, and maintain clean folder hierarchies will get more value from AI agents than people with disorganized systems. The advantage goes to those who were already disciplined. Which raises practical questions for every knowledge worker, agile practitioners included: How would you need to reorganize your work to make it accessible to an autonomous agent?What changes to your file structure, naming conventions, and folder logic would help Claude help you? I plan to ask Claude Cowork exactly this question. I suspect my habit of using my calendar as a to-do list is one of the first things that needs to go. What This Means for Agile Practitioners I teach AI for Agile practitioners. Until yesterday, I struggled to explain where AI agents fit into the daily work of a Scrum Master, Product Owner, or Agile Coach. The examples always felt theoretical. Claude Cowork makes the application concrete. Consider these use cases: A Scrum Master could point Claude at a folder of Sprint Retrospective notes and ask it to identify recurring themes across the last six months. Not a summary of one document, but a pattern analysis across many, possibly in various formats, from PNGs of stickies on a wall to a CSV file with Jira export data used as input for one Retrospective.A Product Owner or Product Manager could provide access to customer feedback files, the current product roadmap/the Product Goal, and the Product Backlog, then ask Claude to suggest which Product Backlog items address the most frequent complaints to update concerned stakeholders with a weekly status report. (I know, status reports sound unagile, but “bait the hook, feed the fish.")An Agile Coach working with multiple teams could have Claude regularly analyze meeting notes, Slack exports, and team health surveys to surface coaching opportunities. These are not chatbot tasks. They require sustained work across multiple files, context from different sources, and deliverables that go beyond a single response. Also, those tasks are most likely routine operations triggered by the availability of new evidence or data, or by a set cadence. Current Limitations Claude Cowork is a research preview with clear constraints. You need a Mac. You need a Claude Max subscription ($100-200/month). The feature does not sync across devices. Projects, Memory, and Skills are not yet integrated. Chat sharing is disabled. I expect these limitations to shrink quickly. The community response has been strong, and Anthropic has an incentive to expand access for many reasons: a new funding round is imminent, and the economic opportunities are immense. Many more non-coding professionals can now use AI agents via Claude Cowork than programmers who can use Claude Code properly. The security concern around prompt injection is real. Malicious content in files could trick Claude into taking unintended actions. Anthropic has built defenses, but agent safety is still evolving across the industry. I am cautious but willing to experiment. What I Am Testing Next Three experiments are on my list: First, I want to connect it in Chrome so Claude Cowork can browse on my behalf. For example, scanning my Twitter timeline to identify helpful posts and checking my Feedly feed to curate suitable articles for the Food for Agile Thought newsletter. If this works, newsletter curation enters a different phase. (To give you an idea: Curating, producing, and distributing a single edition of the “Food for Agile Thought” takes about 6 hours.) Second, I will test Claude Cowork on an accounting task. I need to compile a list of all my 2025 invoices that do not contain VAT for my tax filing. This means reading hundreds of Excel files, yes, I use Excel to write invoices, extracting relevant data, and aggregating it into a new spreadsheet. A perfect test of whether the productivity promise holds. Third, I want Claude to analyze my current file organization and tell me what needs to change. What folder structures, naming conventions, or documentation habits would make it more effective? I am genuinely curious what it recommends. Conclusion: The Mindset Shift The hardest part of Claude Cowork is not the technology. It means accepting that I can hand over tasks to an assistant who deserves the name. For years, “AI assistant” meant a chatbot that answered questions. Claude Cowork is different. It takes work, executes independently, and delivers results; the interaction model shifts from synchronous conversation to asynchronous delegation. This shift requires trust. It requires letting go of the need to supervise every step. It requires accepting that the output might not match exactly what you would have produced yourself, but that the time savings justify the tradeoff. I am still getting used to this idea. But after organizing 142 prompts in a fraction of the time it would have taken me to do it manually, I am motivated to keep experimenting. Claude Cowork is available now for Max subscribers on macOS. If you have been waiting for AI agents to become practical for non-coders, the wait is over.

By Stefan Wolpers

CORE

Building an Internal Document Search Tool with Retrieval-Augmented Generation (RAG)

Why RAG Matters Now Large language models (LLMs) have shown how far generative systems can go. They draft text, answer questions, and even support software development. Yet they have a clear weakness. Models trained on public data often hallucinate and almost always lack access to company-specific knowledge (Ji et al., 2023). Relying only on pre-trained knowledge is risky when answers must be exact, such as in finance, healthcare, or HR policies. Retrieval-Augmented Generation, or RAG, has emerged as a practical solution. Instead of expecting the model to know everything, RAG connects the model to external sources of truth (Lewis et al., 2020). A user query is matched with relevant documents, and the model generates a response grounded in those documents. This approach closes the gap between general intelligence and domain expertise. The open question for many developers is whether RAG is just a patch for hallucination, or if it is the foundation for enterprise-ready AI. Understanding RAG: The Technical Foundation RAG brings together two systems. The first is the retriever, which works like a search engine. It turns the user query into an embedding, searches a vector database such as FAISS or Pinecone, and returns the top matching chunks (Johnson et al., 2019; Pinecone, 2025). The second is the generator, which is the language model itself. The retriever supplies the facts, and the model shapes them into a clear answer. The pipeline is straightforward: User Query → Embedding → Vector Search → Retrieved Documents → LLM Response This extra step of retrieval means the model is not guessing. It is working with real, grounded data that belongs to the business (Lewis et al., 2020). Why Use Python and Frameworks like LangChain Python dominates AI development because of its rich ecosystem, quick prototyping, and large community (Van Rossum & Drake, 2009). Frameworks such as LangChain remove much of the boilerplate needed to connect a model with tools and memory (LangChain, 2025). Developers can focus on logic rather than wiring. Vector databases add another layer. FAISS is fast for local experiments (Johnson et al., 2019), while Pinecone is designed for scaling into production (Pinecone, 2025). Together, these tools make it possible to build reliable RAG systems in days rather than months. Practical Implementation: Building an Internal Document Search Tool Retrieval-Augmented Generation (RAG) can feel abstract until you see it in action. A good starting point is an internal document search tool. Many organizations have handbooks, policies, or product manuals that are too large for a language model to memorize. With RAG, we can build a system that searches these documents, retrieves relevant content, and produces grounded answers (Lewis et al., 2020). Data Preparation and Chunking Language models work best with short, focused pieces of text. A single handbook or PDF may run into hundreds of pages, which cannot be processed effectively in one go. To solve this, the document is split into smaller, overlapping chunks. Each chunk preserves enough context to make sense on its own (LangChain, 2025). A typical size is 500–1000 tokens with an overlap of 100–200 tokens to maintain continuity. Python from langchain.text_splitter import RecursiveCharacterTextSplitter from langchain_community.document_loaders import TextLoader # Load raw text from a file docs = TextLoader("employee_handbook.txt").load() # Split into chunks splitter = RecursiveCharacterTextSplitter( chunk_size=1000, chunk_overlap=150 ) chunks = splitter.split_documents(docs) Embeddings and Indexing with FAISS or Pinecone To make the text searchable, we convert each chunk into a vector representation. Embeddings capture the meaning of text as a list of numbers in a high-dimensional space. Similar chunks will be close to each other in this space (Mikolov et al., 2013). Python from langchain_openai import OpenAIEmbeddings from langchain_community.vectorstores import FAISS # Create embeddings embeddings = OpenAIEmbeddings() # Store chunks in FAISS index vector_store = FAISS.from_documents(chunks, embeddings) FAISS works well for local experiments and provides fast similarity search over dense vectors (Johnson et al., 2019). For production scale, Pinecone can be used to manage storage and retrieval in the cloud (Pinecone, 2025). Python from langchain_community.vectorstores import Pinecone import pinecone pinecone.init(api_key="YOUR_KEY", environment="us-east1-gcp") index_name = "company-docs" vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name) Querying the System Once the index is built, it can be searched with a user query. The query is turned into an embedding, compared with the stored vectors, and the most relevant chunks are retrieved (Lewis et al., 2020). The code below shows the raw text retrieved from the handbook. Python retriever = vector_store.as_retriever(search_kwargs={"k": 4}) results = retriever.get_relevant_documents("What is our HR policy on remote work?") for doc in results: print(doc.page_content[:200]) The next step is to combine this with the language model. Putting It Together with LangChain LangChain makes it easy to connect the retriever with an LLM. The retriever supplies the context, and the model generates a final answer (LangChain, 2025). Python from langchain.chains import RetrievalQA from langchain_openai import ChatOpenAI llm = ChatOpenAI(model="gpt-4", temperature=0) qa = RetrievalQA.from_chain_type( llm=llm, retriever=retriever, chain_type="stuff" ) query = "Summarize our HR policy on remote work" response = qa.run(query) print(response) Here, the answer is grounded in the actual policy text. If the context does not contain the information, the model should respond that it does not know. This reduces hallucination and improves trust (Ji et al., 2023). Diagram: Workflow Overview Below is a simple view of the workflow: This workflow shows how RAG bridges the gap between raw documents and intelligent answers. With just a few steps — chunking, embedding, indexing, retrieving, and generating — you can turn static files into an active knowledge assistant for your team. Real-World Benefits and Metrics The main advantage of RAG is that it produces answers grounded in real documents rather than relying on statistical guesswork. This grounding improves accuracy and helps reduce hallucinations common in standalone language models (Ji et al., 2023; Lewis et al., 2020). For organizations where precision is critical — such as finance, healthcare, or compliance — this difference can make adoption possible. RAG also improves efficiency. Employees no longer need to manually search through long PDFs or outdated wikis. A well-designed RAG system can retrieve and summarize relevant information in seconds, saving time in knowledge management tasks (LangChain, 2025). To measure value, developers can track three practical metrics: First, response accuracy, evaluated through human review or benchmarks.Second, latency per query, which determines production readiness.Third, cost per 1,000 queries, combining token usage and storage costs (Pinecone, 2025). Several companies already use RAG to build internal policy search tools or to provide more reliable customer service answers, demonstrating that these benefits are achievable at scale. Common Pitfalls and How to Avoid Them Like any system, RAG has weaknesses. The principle of “garbage in, garbage out” applies strongly. If documents are outdated or poorly written, the answers will reflect that. Retrieval quality is also sensitive to indexing. Indexing too broadly can pull in irrelevant results, while indexing too narrowly risks missing context (Johnson et al., 2019). Another issue is context overload. Language models have limits on how much text they can process at once. Overloading the context window can lead to drift, where the model ignores important sections or produces inconsistent answers (Ji et al., 2023). Practical recommendations include carefully chunking documents to preserve coherence, applying metadata filters to keep retrieval focused, and regularly monitoring outputs to catch drift or bias. With these measures in place, RAG systems can remain robust even in demanding environments. Key Takeaways RAG solves a critical limitation of large language models: it grounds responses in verified, organization-specific documents, reducing hallucinations and boosting reliability.Building a RAG-based internal search system is now accessible thanks to frameworks like LangChain, FAISS, and Pinecone, which handle embeddings, retrieval, and orchestration.Proper data chunking and indexing are essential. Well-structured document splits (500–1000 tokens) with overlap ensure coherent and context-rich retrieval.Performance can be measured and optimized through three core metrics: response accuracy, latency per query, and cost per 1,000 queries.RAG is not just a temporary patch — it is the foundation for enterprise-ready AI, enabling companies to turn private data into intelligent, grounded, and trustworthy assistants.

By Manish Adawadkar

GraphQL vs REST — Which Is Better?

GraphQL and REST are two of the most popular API frameworks currently in use, with GraphQL seeing increasing adoption. GraphQL was developed by Facebook to address key challenges in REST APIs, particularly in reducing latency for mobile applications. The debate over which is better continues within the developer community, with strong arguments on both sides. The choice between GraphQL or REST depends on the type of application being built. GraphQL addresses latency issues caused by over-fetching and under-fetching, and provides built-in support for real-time updates. On the other hand, REST leverages HTTP constructs for caching, rate control, and telemetry, allowing these concerns to be handled at the platform layer rather than the application layer. Let’s explore the pros and cons of GraphQL and what an architect should evaluate when making a decision for their project. GraphQL Pros Avoidance of Data Over-Fetching/Under-Fetching One of GraphQL’s biggest advantages is preventing over-fetching of data. In REST, a single API endpoint may return more data than needed for certain clients, especially in mobile applications with limited UI real estate. This issue is typically addressed in REST using the Backend-for-Frontend (BFF) pattern, but there is no standard BFF framework. GraphQL allows clients to specify exactly which fields they need, preventing unnecessary data retrieval. Example A web app might request a user’s name, address, and email.A mobile app might request only the user’s name to optimize bandwidth. Benefit Conserves bandwidth and reduces latency. Reduced Latency REST does not have a built-in orchestration layer, meaning a client might need to call multiple REST APIs to fetch all required data. With GraphQL, a single endpoint can handle multiple entities at once, reducing network calls. GraphQL achieves this through GraphQL servers like Apollo Server, Hasura, and GraphQL Mesh, which resolve queries, orchestrate backend calls, and return a unified response. This could be achieved in REST via BFF pattern but there is no standard implementation available. REST Example A client fetching user data and order history would need: /users/{id} → Fetch user details/orders/{userId} → Fetch order history GraphQL Example Plain Text { user(id: 1) { name email orders { orderId totalPrice } } } Benefit Reduces multiple network calls into a single request, lowering latency. Real-Time Updates Subscriptions One of the biggest pros of GraphQL is its support for real-time updates via subscriptions, allowing clients to receive automatic updates when data changes. Example Use Cases Live stock market updatesChat applicationsReal-time notifications GraphQL subscriptions typically use WebSockets, though alternatives like Server-Sent Events (SSE) and polling are also possible. Benefit Efficient real-time data handling, better than REST-based polling. Decoupling of Frontend from Backend Implementation GraphQL actors (e.g., clients, servers, resolvers) interact with data as an entity graph, vis-à-vis a REST API, which is based on structured endpoints. Thus, GraphQL enables higher decoupling of the frontend from the backend implementation. Benefit Frontend teams can evolve independently from backend teams without requiring backend API changes. Versioning In REST, API versioning (/v1/users, /v2/users) is required when changing or deprecating fields. GraphQL eliminates versioning by marking deprecated fields without breaking existing clients: Plain Text type User { id: ID! name: String email: String @deprecated(reason: “Use contactEmail instead”) } Benefit Backend modifications do not force frontend updates. GraphQL Cons With that, let’s look at GraphQL's cons. One of the biggest issues is the impact on HTTP constructs which traditionally web applications have relied upon for cross cutting concerns like telemetry, caching, rate limiting and security. Caching REST leverages URL-based caching at the CDN, API Gateway, Reverse proxy and browser levels. Example You can have separate caching policies for following endpoints based on URL pattern, with the cache key being generated out of the URL Pattern and query param /locations?city=NewYork/restaurants?category=Italian GraphQL, however, uses a single endpoint, making traditional URL-based caching ineffective. We could overcome this by using HTTP GET for queries (cannot use GET for mutations). Caching can then be governed based on query params. However, this approach can become very complex because there may be several variations of the query for the same entities, and even within the data properties of entities. Without careful design, caching policy using this approach may become infeasible or less effective than in REST. In addition, developers must be careful not to use GET for large queries, as browsers enforce size limits on query parameters. This becomes a big challenge when an architect wants to decouple caching from one of the network layers (e.g., a CDN). This issue can be partially resolved using GraphQL’s persisted queries. In persisted queries, the client passes a query ID rather than the whole query. This helps address query variability and the query param length issue. However, the GraphQL server now needs to store the client query schema, thus introducing a bit of coupling between frontend and backend. Tools like Apollo GraphQL provide sophisticated caching capabilities at the client or server. But the CDN (edge) side caching challenge remains, even though Cloudflare and Akamai now offer better support for GraphQL caching. Furthermore, not all GraphQL implementations support HTTP GET and rely on POST for both queries and mutations, which renders this approach infeasible. This could be overcome by sending query params (say the “operation name”) in a POST request, but that will make it a non-standard implementation and an anti-pattern. Summary Due to GraphQL’s single endpoint structure, caching must often be handled at the application level (client and server) rather than being delegated to platform layers like CDN or reverse proxies (e.g., NGINX), as is common in REST APIs. Telemetry REST relies on structured URLs (/users, /orders), making it easy to monitor performance per endpoint. Since GraphQL uses one endpoint (/graphql), API Gateways and CDNs cannot track performance per entity. These stats need to be captured at the GraphQL application layer, and many telemetry tools now provide query-level insights. CDN tools like Akamai provide Telemetry for GraphQL by analyzing the operationName param in request. Persisted queries also enhance the Telemetry support. Yet fine-grained telemetry analysis, such as REST, is not possible in the CDN/Load Balancer/API Gateway layers. Summary GraphQL telemetry must be handled at the application level, unlike REST, where platform-level monitoring is possible. Rate Limiting Rate limiting can be done at CDN or API Gateway layer for specific REST APIs at the URL level. However, for GraphQL, rate limiting needs to be done based on query depth or complexity by GraphQL-aware Gateways like Apollo/Hasura; This approach is more complicated than REST, and once again, pushes this concern to the application layer. REST APIs allow rate limiting per endpoint (/users limited to 1000 requests per hour), and this can be implemented either at the application layer or the platform layer (CDN, API Gateway). GraphQL requires analysis of query complexity, as a single request can return vast amounts of data. Hence, rate limiting needs to be done based on query depth or complexity by GraphQL-aware Gateways like Apollo/Hasura, thus making it much more complicated than REST Summary GraphQL rate limiting requires more advanced logic and should be implemented at the application layer, where more options are available. Limited Platform Support Cloud and CDN support for GraphQL is getting stronger. However, there are still gaps. For example, AWS API Gateway does not support input validation for HTTP APIs (such as GraphQL) compared to REST APIs. AWS API Gateway does not support Resource policies for HTTP APIs for access control (for example, restrict access by IPs). Summary Platform support for GraphQL is growing but is more advanced for REST Security Poorly designed GraphQL applications can introduce several security vulnerabilities. Schema introspection exposes API details that malicious actors can use to mount attacks. Deep queries or query variations can cause Denial-of-Service (DoS) attacks by bypassing caching. This vulnerability can be mitigated by disabling introspection in Production. Strictly Persisted Queries offer the most secure solution, but introduce a level of coupling between frontend and backend Authorization in REST can be implemented via endpoint-based authorization, and this can be done at the API Gateway or CDN layer. In GraphQL, this must be done at the application level to determine which queries in the request require authorization. Summary REST security can be enforced at platform layers (API Gateways and WAFs), but GraphQL requires resolver-level security and needs more careful design. Error Handling REST APIs return standard HTTP status codes: 200 OK → Success404 Not Found → Invalid request500 Internal Server Error → Server issue GraphQL always returns HTTP 200, even for errors, embedding error messages inside the response body rather than using standard HTTP status codes. This behavior requires additional handling in monitoring tools and client applications, which traditionally rely on HTTP status codes (404 Not Found, 500 Internal Server Error) to determine API failures. Plain Text { “data”: null, “errors”: [ { “message”: “User not found”, “code”: “NOT_FOUND” } ] } Summary Traditional HTTP-based monitoring tools need customization for GraphQL error handling. Conclusion There is no one-size-fits-all framework. GraphQL is a powerful API framework with multiple benefits: Frontend-backend decouplingReduced latency for mobile applications and data-intensive web applicationsResolution of API versioning issues However, using GraphQL removes several of the REST framework’s advantages accrued from using HTTP constructs for caching/telemetry/security, which are typically handled at the platform level — CDN, Load Balancer, API Gateway, or Reverse Proxy — in REST. With GraphQL, these concerns often shift to the application layer, requiring additional effort from developers. Use GraphQL If Your app requires complex or multi-entity data fetchingYou need real-time updatesYou want frontend flexibility & backend decoupling Use REST If You need platform-level caching, telemetry, and securityYou want simpler rate limiting & security enforcementYou require standard HTTP-based error handlingYour app has moderate data requirements Finally, for many projects, GraphQL and REST can coexist, allowing you to leverage the strengths of each.

By Ananth Iyer