Coding

Also known as the build stage of the SDLC, coding focuses on the writing and programming of a system. The Zones in this category take a hands-on approach to equip developers with knowledge of the frameworks, tools, and languages that they can tailor to their own build needs.

Functions of Coding

Frameworks

A framework is a collection of ready-made code and components that developers leverage during the development process. Frameworks establish architectural patterns and structures, which helps speed up development. This Zone contains helpful resources for developers to learn about and further explore popular frameworks such as the Spring framework, Drupal, Angular, Eclipse, and more.

Java

Java is an object-oriented programming language that allows engineers to produce software for multiple platforms. Our resources in this Zone are designed to help engineers with Java program development, Java SDKs, compilers, interpreters, documentation generators, and other tools used to produce a complete application.

JavaScript

JavaScript (JS) is an object-oriented programming language that allows engineers to produce and implement complex features within web browsers. JavaScript is popular because of its versatility and is often the default choice for front-end work unless a project calls for a more specialized language. In this Zone, we provide resources that cover popular JS frameworks, server applications, supported data types, and other useful topics for a front-end engineer.

Languages

Programming languages allow us to communicate with computers, and they operate like sets of instructions. There are numerous types of languages, including procedural, functional, object-oriented, and more. Whether you’re looking to learn a new language or trying to find some tips or tricks, the resources in the Languages Zone will give you all the information you need and more.

Tools

Development and programming tools help engineers build on frameworks and create, debug, and maintain programs — and much more. The resources in this Zone cover topics such as compilers, database management systems, code editors, and other software tools, and can help ensure engineers are writing clean code.

Latest Premium Content
Trend Report
Developer Experience
Trend Report
Low-Code Development
Refcard #216
Java Caching Essentials
Refcard #400
Java Application Containerization and Deployment

DZone's Featured Coding Resources

Unlocking the Potential: Integrating AI-Driven Insights with MuleSoft and AWS for Scalable Enterprise Solutions


By Abhijit Roy
This article explores the transformative potential of integrating artificial intelligence (AI)-driven insights with MuleSoft and AWS platforms to achieve scalable enterprise solutions. This integration promises to enhance enterprise scalability through predictive maintenance, improve data quality through AI-driven data enrichment, and revolutionize customer experiences across industries like healthcare and retail. Furthermore, it emphasizes navigating the balance between centralized and decentralized integration structures and highlights the importance of dismantling data silos to facilitate a more agile and adaptive business environment. Enterprises are encouraged to invest in AI skills and infrastructure to leverage these new capabilities and maintain competitive advantage.

Introduction

Not long ago, I had one of those "aha" moments while working late at our Woodland Hills office. Picture this: I was elbows-deep in the spaghetti of our MuleSoft integrations, and it hit me — what if we could fuse our conventional setup with AI-driven insights to revolutionize our enterprise scalability? As someone who has spent countless hours with MuleSoft and AWS, toggling between Anypoint Platform and cloud paradigms, I realized we were standing on the precipice of something transformative.

The Magic of AI-Augmented Integration Platforms

The trend of merging AI with platforms like MuleSoft is becoming a game-changer. Think about it — self-optimizing integration pipelines that don't just react but predict. AI-driven anomaly detection is no longer a futuristic notion but a present-day reality. A critical takeaway here is that enterprises must shift their focus toward building predictive maintenance into their integration solutions. This isn't just about reducing downtime; it's about reliability, a quality all stakeholders crave.

Here's a personal aside: in one of my projects at TCS, we faced repeated disruptions due to undetected anomalies in our pipeline. After integrating an AI-centric approach using AWS's AI/ML services, we saw a 30% decrease in system alerts. It felt like watching a well-oiled machine where everything just fit. It was hard work getting there, but the reduced manual monitoring was worth every bit of effort.

Centralized Control vs. Decentralized Agility

Let's face it — a debate that's been brewing is centralized versus decentralized integration. I'm of two minds here. Centralized platforms like MuleSoft offer comprehensive control, yet there's a strong argument for decentralized, microservices-led frameworks powered by AI. These can make autonomous decisions at the edge, thus providing agility. In practice, evaluating trade-offs is crucial. During Farmers Insurance projects, we struggled with balancing centralized governance with the nimbleness of decentralized systems — often a tug-of-war. Through trial and error, we realized that a hybrid approach, leveraging MuleSoft for core integrations while empowering microservices with AI-driven intelligence, struck the right chord. The key was not in choosing sides but in finding harmony between the two.

Cross-Industry Applications: Breaking the Mold

AI-driven insights aren't limited to tech giants — they're creeping into retail and healthcare, too. In a recent pilot, we explored using MuleSoft solutions in a healthcare setting, where real-time data processing played a critical role in patient interactions. The challenge was integrating vast datasets, something AI handled adeptly. The result? Improved patient engagement and faster response times. In another example, a retail client used AI integration to enrich customer experiences, from personalized offers to stock predictions. You might say these are exceptions, not the rule, but they demonstrate the potential of cross-industry applications. The lesson here? Look beyond traditional tech spaces for unique use cases and new revenue streams.

AI-Driven Data Enrichment: A Technical Deep Dive

One of the lesser-known but powerful capabilities of AI is data enrichment. Within MuleSoft and AWS environments, machine learning algorithms are at work to refine and enhance data for superior analytics. It's like having a data wizard on your team. In practical terms, we deployed advanced algorithms to improve data quality at Farmers Insurance. The challenge was ensuring seamless integration without disrupting existing architectures — a frequent pain point. This experience taught us the importance of innovative middleware solutions to streamline AI insights integration. The result? Enhanced data accuracy and business intelligence, empowering informed decision-making.

Lessons from the Trenches: Navigating Market Dynamics

Market dynamics are shifting rapidly, but the struggle with siloed data persists. Inefficient integration architectures can be a thorn in the side of digital transformation. Here, AI-driven insights can play a crucial role. In a project where data silos were hindering progress, we revamped our strategy. By prioritizing AI integrations, we dismantled these silos, resulting in a more fluid and flexible system. The critical lesson was understanding that breaking down silos is just as important as building new integrations. A balance of both ensures scalable and adaptive solutions.

Future Horizons: Preparing for the AI Revolution

The enterprise integration landscape is on the cusp of a new era. AI-driven insights will automate decision-making and predictive analytics, fundamentally changing business operations and competitive dynamics. To stay ahead, it's imperative for companies to invest in AI skills and infrastructure. In my own journey, continuous learning and adaptation have been key. Embracing new technologies and methodologies isn't just a requirement — it's an ongoing pursuit of excellence. And yes, I still hit roadblocks. There's always more to learn, more to implement, but that's what makes this field so exciting.

Conclusion: Embracing the Transformation

Integrating AI-driven insights with MuleSoft and AWS opens doors to innovation and competitiveness. As we stand on the verge of this transformation, the opportunities are vast. By focusing on emerging trends, questioning conventions, and exploring new applications, enterprises can unlock unprecedented value. In conclusion, if you're like me, sipping a coffee and wondering how to elevate your integration game, take the leap. Blend AI with your MuleSoft and AWS strategy, embrace imperfections, learn from every hiccup, and watch your enterprise soar to new heights.
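The article describes AI-driven anomaly detection in integration pipelines only at a high level. As a rough, hedged illustration of the idea (not the author's actual implementation, which relied on AWS AI/ML services), the Java sketch below flags pipeline latency samples that drift far from a rolling baseline; the class name, window size, and threshold are hypothetical.

Java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration: flag pipeline latency samples that deviate strongly
// from a rolling baseline. Real deployments would use a managed AI/ML service or
// a trained model rather than this simple z-score check.
public class LatencyAnomalyDetector {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;
    private final double zThreshold;

    public LatencyAnomalyDetector(int windowSize, double zThreshold) {
        this.windowSize = windowSize;
        this.zThreshold = zThreshold;
    }

    // Returns true if the new sample looks anomalous relative to recent history.
    public boolean isAnomalous(double latencyMs) {
        boolean anomalous = false;
        if (window.size() >= windowSize) {
            double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            double variance = window.stream()
                    .mapToDouble(v -> (v - mean) * (v - mean))
                    .average().orElse(0);
            double stdDev = Math.sqrt(variance);
            anomalous = stdDev > 0 && Math.abs(latencyMs - mean) / stdDev > zThreshold;
            window.removeFirst(); // keep the rolling window bounded
        }
        window.addLast(latencyMs);
        return anomalous;
    }

    public static void main(String[] args) {
        LatencyAnomalyDetector detector = new LatencyAnomalyDetector(50, 3.0);
        for (int i = 0; i < 60; i++) {
            detector.isAnomalous(100 + Math.random() * 5); // normal traffic
        }
        System.out.println("Spike flagged: " + detector.isAnomalous(450)); // expected: true
    }
}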
Enterprise Java Applications: A Practical Guide to Securing Enterprise Applications with a Risk-Driven Architecture


By Sravan Reddy Kathi
Enterprise Java applications still serve business-critical processes but are becoming vulnerable to changing security threats and regulatory demands. Traditional compliance-based security methods tend to respond to audits or attacks instead of preventing them. This article introduces a risk-based security architecture that prioritizes protection according to business impact, threat likelihood, and exposure. Threat modeling, dependency risk analysis, and layered security controls help organizations reduce their attack surface proactively without hurting performance or delivery velocity. The strategy is illustrated with real-life enterprise Java examples to make it practical to adopt.

Intended Audience

This article is aimed at enterprise architects, senior Java developers, security architects, and DevSecOps teams who need to design, modernize, or secure large-scale Java applications. In recent years, a number of enterprise breaches were initiated not by a zero-day exploit but by a known vulnerability that had not been prioritized: an outdated library, an open API, or a poorly configured integration. In many of these incidents, the organizations were technically compliant yet still exposed, because homogeneous, checklist-driven security measures did not concentrate on the high-risk elements. This article describes how a risk-based security architecture can help enterprise Java teams make security business-oriented and proactive rather than reactive. Readers will learn how to identify high-impact risks, selectively apply layered security controls, and incorporate security decisions into modernization efforts such as Java version upgrades or framework migrations.

Problem Definition and Motivation

Enterprise Java systems are usually made up of several layers, such as web APIs, messaging systems, databases, identity providers, and external integrations. Treating all of these components as equals from a security standpoint results in inefficient controls and blind spots.

Risk Prioritization Logic

Early decisions can be made with the help of a simple risk matrix:

Likelihood: How easy is exploitation?
Impact: What is the business impact?
Exposure: Does the component face the internet, or is it internal?
Criticality: Does it support revenue, identity, or compliance data?

Prioritizing at this early stage ensures that security effort goes to what matters most.

Threat Modeling in Practice

Threat modeling should be kept practical, not theoretical.

Mini Example: REST API Threat Modeling (STRIDE)

Take the example of an enterprise Java REST API that handles authentication:

Spoofing: Token impersonation through weak validation.
Tampering: Manipulation of payloads on unsecured endpoints.
Repudiation: Missing audit logs for privileged actions.
Information Disclosure: Overly verbose error messages.
Denial of Service: Unthrottled endpoints.
Elevation of Privilege: Role bypass through improperly configured filters.

This analysis identifies the endpoints that need stronger authentication, logging, and rate limiting, before an incident occurs. This workflow is presented in Figure 1, where assets, threats, and mitigations are prioritized in a systematic manner.

Figure 1: The risk-based security threat-modeling workflow.

The audit helps determine which modules should be covered immediately and which ones can be covered progressively.
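The STRIDE walkthrough above flags unthrottled endpoints as a denial-of-service risk and calls for rate limiting in front of the authentication API. As a minimal sketch only (assuming a Jakarta Servlet stack; the class name, window, and limit are illustrative assumptions, not values from the article), a per-client request ceiling can be enforced in a filter before requests reach the endpoint:

Java
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: cap requests per client IP within a fixed one-minute window.
// Production systems would more likely rely on an API gateway or a token-bucket
// library, but the shape of the control is the same.
public class RateLimitFilter implements Filter {

    private static final int MAX_REQUESTS_PER_MINUTE = 100; // illustrative limit
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();
    private volatile long windowStart = System.currentTimeMillis();

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        long now = System.currentTimeMillis();
        if (now - windowStart > 60_000) {   // start a new counting window every minute
            counters.clear();
            windowStart = now;
        }
        String clientIp = request.getRemoteAddr();
        int count = counters.computeIfAbsent(clientIp, ip -> new AtomicInteger()).incrementAndGet();
        if (count > MAX_REQUESTS_PER_MINUTE) {
            ((HttpServletResponse) response).sendError(429, "Too many requests");
            return;                         // drop the request instead of forwarding it
        }
        chain.doFilter(request, response);  // within budget: continue the filter chain
    }
}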
Dependency and Vulnerability Management

Enterprise Java applications depend heavily on third-party libraries, so dependency risk management is essential.

Practical Prioritization Rule

Rather than reacting to every CVE:

Priority Score = CVSS × Exposure × Business Criticality

A medium-CVSS vulnerability in an external authentication component can be more pressing than a critical CVE in an internal utility library. This eliminates churn from unnecessary upgrades and focuses effort on actual threats. Table 1 summarizes how enterprise teams can categorize dependencies and apply targeted remediation.

Table 1: Enterprise Java systems risk components and recommendations.

Layered Security Controls

A risk-driven architecture applies selective controls at each layer:

Application layer: Token hardening, secure authentication, input validation.
Integration layer: OAuth2, mTLS, API gateway enforcement.
Data layer: Data masking, access control, encryption.

Figure 2 illustrates how these controls map to architectural layers and where high-risk areas warrant stricter enforcement.

Figure 2: Application, integration, and data layers of security controls.

Controls can be stricter for high-risk modules and lighter for low-risk areas, without sacrificing performance or security.

Continuous Monitoring and Response

Implementing risk-driven security is not a one-time effort. High-risk components should have:

Selective logging and monitoring.
Real-time alerting.
Clear incident response playbooks.

This ensures early detection and rapid response, especially for authentication flows and external integrations.

Case Study: Java-Based Legacy SAP-Integrated System Modernization

During a Java 8 to Java 17 migration, a large enterprise found:

ActiveMQ endpoints that could be intercepted.
Obsolete Jersey APIs with no input validation.
Reporting modules that are performance sensitive.

Instead of applying the same level of controls everywhere, the team addressed the high-risk parts first. Figure 3 shows the modernization roadmap, from audit to monitoring.

Figure 3: Implementation plan for risk-driven security methodologies.

The outcome was a secure modernization with little performance impact and shorter delivery schedules.

What to Do Next: Action Checklist

List all enterprise Java components and integrations.
Identify risky assets with a simple risk matrix.
Run lightweight threat modeling on critical APIs.
Rank dependencies based on CVSS, exposure, and business impact.
Apply layered security where the risk is greatest.
Introduce monitoring for high-risk workflows.
Reevaluate risks when upgrading or changing the architecture.

Conclusion

A risk-based security architecture helps enterprise Java teams shift from compliance-driven effort to proactive, business-oriented protection. Through prioritization of risks, intelligent application of layered controls, and integration of security into modernization initiatives, organizations can significantly reduce exposure without compromising performance or agility. On this path, security becomes not only a defensive requirement but also a strategic enabler of enterprise systems.
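To make the prioritization rule from the dependency-management section above concrete, here is a small, hedged Java sketch of Priority Score = CVSS × Exposure × Business Criticality. The 1-3 exposure and criticality scale, the record fields, and the sample dependencies are assumptions for illustration, not values given in the article.

Java
import java.util.Comparator;
import java.util.List;

// Illustrative only: score dependencies so that patching effort goes to the
// highest-risk components first. The 1-3 scale is an assumption.
public class DependencyRiskRanker {

    record Dependency(String name, double cvss, int exposure, int criticality) {
        double priorityScore() {
            return cvss * exposure * criticality; // Priority = CVSS x Exposure x Business Criticality
        }
    }

    public static void main(String[] args) {
        List<Dependency> deps = List.of(
            // medium CVSS, but internet-facing and identity-critical
            new Dependency("auth-client", 5.5, 3, 3),
            // critical CVSS, but internal-only utility library
            new Dependency("internal-utils", 9.1, 1, 1)
        );

        deps.stream()
            .sorted(Comparator.comparingDouble(Dependency::priorityScore).reversed())
            .forEach(d -> System.out.printf("%s -> %.1f%n", d.name(), d.priorityScore()));
        // auth-client (49.5) outranks internal-utils (9.1), matching the article's point
        // that exposure and business impact can outweigh raw CVSS severity.
    }
}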
Why Queues Don’t Fix Scaling Problems
By David Iyanu Jonathan
MCP + AWS AgentCore: Give Your AI Agent Real Tools in 60 Minutes
By Jubin Abhishek Soni
How We Reduced LCP by 75% in a Production React App
By Satyam Nikhra
Mastering Multi-Cloud Integration: SAFe 5.0, MuleSoft, and AWS - A Personal Journey

The article explores the journey of multi-cloud integration through the lens of personal experience, focusing on integrating MuleSoft and AWS using SAFe 5.0 principles. It begins by outlining the necessity of multi-cloud solutions in today's digitally connected world, highlighting challenges such as security and vendor lock-in. The author discusses overcoming these challenges by employing SAFe 5.0's modular designs and integrating AI services like AWS SageMaker with MuleSoft for real-time decision-making. The article also emphasizes the importance of comprehensive training and cross-functional collaboration to bridge skills gaps. A real-world case study illustrates the approach's success in reducing latency for an e-commerce giant. The conclusion stresses continuous learning and aligning technical initiatives with business objectives as keys to leveraging multi-cloud environments.

Introduction

I still remember the first time I heard the term "multi-cloud integration." It was during a client meeting at Tata Consultancy Services in 2014. Fresh-faced and eager, I couldn't fathom the complexities that lay ahead. Fast forward to today, I find myself at the heart of pioneering integrations leveraging SAFe 5.0 principles with MuleSoft and AWS — a journey full of insights, occasional blunders, and numerous successes. Let's dive into this strategic blueprint, which modern enterprises can adopt for optimizing their multi-cloud strategies.

Embracing the Multi-Cloud Revolution

In today's digitally connected world, multi-cloud solutions are more of a necessity than an option. From banking to retail, industries are transitioning to multi-cloud environments to harness flexibility, scalability, and redundancy. But with great power comes great responsibility, especially when it comes to security and governance.

Emerging Trends: Security and Governance at the Forefront

The financial sector, often risk-averse, has been a significant adopter of MuleSoft and AWS for real-time data processing. I recall a project where we integrated real-time transaction data across several cloud environments for a leading bank. We utilized AWS Lambda for automated validations, ensuring compliance across different jurisdictions — a crucial step in maintaining data integrity and security.

Personal Insight: During our deployment, we found that while AWS and MuleSoft offer robust frameworks for security, the challenge lay in integrating these seamlessly. Detailed planning and understanding of each platform's native capabilities were vital. My advice? Never underestimate the power of thorough documentation and the importance of a well-documented API architecture.

The Contrarian View: The Vendor Lock-in Debate

Many advocate that multi-cloud strategies eliminate vendor lock-in. Yet, as someone who's navigated these waters, I challenge this notion. The intricacies of integration can often weave a web of dependencies, especially when working with MuleSoft and AWS.

Solving the Dependency Puzzle with SAFe 5.0

One strategy we've employed is designing modular and agnostic solutions. Utilizing SAFe 5.0's modular design principles, we ensure our integrations are flexible and can pivot with changing vendor landscapes. In a recent project at a healthcare firm, we leveraged MuleSoft's Anypoint Platform to create a loosely coupled architecture, enabling easy transitions between cloud providers.

Lesson Learned: Over-engineering for flexibility can be a pitfall, adding unnecessary complexity. It's about striking a balance — focusing on critical services that need agility while ensuring core systems remain stable and robust.

Surviving the Technical Trenches: AWS AI and MuleSoft

Integrating AI services like AWS SageMaker with MuleSoft has been a game-changer, enabling real-time intelligent decision-making. For instance, in a retail analytics project, we created custom connectors in MuleSoft for seamless data flow into SageMaker, enhancing predictive analytics and improving customer personalization.

Technical Deep-Dive: Crafting Custom Connectors

Creating these connectors isn't just about linking systems; it's about understanding the data lifecycle and business objectives. We encountered challenges with data latency and consistency, but by iterating our API definitions and leveraging AWS's data pipeline services, we achieved near-instantaneous data processing — a key success metric in that project.

Behind the Scenes: Engaging with MuleSoft's C4E team was instrumental in overcoming integration roadblocks. If there's one thing I've learned, it's that community collaboration often yields the most innovative solutions.

Bridging the Skill Gap with SAFe 5.0

Despite its many benefits, the learning curve for integrating MuleSoft and AWS using SAFe 5.0 principles is steep. Here's what worked for us:

Comprehensive training programs: We developed focused training sessions highlighting SAFe 5.0 frameworks and contextualizing them within our projects. This approach demystified complex topics and empowered our teams to innovate confidently.
Cross-functional collaboration: By facilitating dialogue across departments — from developers to QA teams — we fostered a culture of shared knowledge and innovation. This collaborative ethos became a bedrock for overcoming integration hurdles.

Real-World Implementation: A Case Study

Last year, we spearheaded an integration initiative for an e-commerce giant aiming to reduce latency in order processing. Utilizing AWS Outposts and Local Zones, paired with MuleSoft's capabilities, we achieved remarkable results.

Concrete Example: We reduced latency by 40%, improving customer satisfaction scores by a significant margin. The key was aligning technical prowess with business goals — something SAFe 5.0 principles advocate strongly.

Actionable Takeaway: Always align technical initiatives with overarching business objectives. It's not just about the technology; it's about driving tangible business outcomes.

Conclusion: The Road Ahead

The integration of MuleSoft with AWS, underpinned by SAFe 5.0 principles, offers a robust framework for tackling modern multi-cloud challenges. As we look to the future, the demand for hybrid solutions with integrated AI capabilities will only grow.

Final Thought: If there's one piece of advice I'd impart, it is to never stop learning. The technology landscape is ever-evolving, and staying curious ensures we remain at the forefront of innovation. As I share these hard-won insights over a metaphorical cup of coffee, I hope they serve as a guide for your own multi-cloud journey. Let's embrace the complexities with enthusiasm and turn challenges into opportunities for growth.

By Abhijit Roy
Run AI Agents Safely With Docker Sandboxes: A Complete Walkthrough

There are days when I want an agent to work on a project, run commands, install packages, and poke around a repo without getting anywhere near the rest of my machine. That is exactly why Docker Sandboxes clicked for me. The nice part is that the setup is not complicated. You install the CLI, sign in once, choose a network policy, and launch a sandbox from your project folder. After that, you can list it, stop it, reconnect to it, or remove it when you are done. In this post, I am keeping the focus narrow on purpose: set up Docker Sandboxes, run one against a local project, understand the few commands that matter, and avoid the mistakes that usually slow people down on day one.

What Are Docker Sandboxes?

Docker Sandboxes give you an isolated environment for coding agents. Each sandbox runs inside its own microVM and gets its own filesystem, network, and Docker daemon. The simple way to think about it is this: the agent gets a workspace to do real work, but it does not get free access to your whole laptop. That is the reason this feature is interesting. You can let an agent install packages, edit files, run builds, and even run Docker commands inside the sandbox without turning your host machine into the experiment.

Before You Start

You do not need a big lab setup to try this, but you do need:

A macOS or Windows machine
The Windows "Hypervisor Platform" feature enabled (Windows only)
The Docker sbx CLI installed
An API key or authentication for the agent you want to use

If you start with the built-in shell agent, Docker sign-in is enough for your first walkthrough. If you want to start with claude, copilot, codex, gemini, or another coding agent, make sure you also have that agent's authentication ready. If you are on Windows, make sure Windows Hypervisor Platform is enabled first:

PowerShell
Enable-WindowsOptionalFeature -Online -FeatureName HypervisorPlatform -All

If Windows asks for a restart, do that before moving on.

Note: Docker documents the getting-started flow with the sbx CLI. There is also a docker sandbox command family, but sbx is the cleanest way to get started, so that is what I am using in this walkthrough.

Step 1: Install the Docker Sandboxes CLI

On Windows:

PowerShell
winget install -h Docker.sbx

On macOS:

Shell
brew install docker/tap/sbx

That is it for installation. If sbx is not recognized immediately after install, open a new terminal window and try again. I hit that once on Windows after installation, and a fresh terminal fixed it.

Note: Docker Desktop is not required for sbx.

Step 2: Sign In

Now sign in once:

PowerShell
sbx login

This opens the Docker sign-in flow in your browser. During login, Docker asks you to choose a default network policy for your sandboxes:

Open – Everything is allowed
Balanced – Common development traffic is allowed, but it is more controlled
Locked down – Everything is blocked unless you explicitly allow it

If you are just getting started, pick Balanced. That is the easiest choice for a first run because it usually works without making the sandbox too open.

Step 3: Pick a Small Project Folder

You can use an existing project folder, or create a tiny test folder just for this walkthrough. For example:

PowerShell
mkdir hello-sandbox
cd hello-sandbox

If you want, drop a file into it so you have something visible inside the sandbox:

PowerShell
echo "# hello-sandbox" > README.md

Nothing fancy is needed here. The goal is just to have a folder you are comfortable letting the agent work in.

Step 4: Run Your First Sandbox

Here is the command that matters most:

PowerShell
sbx run shell .

Figure 1.1: Shows how to create a new sandbox using the sbx command

What this does:

Starts a sandbox for the shell agent
Mounts your current folder into the sandbox
Opens an isolated environment where the agent can work on that folder

If you prefer naming your sandbox from the start, use:

PowerShell
sbx run --name my-first-sandbox shell .

On the first run, Docker may take a little longer because it needs to pull the agent image. That is normal. Later runs are much faster. I like starting with shell because it is the easiest way to prove the sandbox is working before you bring an actual coding agent into the mix. Once that works, replace shell with the agent you actually want to use, such as claude, copilot, codex, gemini, or another supported agent from the Docker docs.

Step 5: See What Is Running

To check your active sandboxes, run:

PowerShell
sbx ls

You should see output with a name, status, and uptime. This is a handy command because once you start using sandboxes regularly, it becomes the quickest way to see what is still running and what needs cleanup.

Figure 1.2: Shows how to verify the list of all active sandboxes running on the machine

Step 6: Switch to a Real Coding Agent

Once you have proved the sandbox works with shell, move to the coding agent you actually want to use. For example:

PowerShell
sbx run copilot

Figure 1.3: Shows how to run the Copilot agent in a Docker sandbox

or

PowerShell
sbx run gemini

Figure 1.4: Shows how to run the Gemini agent in a Docker sandbox

The workflow is the same as shell. The only thing that changes is the agent inside the sandbox. If the agent needs its own provider login or API key, complete that setup and then continue. The important point is that the agent is still running inside the sandbox, not directly on your host machine.

Step 7: Stop the Sandbox When You Are Done

When you are finished using the sandbox, you can stop it by running the command below:

PowerShell
sbx stop copilot-dockersandboxtest

If you don't remember the name, run sbx ls first to see all the active sandboxes. Stopping is useful when you want to pause work without removing the sandbox immediately.

Step 8: Remove the Sandbox When You No Longer Need It

When you are done for good, you can remove it by running the command below:

PowerShell
sbx rm copilot-dockersandboxtest

Or remove all sandboxes by simply passing the --all flag as shown below:

PowerShell
sbx rm --all

Figure 1.5: Removing all sandboxes using the sbx rm --all command

Step 9: Use YOLO Mode Safely

Now for the newer idea Docker has just announced, which is YOLO mode. If you want to read more about it, refer to Docker's recent blog post, which is worth bookmarking: Docker Sandboxes: Run Agents in YOLO Mode, Safely. In simple terms, YOLO mode means letting a coding agent work with fewer interruptions and fewer approval prompts. That can save time, but it only makes sense when the agent is already inside a sandbox.

Note: I would not start with YOLO mode on day one. I would start with a normal sandbox run, get comfortable with the lifecycle first, and only then try YOLO mode.

Conclusion

This article explains Docker Sandboxes and provides step-by-step instructions for getting started. What I like about Docker Sandboxes is that they remove a lot of friction from a very real problem. Sometimes you want an agent to have freedom, but not too much freedom. You want it to run commands, inspect files, and do useful work, but you also want a clear boundary around that work. That is the sweet spot Docker Sandboxes are aiming for. If you are curious about them, my advice is simple: do not start with a giant repo or a complicated setup. Pick one small folder, use the Balanced policy first, run a single sandbox, and get comfortable with the basic lifecycle. Once that clicks, the rest feels much easier, including working in YOLO mode.

By Naga Santhosh Reddy Vootukuri
TOP-5 Lightweight Linux Distributions for Container Base Images

The base Linux distribution we choose for building our container images affects the whole container stack: image size, performance, CVE exposure, patch cadence, debugging, maintainability. This is why going for some random base that 'just works' is not an option. Luckily, there are multiple good options on the market for various use cases and business needs. This guide provides a summary of the top five lightweight Linux distributions chosen for their production relevance: small, container-focused, actively maintained, and chosen by developers. The summary is based on criteria important for production, such as footprint, libc variant, licensing, security features, and support. Note that this article is not a best-to-worst ranking; the distros are listed alphabetically, with the most popular one opening the list. These distributions are built for different goals, teams, and risk profiles. Our goal here is to provide a data-based comparison using information available from vendor documentation, official websites, and container registries. The point is to help you make an informed decision for your own use case, not to crown a universal winner.

Alpine Linux

Alpine Linux is the first distribution that comes to mind when one says 'a lightweight base for containers'. It is minimalistic, clean, simple, and very common in Dockerfiles. It doesn't include any unnecessary packages and uses:

musl libc instead of glibc, unlike most other distributions. Contrary to glibc, musl was developed with a minimalistic design in mind, so it has smaller static and dynamic overhead.
BusyBox instead of GNU core utilities. BusyBox is a set of command-line Unix utilities with a size of about 1 MB, which means that distributions based on BusyBox consume much less memory.
The small and modular OpenRC init system instead of systemd.
Alpine Package Keeper, or apk, as a package manager, which is smaller than yum/rpm or deb/apt.

All of that contributes to Alpine's miniature size — the compressed image size of Alpine on Docker Hub is less than 4 megabytes. At the same time, if you need extra packages, you can add them from the repo. As far as security is concerned, Alpine was designed with security in mind. The lack of extra packages reduces the attack surface. Plus, there are additional security features, such as compiling userland binaries as Position Independent Executables (PIE) with stack-smashing protection. Alpine is 100% free and community-based. There's no single distro-wide EULA, and the package licenses vary and must be checked per package. The Alpine team does not provide enterprise support for Alpine, but it is available from third-party vendors as part of their commercial offerings. As for releases, Alpine has a predictable rhythm. The stable branches are released twice a year, in May and November. There's no vendor "LTS" program in the enterprise sense, but the main repository is generally supported for about two years. Ironically, its drawbacks come from its strong sides. The musl libc may have inferior performance compared to glibc for some workloads, especially Java-based ones. Some teams may experience compatibility issues when migrating their container images to a musl-based distribution. In addition, the lack of dedicated support from the project team may be unsuitable for enterprises looking for strict SLAs for patches and fixes.

Alpaquita Linux

Alpaquita Linux is developed and supported by BellSoft. Like Alpine, it was designed to be minimalistic, efficient, and secure. At the same time, its goal is to close the gap between open-source lightweight images and enterprise expectations. Alpaquita also includes only essential packages and uses BusyBox, OpenRC, and apk. But as for libc, it offers two flavors — glibc and musl perf, with performance equal or superior to glibc depending on the workload. The choice enables teams to leverage musl efficiency without impacts on performance, or to stay on glibc and still get the reduced footprint. The Alpaquita musl images on Docker Hub are less than four megabytes; the glibc ones are about nine megabytes. Although Alpaquita Linux is compatible with various runtimes and offers ready images for Java, Python, and C++, its main strength is in the Java realm. Alpaquita integrates seamlessly with BellSoft's other products for Java development, Liberica JDK and Liberica Native Image Kit, and helps to reduce the RAM consumption of Java applications by up to 30%. Alpaquita-based buildpacks for Java are also available. As for security, Alpaquita has additional features such as kernel hardening. There is also a set of hardened images with a minimized attack surface, provenance data, and an SLA for patches for both the OS and the runtime from one team. From a maintenance perspective, Alpaquita comes in Stream, which is a rolling, continuously updated release, and LTS with four years of support. The distribution is open source, free to use, and covered by a EULA. Commercial support is also available from the BellSoft team. The drawback might be the limited choice of packages in the repository.

Chiseled Ubuntu

Chiseled Ubuntu is Canonical's way to take the best of two worlds. It is almost a distroless base image stripped down to the essentials, but still the well-known and beloved Ubuntu distribution with a broad ecosystem, release roadmap, and LTS. With the tool called chisel, one can cut out a custom base image with only those packages required for the application to run. Canonical's documentation and official images emphasize that chiseled images often include no shell and no package manager in the final runtime image, which contributes to a minimized attack surface. The final images can be about 5–6 megabytes in size, depending on the runtime stack you target. Because it is Ubuntu-based, the distribution uses glibc and enjoys Ubuntu's broad compatibility. Chiseled Ubuntu is open source; the images are built from Ubuntu packages, so the contents are mostly open source packages under their respective licenses. Commercial support is available from Canonical, which might be appealing to teams that want a familiar ecosystem, a minimal image, and enterprise support. Like with Alpine, Chiseled Ubuntu's drawback comes from its strong side. To get a custom image, you need to cut out the OS yourself using the dedicated tool, as there are no ready-to-use images. If the application changes, you may need to repeat the process.

RHEL UBI Micro

RHEL UBI Micro is Red Hat's base image with a compressed size of about 10 MB. The image is part of the RHEL UBI family, so it is RHEL as you know it: glibc-based and seamlessly compatible with Red Hat's infrastructure. But like Chiseled Ubuntu, UBI "micro" images are stripped down and contain only essential packages for running the application in a container. The images are updated regularly, and LTS releases are based on the RHEL lifecycle model. Licensing might be an important nuance here. UBI images are described as freely redistributable, but under the UBI EULA, and support is part of Red Hat's subscription ecosystem. In practice, teams may want to pick UBI Micro when they want the Red Hat supply chain and vendor alignment.

Wolfi

Wolfi is maintained by Chainguard. It is a container-first Linux "un-distro," as the vendor calls it, which was designed around modern supply-chain security needs and focuses on factors like provenance, SBOMs, and signing. A typical compressed image size for Wolfi is around 5 to 7 MB, depending on the architecture. It uses apk like Alpine, but unlike Alpine, it is based on glibc. That makes Wolfi a good option when you want minimal images without the surprises of the default musl implementation. Wolfi is the base on which Chainguard OS is built and is used in Chainguard Containers — distroless images that are rebuilt daily and come with comprehensive provenance data. Wolfi's releases are rolling. The emphasis is on fast package updates rather than versioned distribution releases. Chainguard documentation states that the images are rebuilt on a frequent schedule, commonly daily or nightly. On the other hand, there isn't an LTS concept the way you'd see with a vendor enterprise distro. Wolfi is open source and freely available under the Apache License 2.0. Commercially, Chainguard has a paid offering around hardened "production" images with support commitments and patch SLAs. The caveat is the trade-off you may get with rolling updates. You get fresh images, but you should invest in reproducibility and pinning if you want stable deployments.

Conclusion: Factors to Consider When Selecting a Linux Base Image

To sum up, there is no single best Linux distribution for container images, only various options for different teams, workloads, and constraints. Some prioritize small size and simplicity. Others need compatibility with their existing infrastructure. For some, enterprise support matters the most. So, comparing Linux distributions by size alone does not cover the broader picture of business requirements. When choosing a base Linux distro for containers, teams should pay attention to the following factors:

The libc implementation. Selecting between musl and glibc is a big decision point that may influence performance for better or worse, or cause compatibility problems.
Update model and release cadence. Rolling vs. stable vs. LTS influences the way teams patch, test, and update images. You need to decide whether you need maximum freshness or a more predictable lifecycle.
Security posture. Look at attack surface reduction, patch cadence, hardened versions, and supply chain features such as provenance, signing, and SBOMs.
Licensing. Some options are community distributions, while others are vendor-distributed images under EULAs. That may matter for compliance and internal policy reviews.
Support. Decide whether you need vendor-backed support, can do with third-party support, or require no commercial support at all. This is often determined by organizational requirements.
Ecosystem fit. The most suitable base image is usually the one that fits your CI/CD, scanning tools, and compliance requirements.

In short, choosing a base Linux distro is a platform decision. The right choice is the one that aligns with your application's compatibility needs, your team's operational model, and your organization's security and compliance requirements.

By Catherine Edelveis
Docker Secrets Management: From Development to Production

Most Docker tutorials show secrets passed as environment variables. It's convenient, works everywhere, and feels simple. It's also fundamentally insecure. Environment variables are visible to any process running inside the container. They appear in docker inspect output, accessible to anyone with Docker socket access. Debugging tools log them. Child processes inherit them. And in many logging frameworks, they get written to log files where they persist indefinitely. Consider this common pattern:

Shell
docker run -e DATABASE_PASSWORD=SuperSecret123 myapp

That password is now:

Visible in docker inspect myapp
Readable by any process in the container via /proc/1/environ
Inherited by every subprocess spawned by the application
Potentially logged by the application's error handling
Available to anyone with read access to the Docker socket

Screenshot of docker inspect showing environment variables with secrets visible

This is not theoretical. In production pharmaceutical environments managing patient data under HIPAA, environment variable leakage through log aggregation systems has triggered compliance violations.

Docker Swarm Secrets: The Native Solution

Docker Swarm includes built-in secret management that addresses the environment variable problem through encryption and in-memory delivery.

How Swarm Secrets Work

When you create a secret in Swarm, the secret value is encrypted and stored in Swarm's distributed state (backed by Raft consensus). The secret is only decrypted on nodes running services that explicitly declare they need it. On those nodes, secrets are mounted as files in an in-memory tmpfs filesystem at /run/secrets/. This means:

Encrypted at rest: Secrets are encrypted in Swarm's internal database
Encrypted in transit: Secrets are transmitted over TLS between Swarm nodes
Never written to disk: Secrets exist only in memory via tmpfs
Scoped access: Only containers declaring the secret can read it
No inspect visibility: docker inspect shows secret names, not values

Important security note: While Swarm secrets are encrypted at rest, the encryption keys are managed by the Swarm itself and reside in manager node memory. This means an attacker with privileged access to a manager node could theoretically access them. However, this is still a massive improvement over environment variables, which are exposed at the filesystem and process level on every worker node.

Example usage:

Shell
# Create a secret
echo "SuperSecret123" | docker secret create db_password -

# Deploy a service using the secret
docker service create \
  --name api \
  --secret db_password \
  myapp:latest

# Inside the container
cat /run/secrets/db_password
# SuperSecret123

# From the host
docker inspect api

Terminal screenshot showing secret mounted at /run/secrets/ with permissions 400

File permissions: The secret file is mounted with 400 permissions (read-only, owner-only) and owned by root. This means only the container's root user — or a process that has dropped privileges after reading — can access it. If your application runs as a non-root user (best practice), you'll need to read the secret during initialization while still running as root, then drop privileges.

Screenshot of docker inspect output showing SecretName but no SecretValue

Production reality: In pharmaceutical cluster environments, Swarm secrets enable compliance with data protection requirements by ensuring database credentials are never written to disk and are only accessible to explicitly authorized services.
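The article contrasts the /run/secrets file mount with environment variables. As a minimal sketch only (the class name, secret id, and fallback behavior are assumptions, not from the article), an application can read the Swarm-mounted secret file at startup and fall back to an environment variable solely for local development:

Java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: prefer the tmpfs-mounted Swarm secret; fall back to an env var
// only for local development where no secret file is mounted.
public final class SecretLoader {

    public static String load(String name) {
        Path secretFile = Path.of("/run/secrets", name);    // e.g. /run/secrets/db_password
        if (Files.isReadable(secretFile)) {
            try {
                return Files.readString(secretFile).trim();  // strip the trailing newline
            } catch (IOException e) {
                throw new IllegalStateException("Failed to read secret " + name, e);
            }
        }
        String fromEnv = System.getenv(name.toUpperCase());  // dev-only fallback, e.g. DB_PASSWORD
        if (fromEnv == null || fromEnv.isBlank()) {
            throw new IllegalStateException("Secret " + name + " not provided");
        }
        return fromEnv;
    }

    public static void main(String[] args) {
        String dbPassword = load("db_password");
        System.out.println("Loaded secret of length " + dbPassword.length()); // never log the value itself
    }
}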
When Swarm Secrets Are Enough

Swarm secrets work well for:

Single-platform Docker deployments (not mixing VMs and containers)
Static secrets that change infrequently (manual rotation is acceptable)
Environments where Vault's operational complexity isn't justified
Simple microservice architectures where each service needs 2-5 secrets

Swarm secrets are Docker-native, require no external dependencies, and work on single-node "Swarms" (you can run docker swarm init on a single host to get secret management without clustering).

HashiCorp Vault: When You Need More

Vault is an external secret manager that adds capabilities Swarm secrets don't have: dynamic secret generation, automatic rotation, fine-grained access policies, and audit logging.

Dynamic Secrets: The Key Differentiator

The most powerful Vault feature is dynamic secrets. Instead of storing a static database password, Vault generates temporary credentials on demand that expire automatically.

Traditional approach, a static password stored in Vault:

Shell
vault kv put secret/db password=SuperSecret123

Dynamic approach, Vault generates temporary credentials:

Shell
vault read database/creds/app-role
# Returns:
# username: v-token-app-role-8h3k2j
# password: A1Bb2Cc3Dd4Ee5Ff (auto-generated)
# lease_duration: 3600 (expires in 1 hour)

Terminal output showing Vault returning temporary username/password with lease_duration

When the application requests database credentials from Vault, Vault connects to the database and creates a temporary user with the exact permissions the application needs. That user exists for a limited time (configurable, typically 1-24 hours), then Vault automatically revokes it. This solves two problems:

Credential sprawl: No static password shared across environments
Blast radius: Compromised credentials expire automatically

Audit Logging for Compliance

Vault logs every secret access. This is required for SOC 2 Type II and PCI DSS compliance, where auditors need proof of who accessed which secrets when. Example Vault audit log entry:

JSON
{
  "time": "2026-03-30T19:45:12Z",
  "type": "response",
  "auth": {
    "token_type": "service",
    "entity_id": "api-service"
  },
  "request": {
    "path": "database/creds/app-role"
  },
  "response": {
    "secret": true
  }
}

Vault audit log showing timestamp, entity_id, request path, and response metadata

Every access is logged with timestamps, the requesting identity, and the secret path. This log is write-only (even Vault admins can't modify it) and can be exported to SIEM systems.

When Vault Is Justified

Use Vault when:

You need dynamic database credentials (most important use case)
Compliance requires audit trails (SOC 2, PCI DSS, HIPAA)
You're managing secrets across multiple platforms (Docker + VMs + Kubernetes)
Automated secret rotation is required
You have dedicated operations staff to maintain Vault infrastructure

Vault's operational complexity is real. It requires:

High-availability deployment (3+ nodes)
Secure initialization and unsealing procedures
TLS certificate management
Backup and disaster recovery planning
Access policy maintenance

For a 5-person startup, this overhead usually isn't justified. For Fortune 500 pharmaceutical operations managing hundreds of microservices accessing regulated data stores, it's mandatory infrastructure.

BuildKit Secret Mounts: Build-Time Security

Build-time secrets are different. You need credentials during docker build to access private npm registries, clone private git repos, or download proprietary dependencies. These secrets should never persist in the final image. BuildKit secret mounts solve this. BuildKit has been the default builder since Docker Engine 23.0, so if you're on a modern Docker version, you already have this capability — no special flags or setup required.

Dockerfile:

Dockerfile
FROM node:18.20.5-alpine3.20
WORKDIR /app
COPY package*.json ./
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm install --only=production && \
    npm cache clean --force
COPY app.js ./
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001 && \
    chown -R nodejs:nodejs /app
USER nodejs
CMD ["node", "app.js"]

Build the image with the secret:

Shell
docker build --secret id=npmrc,src=$HOME/.npmrc -t myapp .

The .npmrc file is available inside the container during npm install, but it's not written to any image layer. It's not in the final image. It's not in docker history. It existed only for the duration of that one RUN instruction.

Diagram showing BuildKit secret mount lifecycle - secret available during RUN, then immediately discarded

Why BuildKit secrets matter: Before BuildKit secrets, developers used ARG or multi-stage builds with complex cleanup scripts. Both leaked secrets into intermediate layers visible in docker history. BuildKit secrets are ephemeral by design — they can't leak because they never persist.

Common Build-Time Secret Patterns

Private npm/pip registries:

Dockerfile
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm install

SSH keys for private git repos:

Dockerfile
RUN --mount=type=secret,id=ssh_key,target=/tmp/key \
    cp /tmp/key /root/.ssh/id_rsa && \
    chmod 600 /root/.ssh/id_rsa && \
    git clone git@github.com:company/private-repo.git && \
    rm /root/.ssh/id_rsa

API tokens for downloading artifacts:

Dockerfile
RUN --mount=type=secret,id=api_token \
    TOKEN=$(cat /run/secrets/api_token) && \
    curl -H "Authorization: Bearer $TOKEN" \
    https://api.company.com/artifact.tar.gz -o /tmp/artifact.tar.gz

Secret Scanning: Prevention Layer

Despite proper secret management, developers still accidentally commit secrets. GitLeaks and similar tools scan repositories for patterns matching credentials.

Shell
# Scan current repository
docker run -v $(pwd):/path zricethezav/gitleaks:latest \
    detect --source /path --verbose

GitLeaks terminal output showing detected AWS key and GitHub token with file paths and line numbers

GitLeaks detects:

AWS keys (AKIA...)
GitHub tokens (ghp_...)
Stripe keys (sk_live_...)
Private keys (-----BEGIN PRIVATE KEY-----)
Database connection strings
High-entropy strings (potential secrets)

Prevention via Pre-Commit Hooks

The most effective scanning happens before commit.

.pre-commit-config.yaml:

YAML
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0
    hooks:
      - id: gitleaks

Install the hook:

Shell
pre-commit install
# Now every git commit runs GitLeaks first
git commit -m "Add config"
# GitLeaks scan...
# ERROR: Secret detected in config.yml

Terminal showing GitLeaks blocking a commit with "ERROR: Secret detected in config.yml"

Pre-commit hooks prevent secrets from entering git history. CI/CD scanning catches what pre-commit missed. Together, they create defense in depth.

Critical: Secrets in Git Are Permanent

Even after deleting a file containing secrets, those secrets remain in git history indefinitely. The only remediation is to rotate the secret (assume it's compromised) and optionally rewrite history with git filter-branch or BFG Repo-Cleaner.

Layered Approach for Production

Production environments don't choose one solution. They layer multiple approaches:

Build-time secrets (npm, SSH): BuildKit mounts. Ephemeral, can't leak into the image.
Simple service secrets: Docker Swarm secrets. Native, encrypted, no external dependencies.
Database credentials: Vault dynamic secrets. Auto-expiring, with an audit trail.
Compliance-regulated secrets: Vault plus audit logs. SOC 2 and PCI DSS requirements.
Detection: GitLeaks plus pre-commit hooks. Prevent accidents.

Architecture diagram showing layered secrets approach - BuildKit for builds, Swarm for simple secrets, Vault for DB, GitLeaks for prevention

Example architecture for a pharmaceutical application:

CI/CD pipeline: BuildKit mounts for private npm registry access
API service: Swarm secret for JWT signing key (static, rotated quarterly)
Database access: Vault dynamic credentials (expire every 4 hours, audit logged)
Pre-commit hooks: GitLeaks scanning on every developer commit
CI/CD gates: Automated GitLeaks scan on every pull request

Key Takeaways

Environment variables are not secrets. They're visible to any process, appear in docker inspect, and get logged. Use them for configuration, not credentials.

Swarm secrets are underutilized. Most teams don't realize Docker has native secret management that works on single nodes. No Vault complexity is required for simple use cases.

Vault's value is dynamic secrets. Static secret storage is a nice feature. Dynamic database credentials that auto-expire are transformative for security posture.

BuildKit secrets prevent build leakage. Before BuildKit, build-time secrets inevitably leaked into image layers. BuildKit mounts are ephemeral by design.

Secrets in git are forever. File deletion doesn't remove secrets from history. Rotate immediately if detected. Pre-commit hooks prevent the problem.

Layer your approach. Production systems use BuildKit for builds, Swarm for simple secrets, Vault for dynamic credentials, and GitLeaks for prevention. Each solves a different problem.

Hands-On Practice

Want to practice these concepts? Lab 10 in the Docker Security Practical Guide covers all five scenarios:

Anti-patterns (environment variables, docker history leaks)
Swarm secrets (encrypted, tmpfs-mounted)
Vault integration (dynamic credentials, audit logging)
BuildKit secret mounts (ephemeral build-time secrets)
Secret scanning with GitLeaks (pre-commit hooks, CI/CD)

All labs are executable on Docker Desktop (macOS/Windows/Linux). Note: Lab 10 covers Vault in development mode to demonstrate core concepts. For production Vault deployment with high availability, TLS, dynamic database credentials, and audit logging integration, see the upcoming Lab 11 (Tier 2 Deep-Dive) in the same repository.

GitHub: https://github.com/opscart/docker-security-practical-guide/tree/master/labs/10-secrets-management
Complete guide: https://opscart.com/docker-security-guide/docker-secrets-management/

By Shamsher Khan
The ORM Is Over: AI-Written SQL Is the New Data Access Layer

Object-relational mappers (ORMs) were widely adopted because they abstracted away the need to deal with databases as separate, different things. You could just define your models and relationships, and the ORM would generate SQL under the hood. This is a good solution for simple CRUD applications and for quickly getting started, but as your application grows and becomes more complex, the abstraction starts leaking. You may need vendor-specific features that the ORM doesn't support, or query performance becomes the bottleneck of your application. So, although it's a trade-off, it might be a good idea to start with SQL instead.

Why ORM Became the Standard

ORMs solved three pain points:

Speed: We don't have to write SQL for every CRUD endpoint.
Consistency: Models and relationships gave teams a shared pattern.
Safety: Parameter binding and abstractions reduced SQL injection risk.

But in production systems, ORMs often become a tax:

We debug the generated SQL anyway
We learn both the ORM and SQL
Performance issues show up late
Complex queries turn into unreadable chains of ORM operators

Core Problem: ORMs Are Leaky Abstractions

Relational databases don't behave like objects. When the product grows and data becomes complex, we start needing:

Partial and tuned indexes
Query optimizations
Transaction locking mechanisms
Materialized views

This is where ORMs start falling apart:

The query looks fine in code but is slow in production.
We get N+1 queries without realizing it.
Seemingly small model changes alter SQL output dramatically.
Eager loading becomes a guessing game.
The ORM becomes a SQL generator that we don't fully control.

AI Makes Writing SQL Cheap

SQL lost to ORMs because it was:

Slower to write for complex queries
Harder to review
More prone to errors

Now AI flips the script, and with the use of AI, developers can:

Describe the intent and have AI generate the query
Get multiple SQL query versions from AI
Get thorough reviews and explanations
Get performance-optimized queries

The New Stack: SQL First, AI-Assisted

A modern ORM replacement is no longer chaos but rather a structured SQL platform.

What We Keep

Migrations
Schema evolution
Transactions
Validation
Observability

What We Drop

Heavy object mapping
Leaky relationship abstractions
Magic loading behavior

What We Add

AI-assisted SQL generation
Query review as part of PRs
Type-safe wrappers for inputs/outputs
Performance guardrails (timeouts, limits)

How We Account For This in Our Backend

1. Store SQL Close to the Domain (Not Scattered)

Organize by feature/domain:

queries/user.sql
queries/loans.sql
queries/payments.sql

2. Always Execute Parameterized SQL

Never build SQL by concatenating strings with user input. Examples of safe patterns:

Postgres: $1, $2, ...
MySQL: ?
Named params if your driver supports them

3. Use AI as Your Query Pair Programmer

Your best AI prompts are specific:

"Write Postgres SQL to fetch X with constraints Y, include pagination, avoid N+1."
"Explain why this query might seq-scan and how to fix it."
"Provide two variants: CTE and non-CTE."
"Assume indexes exist on columns A and B; suggest missing indexes."

Then treat the output like human code: review, test, benchmark.
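As a minimal sketch of the "parameterized SQL behind a typed wrapper" pattern described above (the table, record fields, and query are hypothetical, not taken from the article), a plain JDBC repository keeps the SQL visible and reviewable while still giving callers a typed interface:

Java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;

// Hypothetical sketch: the SQL stays a first-class, reviewable artifact,
// inputs are bound as parameters (never concatenated), and callers get a typed result.
public class LoanRepository {

    public record Loan(long id, String borrower, double balance) {}

    private static final String FIND_OPEN_LOANS = """
            SELECT id, borrower, balance
              FROM loans
             WHERE status = ?
             ORDER BY balance DESC
             LIMIT ?
            """;

    private final DataSource dataSource;

    public LoanRepository(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public List<Loan> findOpenLoans(int limit) throws SQLException {
        List<Loan> result = new ArrayList<>();
        try (Connection conn = dataSource.getConnection();
             PreparedStatement stmt = conn.prepareStatement(FIND_OPEN_LOANS)) {
            stmt.setString(1, "OPEN");   // parameter binding, no string concatenation
            stmt.setInt(2, limit);       // row limit as a default guardrail
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    result.add(new Loan(rs.getLong("id"), rs.getString("borrower"), rs.getDouble("balance")));
                }
            }
        }
        return result;
    }
}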
Guardrails That Make This Production-Safe
If you're going SQL-first, these guardrails matter:
Security Guardrails
Parameter binding everywhere
Strict input validation at request boundaries
Least-privilege DB roles (read-only for read paths, separate writer roles)
Performance Guardrails
Statement timeouts
Row limits for endpoints
Pagination by default
Slow query logging + APM traces
Monitoring benchmarks for important transactions
Maintainability Guardrails
SQL formatting/linting
Integration tests that hit a real DB
Consistent query naming conventions
A code review checklist
Type Safety at Risk Without an ORM?
There is no denying the type safety an ORM provides, but most production bugs aren't type bugs. They are:
Wrong joins
Missing constraints
Inconsistent transaction boundaries
Race conditions
Slow queries and timeouts
We can still get type safety without an ORM by:
Generating types from the schema or from queries
Wrapping SQL calls in typed repository functions
Validating outputs against a schema
When ORMs Still Win
ORMs still make sense when:
We are doing rapid CRUD app development
The schema is small and stable
We don't have complex search queries
We are optimizing for onboarding speed over DB-level control
But once our system scales and starts caring about:
Performance
Observability
Correctness under concurrency
DB-specific features
Most teams end up writing raw SQL anyway.
Closing Thoughts
In the end, we're not calling ORMs "bad" per se; it's just that, for what backend work looks like today, treating the generated SQL as a first-class artifact has proven more productive. ORMs shine with quick CRUD and simple domains, but as soon as you have a real program that cares about its data in production, you care about the SQL, the indexes, and the query plan. And now that AI has made SQL less painful to generate and iterate on, especially for complex queries, SQL often turns out to be the more efficient default: we can usually generate it more quickly, review it in a code diff, run it more safely via parameter binding, and ship it with performance guardrails and observability. The thinner, more predictable data access layer you get in return is easier to debug, easier to optimize, and more faithful to the actual performance profile of the database.

By Satyam Nikhra
Spark on AmpereOne® M Arm Processors Reference Architecture

Introduction Arm technology now powers a broad spectrum of on-premises and cloud server workloads. Building on Ampere Computing's previous reference architecture, which demonstrated that Apache Spark on Ampere Altra – 128C (Ampere Altra 128 Cores) processors delivers superior performance per rack, lower power consumption, and optimized CapEx and OpEx, this paper evaluates and extends that analysis to showcase Spark performance on the latest generation of AmpereOne® M processors. Scope and Audience This document describes the process of setting up, tuning, and evaluating Spark performance using a testbed powered by AmpereOne® M processors. It includes a comparative analysis of the performance benefits of the 12-channel AmpereOne® M processors relative to their predecessors, specifically Ampere Altra – 128C processors. Additionally, the paper examines the Spark performance improvements achieved by using a 64KB page-size kernel over standard 4KB page-size kernels. We outline the installation and tuning procedures for deploying Spark on both single-node and multi-node clusters. These recommendations are intended as general guidelines, and configuration parameters can be further optimized based on specific workloads and use cases. This document is intended for sales engineers, IT and cloud architects, IT and cloud managers, and customers seeking to leverage the performance and power efficiency advantages of Ampere Arm servers across their IT infrastructure. It provides practical guidance and technical insights for professionals interested in deploying and optimizing Arm-based Spark solutions. AmpereOne® M Processors AmpereOne® M is part of the AmpereOne® M family of high-performance server-class processors, designed to deliver exceptional performance for AI Compute and a wide range of mainstream data center workloads. Data-intensive applications such as Hadoop and Apache Spark benefit directly from the 12 DDR5 memory channels, which provide the high memory bandwidth required for large-scale data processing. AmpereOne® M processors introduce a new platform architecture with a higher core count and additional memory channels, differentiating it from earlier Ampere platforms while preserving Ampere’s Cloud Native processing principles. Designed from the ground up for cloud efficiency and predictable scaling, AmpereOne® M employs a one-to-one mapping between vCPUs and physical cores, ensuring consistent performance without resource contention. With up to 192 single-threaded cores and twelve DDR5 channels delivering 5600 MT/s, AmpereOne® M delivers a sustained throughput required for demanding workloads such as Spark, though also including modern AI inference relying on Large Language Models (LLM). AmpereOne® M also emphasizes exceptional performance-per-watt, helping reduce operational costs, energy consumption, and cooling requirements in modern data centers. Apache Spark Apache Spark is a unified data processing and analytics framework used for data engineering, data science, and machine learning workloads. It can operate on a single node or scale across large clusters, making it suitable for processing large and complex datasets. By leveraging distributed computing, Spark efficiently parallelizes data processing tasks across multiple nodes, either independently or in combination with other distributed computing systems. Spark utilizes in-memory caching, which allows for quick access to data and optimized query execution, enabling fast analytic queries on datasets of any size. 
The framework provides APIs in popular programming languages such as Java, Scala, Python, and R, making it accessible to the broad developer community. Spark supports various workloads, including real-time analytics, batch processing, interactive queries, and machine learning, offering a comprehensive solution for modern data processing needs.
Spark supports multiple deployment models. It can run as a standalone cluster or integrate with cluster management and orchestration platforms such as Hadoop YARN, Kubernetes, and Docker. This flexibility allows Spark to adapt to diverse infrastructure environments and workload requirements.
Spark Architecture and Components
[Figure 1: Spark architecture and components]
Spark Driver
The Spark Driver serves as the central controller of the Spark execution engine and is responsible for managing the overall state of the Spark cluster. It interacts with the cluster manager to acquire the necessary resources, such as virtual CPUs (vCPUs) and memory. Once the resources are obtained, the Driver launches the executors, which are responsible for executing the actual tasks of the Spark application.
Additionally, the Spark Driver plays a crucial role in maintaining the state of the application running on the cluster. It keeps track of important information such as the execution plan, task scheduling, and the data transformations and actions to be performed. The Driver coordinates the execution of tasks across the available executors, ensuring efficient data processing and computation. The Spark Driver, hence, acts as a control unit orchestrating the execution of the Spark application on the cluster and maintaining the necessary state and communication with the cluster manager and executors.
Spark Executors
Spark Executors are responsible for executing the tasks assigned to them by the Spark Driver. Once the Driver distributes the tasks across the available Executors, each Executor independently processes its assigned tasks. The Executors run these tasks in parallel, leveraging the resources allocated to them, such as CPU and memory. They perform the necessary computations, transformations, and actions specified in the Spark application code. This includes operations like data transformations, filtering, aggregations, and machine learning algorithms, depending on the nature of the tasks. During the execution of the tasks, the Executors communicate with the Driver, providing updates on their progress and reporting the results of each task.
Cluster Manager
The Cluster Manager is responsible for maintaining the cluster of machines on which the Spark applications run. It handles resource allocation, scheduling, and management of the Spark Driver and Executors, ensuring efficient execution of Spark applications on the available cluster resources. When a Spark application is submitted, the Driver communicates with the Cluster Manager to request the necessary resources, such as CPU, memory, and storage, to run the application. The Cluster Manager ensures that the resources are distributed effectively to meet the requirements of the Spark application. This includes tasks such as assigning containers or worker nodes to execute the Spark Executors and ensuring that the required dependencies and configurations are in place.
Spark RDD
Spark uses a concept called the Resilient Distributed Dataset (RDD), an abstraction that represents an immutable collection of objects that can be split across a cluster. RDDs can be created from various data sources, including SQL databases and NoSQL stores.
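To make the RDD abstraction concrete before continuing, here is a minimal PySpark sketch; it is not part of the reference architecture and simply assumes PySpark is installed and a SparkSession can be created locally.

Python
# Minimal RDD illustration (hypothetical example, not from the reference architecture)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# An immutable, partitioned collection distributed across the cluster
numbers = sc.parallelize(range(1, 1_000_001), numSlices=8)

# Transformations are lazy; nothing executes until an action is called
evens = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# The action triggers the distributed computation
total = squares.reduce(lambda a, b: a + b)
print(f"Sum of squared even numbers: {total}")

spark.stop()

In practice, the DataFrame API discussed later in this paper is usually preferred over raw RDDs, but the execution model (lazy transformations, actions triggering jobs) is the same.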
Spark Core, which is built upon the RDD model, provides essential functionalities such as mapping and reducing operations. It also offers built-in support for joining data sets, filtering, sampling, and aggregation, making it a powerful tool for data processing. When executing tasks, Spark splits them into smaller subtasks and distributes them across multiple executor processes running on the cluster. This enables the parallel execution of tasks across the available computational resources, resulting in improved performance and scalability. Spark Core Spark Core serves as the underlying execution engine for the Spark platform, forming the basis for all other Spark functionality. It offers powerful capabilities such as in-memory computing and the ability to reference datasets stored on external storage systems. One of the key components of Spark Core is the resilient distributed dataset (RDD), which serves as the primary programming abstraction in Spark. RDDs enable fault-tolerant and distributed data processing across a cluster. Spark Core provides a wide range of APIs for creating, manipulating, and transforming RDDs. These APIs are available in multiple programming languages, including Java, Python, Scala, and R. This flexibility allows developers to work with Spark Core using their preferred language and leverages the rich ecosystem of libraries and tools available in those languages. Spark Scheduler The Spark Scheduler is a vital component responsible for task scheduling and execution. It uses a Directed Acyclic Graph (DAG) and employs a task-oriented approach for scheduling tasks. The Scheduler analyzes the dependencies between different stages and tasks of a Spark application, represented by the DAG. It determines the optimal order in which tasks should be executed to achieve efficient computation and minimize data movement across the cluster. By understanding the dependencies and requirements of each task, the Scheduler assigns resources, such as CPU and memory, to the tasks. It considers factors like data locality, where possible, to reduce network overhead and improve performance. The task-oriented approach of the Spark Scheduler allows it to break down the application into smaller, manageable tasks and distribute them across the available resources. This enables parallel execution and efficient utilization of the cluster's computing power. Spark SQL Spark SQL is a widely used component of Apache Spark that facilitates the creation of applications for processing structured data. It adopts a data frame approach and allows efficient and flexible data manipulation. One of the key features of Spark SQL is its ability to interface with various data storage systems. It provides built-in support for reading and writing data from and to different datastores, including JSON, HDFS, JDBC, and Parquet. This makes it easy to work with structured data residing in different formats and storage systems. Additionally, Spark SQL extends its connectivity beyond the built-in datastores. It offers connectors that enable integration with other popular data stores such as MongoDB, Cassandra, and HBase. These connectors allow users to seamlessly interact with and process data stored in these systems using Spark SQL's powerful querying and processing capabilities. Spark MLlib In addition to its core functionalities, Apache Spark includes bundled libraries for machine learning and graph analysis techniques. 
One such library is MLlib, which provides a comprehensive framework for developing machine learning pipelines. MLlib simplifies the implementation of machine learning workflows by offering a wide range of tools and algorithms. It simplifies the implementation of feature extraction and transformations on structured datasets and offers a wide range of machine learning algorithms. MLlib empowers developers to build scalable and efficient machine learning workflows, enabling them to leverage the power of Spark for advanced analytics and data-driven applications. Distributed Storage Spark does not provide its own distributed file system. However, it can effectively utilize existing distributed file systems to store and access large datasets across multiple servers. One commonly used distributed file system with Spark is the Hadoop Distributed File System (HDFS). HDFS allows for the distribution of files across a cluster of machines, organizing data into consistent sets of blocks stored on each node. Spark can leverage HDFS to efficiently read and write data during its processing tasks. When Spark processes data, it typically copies the required data from the distributed file system into its memory. By doing so, Spark reduces the need for frequent interactions with the underlying file system, resulting in faster processing compared to traditional Hadoop MapReduce jobs. As the dataset size increases, additional servers with local disks can be added to the distributed file system, allowing for horizontal scalability and improved performance. Spark Jobs, Stages, and Tasks In a Spark application, the execution flow is organized into a hierarchical structure consisting of Jobs, Stages, and Tasks. A Job represents a high-level unit of work within a Spark application. It can be seen as a complete computation that needs to be performed, involving multiple Stages and transformations on the input data. A Stage is a logical division of tasks that share the same shuffle dependencies, meaning they need to exchange data with each other during execution. Stages are created when there is a shuffle operation, such as a groupBy or a join, that requires data to be redistributed across the cluster. Within each Stage, there are multiple Tasks. A Task represents the smallest unit of work in Spark, representing a single operation that can be executed on a partition of the data. Tasks are typically executed in parallel across multiple nodes in the cluster, with each node responsible for processing a subset of the data. Spark intelligently partitions the data and schedules Tasks across the cluster to maximize parallelism and optimize performance. It automatically determines the optimal number of Tasks and assigns them to available resources, considering factors such as data locality to minimize data shuffling between nodes. Spark handles the management and coordination of Tasks within each stage, ensuring that they are executed efficiently and leveraging the parallel processing capabilities of the cluster. Figure 2 Shuffle boundaries introduce a barrier where Stages/Tasks must wait for the previous stage to finish before they fetch map outputs. In the above diagram, Stage 0 and Stage 1 are executed in parallel, while Stage 2 and Stage 3 are executed sequentially. Hence, Stage 2 has to wait until both Stage 0 and Stage 1 are complete. This execution plan is evaluated by Spark. Spark Test Bed The Spark cluster was set up for performance benchmarking. 
Equipment Under Test
Cluster nodes: 3
CPU: AmpereOne® M
Sockets/node: 1
Cores/socket: 192
Threads/socket: 192
CPU speed: 3200 MHz
Memory channels: 12
Memory/node: 768 GB (12 x 64GB DDR5-5600, 1DPC)
Network card/node: 1 x Mellanox ConnectX-6
OS storage/node: 1 x Samsung 960GB M.2
Data storage/node: 4 x Micron 7450 Gen 4 NVMe, 3.84 TB
Kernel version: 6.8.0-85
Operating system: Ubuntu 24.04.3
YARN version: 3.3.6
Spark version: 3.5.7
JDK version: JDK 17
Spark Installation and Cluster Setup
We set up the cluster with an HDFS file system. Hence, we installed Spark as a Hadoop user and configured the disks for HDFS.
OS Install
The majority of modern open-source and enterprise-supported Linux distributions offer full support for the AArch64 architecture. To install your chosen operating system, use the server Kernel-based Virtual Machine (KVM) console to map or attach the OS installation media, and then follow the standard installation procedure.
Networking Setup
Set up a public network on one of the available interfaces for client communication. This can be used to log in to any of the servers where client communication is needed. Set up a private network for communication between the cluster nodes.
Storage Setup
Choose a drive for the OS install, clear any old partitions, and reformat it. Here, a Samsung 960 GB drive (M.2) was chosen for the OS installation on each server. Add additional high-speed NVMe drives to support the HDFS file system.
Create Hadoop User
Create a user named "hadoop" as part of the OS install. This user was used for both Hadoop and Spark daemons on the test bed.
Post-Install Steps
Perform the following post-install steps on all the nodes after the OS install.
Run yum or apt update on the nodes.
Install packages like dstat, net-tools, lm-sensors, linux-tools-generic, python, and sysstat for your monitoring needs.
Set up ssh trust between all the nodes.
Update the /etc/sudoers file for nopasswd for the hadoop user.
Update /etc/security/limits.conf per the Appendix.
Update /etc/sysctl.conf per the Appendix.
Update the scaling governor and hugepages per the Appendix.
If necessary, make changes to /etc/rc.d to keep the above changes permanent after every reboot.
Set up the NVMe disks as an XFS file system for HDFS:
a. Create a single partition on each of the NVMe disks with fdisk or parted.
b. Create a file system on each of the created partitions using mkfs.xfs -f /dev/nvme[0-n]n1p1.
c. Create directories for mounting as mkdir -p /root/nvme[0-n]1p1.
d. Update /etc/fstab with entries and mount the file system. The UUID of each partition in fstab can be extracted from the blkid command.
e. Change ownership of these directories to the 'hadoop' user created earlier.
Spark Install
Download Hadoop 3.3.6 from the Apache website, Spark 3.5.7 from Apache Spark, and JDK11 and JDK17 for Arm64/AArch64. We will use JDK11 for Hadoop and JDK17 for Spark installs. Extract the tarball files under the Hadoop user home directory. Update Spark and Hadoop configuration files in ~/hadoop/spark/conf and ~/hadoop/etc/hadoop/ and environment parameters in .bashrc per the Appendix. Depending on the hardware specifications of cores, memory, and disk capacities, these may have to be altered. Update the Workers' files to include the set of data nodes. Run the following commands:
Shell
hdfs namenode -format
scp -r ~/hadoop <datanodes>:~/hadoop
~/hadoop/sbin/start-all.sh
~/spark/sbin/start-all.sh
This should start the Spark Master, Worker, and other Hadoop daemons.
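To verify the installation end to end before tuning, a small PySpark job can be submitted to YARN. The sketch below is not part of the original reference architecture; the HDFS output path and application name are illustrative assumptions. It exercises a groupBy, so it also forces at least one shuffle stage across the executors.

Python
# smoke_test.py - hypothetical cluster smoke test; submit with:
#   spark-submit --master yarn --deploy-mode client smoke_test.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cluster-smoke-test").getOrCreate()

# Generate a small synthetic dataset on the executors
df = spark.range(0, 10_000_000).withColumn("bucket", F.col("id") % 100)

# groupBy triggers a shuffle, exercising the network and shuffle partitions
counts = df.groupBy("bucket").count()

# Write to HDFS as Parquet to confirm the datanodes are reachable
counts.write.mode("overwrite").parquet("hdfs:///tmp/smoke_test_counts")  # assumed path

print(f"Distinct buckets written: {counts.count()}")
spark.stop()

If the job completes and the output directory appears in HDFS, the YARN, Spark, and HDFS layers described above are wired correctly; the Spark UI on port 4040 (or the YARN application UI) can then be used for the tuning work covered in the next section.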
Performance Tuning Spark is a complex system where many components interact across various layers. To achieve optimal performance, several factors must be considered, including BIOS and operating system settings, the network and disk infrastructure, and the specific software stack configuration. Experience with Hadoop and Spark significantly helps in fine-tuning these settings. Keep in mind that performance tuning is an ongoing, iterative process. The parameters in the Appendix are provided as starting reference points, gathered from just a few initial tuning cycles. Linux Occasionally, there can be conflicts between the subcomponents of a Linux system, such as the network and disk, which can impact overall performance. The objective is to optimize the system to achieve optimal disk and network throughput and identify and resolve any bottlenecks that may arise. Network To evaluate the network infrastructure, the iperf utility can be utilized to conduct stress tests. Adjusting the TX/RX ring buffers and the number of interrupt queues to align with the cores on the NUMA node where the NIC is located can help optimize performance. However, if the BIOS setting is already configured as chipset-ANC in a monolithic manner, these modifications may not be necessary. Disks Aligned partitions: Partitions should be aligned with the storage's physical block boundaries to maximize I/O efficiency. Utilities like parted can be used to create aligned partitions.I/O queue settings: Parameters such as the queue depth and nr_requests (number of requests) can be fine-tuned via the /sys/block//queue/ directory paths to control how many I/O operations the kernel schedules for a storage device.Filesystem mount options: Utilizing the noatime option in the /etc/fstab file is critical for Hadoop and HDFS, as it prevents unnecessary disk writes by disabling the recording of file access timestamps. The fio (flexible I/O tester) tool is highly effective for benchmarking and validating the performance of the disk subsystem after these changes are implemented. Spark Configuration Parameters There are several tunables on Spark. Only a few of them are addressed here. Tune your parameters by observing the resource usage from http://:4040. Using Data Frames Over RDD It is preferred to use Datasets or Data Frames over RDD, which include several optimizations to improve the performance of Spark workloads. Spark data frames can handle the data better by storing and managing it efficiently, as they maintain the structure of the data and column types. Using Serialized Data Formats In Spark jobs, a common scenario involves writing data to a file, which is then read by another job and written to another file for subsequent Spark processing. To optimize this data flow, it is recommended to write the intermediate data into a serialized file format such as Parquet. Using Parquet as the intermediate file format can yield improved performance compared to formats like CSV or JSON. Parquet is a columnar file format designed to accelerate query processing. It organizes data in a columnar manner, allowing for more efficient compression and encoding techniques. This columnar storage format enables faster data access and processing, particularly for operations that involve selecting specific columns or performing aggregations. By leveraging Parquet as the intermediate file format, Spark jobs can benefit from faster transformation operations. 
The columnar storage and optimized encoding techniques offered by Parquet, as well as its compatibility with processing frameworks like Hadoop, contribute to improved query performance and reduced data processing time. Reducing Shuffle Operations Shuffling is a fundamental Spark operation that reorders data among different executors and nodes. This is necessary for distributed tasks such as joins, grouping, and reductions. This data redistribution is expensive in terms of resources, as it requires considerable disk IO, data packaging, and movement across the network. This is crucial to how Spark works, but can severely reduce performance if not understood and tuned properly. The spark.sql.shuffle.partitions configuration parameter is key to managing shuffle behavior. Found in spark-defaults.conf, this setting dictates the number of partitions created during shuffle operations. The optimal value varies significantly, depending on data volume, available CPU cores, and the cluster's memory capacity. Setting too many partitions results in a large number of smaller output files, potentially increasing overhead. Conversely, too few partitions can lead to individual partitions becoming excessively large, risking out-of-memory errors on executors. Optimizing shuffle performance involves an iterative process, carefully adjusting spark.sql.shuffle.partitions to strike the right balance between partition count and size for your specific workload. Spark Executor Cores The number of cores allocated to each Spark Executor is an important consideration for optimal performance. In general, allocating around 5 cores per Executor tends to be a fair allocation when using the Hadoop Distributed File System (HDFS). When running Spark alongside Hadoop daemons, it is vital to reserve a portion of the available cores for these daemons. This ensures that the Hadoop infrastructure functions smoothly alongside Spark. The remaining cores can then be distributed among the Spark Executors for executing data processing tasks. By striking a balance between allocating cores to Hadoop daemons and Spark executors, you can ensure that both systems coexist effectively, enabling efficient and parallel processing of data. It is important to adjust the allocation based on the specific requirements of your cluster and workload to achieve optimal performance. Spark Executor Instances The number of Spark executor instances represents the total count of executor instances that can be spawned across all worker nodes for data processing. To calculate the total number of cores consumed by a Spark application, you can multiply the number of executors by the cores allocated per executor. The Spark UI provides information on the actual utilization of cores during task execution, indicating the extent to which the available cores are being utilized. It is recommended to maximize this utilization based on the availability of system resources. By effectively using the available cores, you can boost your Spark application's processing power and make its overall performance better. It is crucial to look at the resources in your cluster and change the amount of executor instances and cores given to each executor to match. This ensures resources are used effectively and gets the most computational power out of your Spark application. Executor and Driver Memory The memory configuration for Spark's Driver and Executors plays a critical role in determining the available memory for these components. 
It is important to tune these values based on the memory requirements of your Spark application and the memory availability within your YARN scheduler and NodeManager resource allocation parameters. The Executor's memory refers to the memory allocated for each executor, while the Driver's memory represents the memory allocated for the Spark Driver. These values should be adjusted carefully to ensure optimal performance and avoid memory-related issues. When tuning the memory configuration, it is essential to consider the overall memory availability in your environment and consider any memory constraints imposed by the YARN scheduler and NodeManager settings. By aligning the memory allocation with the available resources, you can optimize the memory utilization and prevent potential out-of-memory errors or performance degradation (swapping or disk spills). It is recommended to monitor the memory usage with Spark UI and adjust the configuration iteratively to achieve the best performance for your Spark workload. Benchmark Tools We used both Intel HiBench and TPC-DS benchmarking tools to measure the performance of the clusters. TeraSort We used the HiBench benchmarking tool to measure the TeraSort performance. HiBench is a popular benchmarking suite specifically designed for evaluating the performance of Big Data frameworks, such as Apache Hadoop and Apache Spark. It consists of a set of workload-specific benchmarks that simulate real-world Big Data processing scenarios. For additional information, you can refer to this link. By running HiBench on the cluster, you can assess and compare its performance in handling various Big Data workloads. The benchmark results can provide insights into factors such as data processing speed, scalability, and resource utilization for each cluster. Update hibench.conf file, like scale, profile, parallelism parameters, and a list of master and slave nodes.Run ~HiBench/bin/workloads/micro/terasort/prepare/prepare.sh.Run ~HiBench/bin/workloads/micro/terasort/spark/run.sh. After executing the above, a file named hibench.report will be generated within the report directory. Additionally, a file named bench.log will contain comprehensive information regarding the execution. The cluster was using a data set of 3 TB. We measured the total power consumed, CPU power, CPU utilization, and other parameters like disk and network utilization using Grafana and IPMI tools. Throughput from the HiBench run was calculated for TeraSort in the following scenarios: Spark running on a single AmpereOne® M node compared with a single node Ampere Altra – 128C (prior generation)Spark running on a single AmpereOne® M node compared with a 3-node AmpereOne® M cluster to measure the scalabilitySpark running on a 3-node AmpereOne® M cluster with 64k page size vs 4k page size TPC-DS TPC-DS is an industry-standard decision-support benchmark that models various aspects of a decision-support system, including data maintenance and query processing. Its purpose is to assist organizations in making informed decisions regarding their technology choices for decision support systems. TPC benchmarks aim to provide objective performance data that is relevant to industry users. For more in-depth information, you can refer to this tpc.org/tpcds/. Similar to TeraSort testing, we conducted TPC-DS benchmark on AmpereOne® M processors using both single-node and 3-node cluster configurations to compare performance with the prior generation Ampere Altra – 128C processors and to assess scalability. 
Additional performance evaluations on the AmpereOne® M processor compared Linux kernels configured with 64KB and 4KB page sizes. This test also used a 3 TB dataset across the cluster. To gain deeper insights into system performance, we monitored key performance metrics including total system power consumption, CPU power, CPU utilization, and network utilization.
Performance Tests on 3-Node Clusters
Figures 3 and 4
We evaluated Spark TeraSort performance using the HiBench tool. The tests were run on one, two, and three nodes with AmpereOne® M processors and compared against the earlier values obtained on Ampere Altra – 128C. From Figure 3, it is evident that there is a 30% benefit of AmpereOne® M over Ampere Altra – 128C while running Spark TeraSort. This increase in performance can be attributed to a newer microarchitecture design, an increase in core count (from 128 to 192), and the 12-channel DDR5 design on AmpereOne® M (versus 8-channel DDR4 on Ampere Altra – 128C). The output for the 3-node configuration, as shown in Figure 4, was found to be close to three times the output of a single node.
64k Page Size
Figure 5
We observed a significant performance increase, approximately 40%, with a 64k page size on the Arm64 architecture while running the Spark TeraSort benchmark. Most modern Linux distributions support largemem kernels natively. We have not observed any issues while running Spark TeraSort benchmarks on largemem kernels.
Performance Per Watt on AmpereOne® M
Figure 6
To evaluate the energy efficiency of the cluster, we computed the Performance-per-Watt (Perf/Watt) ratio. This metric is derived by dividing the cluster's measured throughput (megabytes per second) by its total power consumption (watts) during the benchmarking interval. In these assessments, we observed AmpereOne® M performing 35% better than its predecessor on the Spark TeraSort benchmark.
OS Metrics While Running TeraSort Benchmark
Figure 7
The above image is a snapshot from the Grafana dashboard captured while running the TeraSort benchmark. During the HiBench test, the systems reached CPU utilization of up to 90% while running the TeraSort benchmark. We observed disk read/write activity of approximately 15 GB/s and network throughput of 20 GB/s. Since both observed I/O and network throughput were significantly below the cluster's scalable limits, the results confirm that the benchmark successfully pushed the CPU to its maximum capacity. We observed from the above graphs that AmpereOne® M not only drove disk and network I/O higher than Ampere Altra – 128C, but it also completed tasks considerably faster.
Power Consumption
Figure 8
The graph illustrates the power consumption of the cluster nodes, the platform, and the CPU. The power was measured using the IPMI tool during the benchmark run. We observe that the AmpereOne® M cluster consumed more power than the Ampere Altra – 128C cluster. This is not surprising, given that the latest-generation AmpereOne® M systems have 50% more compute cores and support 50% more memory channels. Additionally, as shown earlier, this increased power usage also delivered notably higher TeraSort throughput as well as better power efficiency (perf/watt) on AmpereOne® M (Figure 6).
TPC-DS Performance
Figures 9 and 10
The TPC-DS benchmarking tool was used to execute the TPC-DS workload on the clusters. The performance evaluation was based on the total time required to execute all 99 SQL queries on the cluster. Queries on AmpereOne® M completed in 50% less time than those run on Ampere Altra – 128C.
The TPC-DS scalability improvement observed between 1 and 3 nodes was less compared to the scalability seen with TeraSort. 64k Page Size Figure 11 TPC-DS queries got a 9% boost by moving to a 64k page size kernel. Conclusion This paper presents a reference architecture for deploying Spark on a multi-node cluster powered by AmpereOne® M processors and compares the results with an earlier deployment based on Ampere Altra 128C processors. The latest TeraSort benchmark results reinforce the conclusions of earlier studies, demonstrating that Arm64-based data center processors provide a compelling, high-performance alternative to traditional x86 systems for Big Data workloads. Extending this analysis, the evaluation of the 12‑channel DDR5 AmpereOne® M platform shows measurable improvements in both raw throughput and performance-per-watt compared to previous-generation processors. These gains confirm that the AmpereOne® M is a groundbreaking platform designed for data centers and enterprises that prioritize performance, efficiency, and sustainability. Big Data workloads demand substantial computational resources and persistent storage, and by deploying these applications on Ampere processors, organizations benefit from both scale-up and scale-out architectures, enabling efficient growth while maintaining consistent throughput. For more information, visit our website at https://www.amperecomputing.com. If you’re interested in additional workload performance briefs, tuning guides, and more, please visit our Solutions Center at https://amperecomputing.com/solutions Appendix /etc/sysctl.conf Shell kernel.pid_max = 4194303 fs.aio-max-nr = 1048576 net.ipv4.conf.default.rp_filter=1 net.ipv4.tcp_timestamps=0 net.ipv4.tcp_sack = 1 net.core.netdev_max_backlog = 25000 net.core.rmem_max = 2147483647 net.core.wmem_max = 2147483647 net.core.rmem_default = 33554431 net.core.wmem_default = 33554432 net.core.optmem_max = 40960 net.ipv4.tcp_rmem =8192 33554432 2147483647 net.ipv4.tcp_wmem =8192 33554432 2147483647 net.ipv4.tcp_low_latency=1 net.ipv4.tcp_adv_win_scale=1 net.ipv6.conf.all.disable_ipv6 = 1 net.ipv6.conf.default.disable_ipv6 = 1 net.ipv4.conf.all.arp_filter=1 net.ipv4.tcp_retries2=5 net.ipv6.conf.lo.disable_ipv6 = 1 net.core.somaxconn = 65535 #memory cache settings vm.swappiness=1 vm.overcommit_memory=0 vm.dirty_background_ratio=2 /etc/security/limits.conf Shell * soft nofile 65536 * hard nofile 65536 * soft nproc 65536 * hard nproc 65536 Miscellaneous Kernel changes Shell #Disable Transparent Huge Page defrag echo never> /sys/kernel/mm/transparent_hugepage/defrag echo never > /sys/kernel/mm/transparent_hugepage/enabled #MTU 9000 for 100Gb Private interface and CPU governor on performance mode ifconfig enP6p1s0np0 mtu 9000 up cpupower frequency-set --governor performance .bashrc file Shell export JAVA_HOME=/home/hadoop/jdk export JRE_HOME=$JAVA_HOME/jre export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$classpath export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin #HADOOP_HOME export HADOOP_HOME=/home/hadoop/hadoop export SPARK_HOME=/home/hadoop/spark export HADOOP_INSTALL=$HADOOP_HOME export HADOOP_HDFS_HOME=$HADOOP_HOME export YARN_HOME=$HADOOP_HOME export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH core-site.xml XML <configuration> <property> <name>fs.defaultFS</name> <value>hdfs://<server1>:9000</value> </property> <property> <name>hadoop.tmp.dir</name> <value>/data/data1/hadoop, /data/data2/hadoop, /data/data3/hadoop, /data/data4/hadoop </value> </property> <property> 
<name>io.native.lib.available</name> <value>true</value> </property> <property> <name>io.compression.codecs</name> <value>org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec, com.hadoop.compression.lzo.LzoCodec, com.hadoop.compression.lzo.LzopCodec, org.apache.hadoop.io.compress.SnappyCodec</value> </property> <property> <name>io.compression.codec.snappy.class</name> <value>org.apache.hadoop.io.compress.SnappyCodec</value> </property> </configuration> hdfs-site.xml XML configuration> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>dfs.blocksize</name> <value>536870912</value> </property> <property> <name>dfs.namenode.name.dir</name> <value>file:/home/hadoop/hadoop_store/hdfs/namenode</value> </property> <property> <name>dfs.datanode.data.dir</name> <value>/data/data1/hadoop, /data/data2/hadoop, /data/data3/hadoop, /data/data4/hadoop </value> </property> <property> <name>dfs.client.read.shortcircuit</name> <value>true</value> </property> <property> <name>dfs.domain.socket.path</name> <value>/var/lib/hadoop-hdfs/dn_socket</value> </property> </configuration> yarn-site.xml XML <configuration> <!-- Site specific YARN configuration properties --> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> <property> <name>yarn.resourcemanager.hostname</name> <value><server1></value> </property> <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>1024</value> </property> <property> <name>yarn.scheduler.maximum-allocation-mb</name> <value>81920</value> </property> <property> <name>yarn.scheduler.minimum-allocation-vcores</name> <value>1</value> </property> <property> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>186</value> </property> <property> <name>yarn.nodemanager.vmem-pmem-ratio</name> <value>4</value> </property> <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>737280</value> </property> <property> <name>yarn.nodemanager.resource.cpu-vcores</name> <value>186</value> </property> <property> <name>yarn.log-aggregation-enable</name> <value>true</value> </property> </configuration> mapred-site.xml XML <configuration> <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property> <property> <name>yarn.app.mapreduce.am.env</name> <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value> </property> <property> <name>mapreduce.map.env</name> <value>HADOOP_MAPRED_HOME=$HADOOP_HOME, LD_LIBRARY_PATH=$LD_LIBRARY_PATH </value> </property> <property> <name>mapreduce.reduce.env</name> <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value> </property> <property> <name>mapreduce.application.classpath</name> <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib-examples/*, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/sources/*, $HADOOP_MAPRED_HOME/share/hadoop/common/*, $HADOOP_MAPRED_HOME/share/hadoop/common/lib/*, $HADOOP_MAPRED_HOME/share/hadoop/yarn/*, $HADOOP_MAPRED_HOME/share/hadoop/yarn/lib/*, $HADOOP_MAPRED_HOME/share/hadoop/hdfs/*, $HADOOP_MAPRED_HOME/share/hadoop/hdfs/lib/*</value> </property> <property> <name>mapreduce.jobhistory.address</name> <value><server1>:10020</value> </property> <property> <name>mapreduce.jobhistory.webapp.address</name> <value><server1>:19888</value> </property> <property> <name>mapreduce.map.memory.mb</name> <value>2048</value> </property> <property> <name>mapreduce.map.cpu.vcore</name> <value>1</value> </property> <property> 
<name>mapreduce.reduce.memory.mb</name> <value>4096</value> </property> <property> <name>mapreduce.reduce.cpu.vcore</name> <value>1</value> </property> <property> <name>mapreduce.map.java.opts</name> <value> -Djava.net.preferIPv4Stack=true -Xmx2g -XX:+UseParallelGC -XX:ParallelGCThreads=32 -Xlog:gc*:stdout</value> </property> <property> <name>mapreduce.reduce.java.opts</name> <value> -Djava.net.preferIPv4Stack=true -Xmx3g -XX:+UseParallelGC -XX:ParallelGCThreads=32 -Xlog:gc*:stdout</value> </property> <property> <name>mapreduce.task.timeout</name> <value>6000000</value> </property> <property> <name>mapreduce.map.output.compress</name> <value>true</value> </property> <property> <name>mapreduce.map.output.compress.codec</name> <value>org.apache.hadoop.io.compress.SnappyCodec</value> </property> <property> <name>mapreduce.output.fileoutputformat.compress</name> <value>true</value> </property> <property> <name>mapreduce.output.fileoutputformat.compress.type</name> <value>BLOCK</value> </property> <property> <name>mapreduce.output.fileoutputformat.compress.codec</name> <value>org.apache.hadoop.io.compress.SnappyCodec</value> </property> <property> <name>mapreduce.reduce.shuffle.parallelcopies</name> <value>32</value> </property> <property> <name>mapred.reduce.parallel.copies</name> <value>32</value> </property> </configuration> spark-defaults.conf Shell spark.driver.memory 32g # used driver memory as 64g for TPC-DS spark.dynamicAllocation.enabled=false spark.executor.cores 5 spark.executor.extraJavaOptions=-Djava.net.preferIPv4Stack=true -XX:+UseParallelGC -XX:ParallelGCThreads=32 spark.executor.instances 108 spark.executor.memory 18g spark.executorEnv.MKL_NUM_THREADS=1 spark.executorEnv.OPENBLAS_NUM_THREADS=1 spark.files.maxPartitionBytes 128m spark.history.fs.logDirectory hdfs://<Master Server>:9000/logs spark.history.fs.update.interval 10s spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider spark.history.ui.port 18080 spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec spark.io.compression.snappy.blockSize=512k spark.kryoserializer.buffer 1024m spark.master yarn spark.master.ui.port 8080 spark.network.crypto.enabled=false spark.shuffle.compress true spark.shuffle.spill.compress true spark.sql.shuffle.partitions 12000 spark.ui.port 8080 spark.worker.ui.port 8081 spark.yarn.archive hdfs://<Master Server>:9000/spark-libs.jar spark.yarn.jars=/home/hadoop/spark/jars/*,/home/hadoop/spark/yarn/* hibench.conf Shell hibench.default.map/shuffle.parallelism 12000 # 3 node cluster hibench.scale.profile bigdata # the bigdata size configured as hibench.terasort.bigdata.datasize 30000000000 in ~/HiBench/conf/workloads/micro/terasort.conf Check out the full Ampere article collection here.

By RamaKrishna Nishtala
Reduce Frontend Complexity in ASP.NET Razor Pages Using HTMX

Modern web development often defaults to heavy client-side frameworks (React/Angular) for CRUD applications, introducing significant architectural overhead and "dependency hell." By integrating HTMX with ASP.NET Razor Pages, we shifted DOM rendering back to the server, utilizing HTML fragments instead of JSON APIs. This approach eliminated complex client-side state management, reduced custom JavaScript by approximately 85%, and maintained a seamless, single-page application (SPA) feel with minimal infrastructure costs.
The "Failure" of the Modern SPA Forest
Engineers often find themselves trapped in a "forest" of NPM packages, Webpack configurations, and Vite build scripts just to render a simple list or validate a form field. In our initial architectural attempts, using a heavy SPA framework for a standard CRUD application manifested in:
Excessive Latency: p95 latency spikes during hydration and client-side rendering.
Maintenance Debt: Frequent breakage in the node_modules tree and complex JSON-to-HTML mapping logic.
Log Bloat: Metrics showed that 40% of client-side errors originated from state synchronization issues between the API and the frontend.
We realized that for most data-driven applications, the complexity of a virtual DOM was a liability, not an asset.
Structured Decision: Server-Driven Interactivity
To solve the complexity crisis, we evaluated shifting back to a server-centric model while retaining modern UX.
Problem: Implementing "live" features (loaders, infinite scroll, real-time validation) without full page reloads.
Constraints: High SEO requirements, limited frontend team bandwidth, and a strict .NET backend ecosystem.
Decision: Adopt HTMX to extend HTML with declarative AJAX attributes.
Trade-offs: We sacrificed the ability to perform complex offline-first state manipulation for a significantly simplified deployment pipeline and faster "Time to Interactive."
Beyond the Hype: Tabular Comparison of Modern Frontend Architectures
Choosing the right frontend strategy depends heavily on team composition, SEO requirements, and the nature of interactivity. Below is a structured comparison of the core approach discussed (HTMX + Razor) against traditional and modern alternatives.
Implementing HTMX in the Razor Pages Lifecycle
How Do We Achieve SPA Behavior with hx-boost?
The hx-boost attribute is the "low-hanging fruit" of HTMX. By adding hx-boost="true" to a navigation container, HTMX intercepts all anchor tags. Instead of a full postback, it fetches the page via AJAX and swaps the <body> content. This preserves the browser history while eliminating the "white flash" of a reload.
How Do We Handle Real-Time Server-Side Validation?
Instead of duplicating validation logic in JavaScript and C#, we use the blur trigger to hit a Razor Page handler:
HTML
<input type="text"
       name="slug"
       hx-get="/Profile/Create?handler=CheckSlug"
       hx-trigger="blur changed"
       hx-target="#slug-validation" />
<span id="slug-validation"></span>
Insight Block: The Power of Partial Rendering
By returning PartialView or Partial from a Razor Page handler, the server sends only the necessary HTML fragment. This reduces the payload size from a full ~50KB page to a ~200B fragment, drastically improving p95 response times for interactive elements.
Technical Deep Dive: Out-of-Band (OOB) Swaps
A common engineering challenge is updating multiple disconnected DOM elements simultaneously (e.g., adding an item to a list and updating a counter in the header). HTMX solves this with Out-of-Band Swaps.
When the server responds with a primary fragment, it can include additional fragments marked with hx-swap-oob="true". HTMX will automatically find the targets by ID and update them, regardless of where they are in the DOM hierarchy. This allows for complex UI synchronization without a centralized client-side state store like Redux.
Technical FAQs
1. Why not just use Blazor for server-side interactivity?
While Blazor is powerful, it requires a persistent SignalR connection (WebSockets), which can be resource-intensive for high-traffic sites. HTMX uses standard stateless HTTP requests, making it more resilient to flaky connections and easier to scale horizontally.
2. How do you handle security and anti-forgery tokens?
ASP.NET Razor Pages requires an __RequestVerificationToken for POST requests. We handle this by adding a global HTMX configuration that includes the token in the X-XSRF-TOKEN header for every request initiated by HTMX.
3. Is HTMX compatible with existing CSS frameworks like Bulma or Bootstrap?
Yes. Since HTMX works by swapping standard HTML, it is completely agnostic of your styling. We successfully implemented loaders and modals using Bulma classes and HTMX's hx-indicator attribute.
Internal Alternatives Considered
Full React Rewrite: Rejected due to integration risk and the need for a separate API layer.
ASP.NET Update Panels (Legacy): Rejected due to the heavy ViewState payload and lack of modern browser support.
Plain Vanilla JS (Fetch API): Rejected as it would require writing significant "glue code" for DOM manipulation and event handling.
Actionable Conclusion: Reusable Takeaways
Start with hx-boost: Immediately improve the "feel" of existing Razor Pages applications by eliminating full-page refreshes.
Leverage Handlers: Use OnGet[HandlerName] in your PageModel to return partials for specific UI components.
Quantify Your JavaScript: Before reaching for a framework, ask if the interaction can be described as a "Server-Side DOM Swap." If yes, HTMX is the more efficient tool.
References
HTMX Official Documentation
ASP.NET Core Razor Pages Handlers Documentation
RFC 7231: Hypertext Transfer Protocol (HTTP/1.1)

By Akash Lomas
Integrating OpenID Connect (OIDC) Authentication in Angular and React

OpenID Connect (OIDC) is an identity layer on top of OAuth 2.0. If you’ve used “Sign in with Google/Microsoft/Okta/Auth0”, you’ve already used OIDC. In modern single-page apps (SPAs), the best practice is: Authorization Code Flow + PKCEStore tokens in memory (avoid localStorage when possible)Use the provider’s well-known discovery documentProtect routes and attach access tokens to API calls This guide shows an end-to-end setup for both Angular and React. Why Authorization Code Flow + PKCE? For SPAs, Authorization Code Flow with PKCE is the most secure and widely recommended option because: No client secrets are exposed in the browserProtects against authorization code interceptionWorks with modern identity providers (Okta, Auth0, Azure AD, Keycloak, Google)Aligns with OAuth 2.1 and security best practices Prerequisites from Your Identity Provider Before integrating OIDC, configure an SPA client in your Identity Provider (IdP) and collect: Issuer / Authority URLClient IDRedirect URIAngular: http://localhost:4200/auth/callbackReact: http://localhost:3000/auth/callbackPost-logout Redirect URIScopes: openid profile email (+ API scopes if required) Ensure the client: Uses PKCEIs marked as a public / SPA clientDoes not use a client secret Part 1: OpenID Connect in Angular Recommended Library angular-oauth2-oidc A mature and production-tested OIDC library for Angular. 1) Install Dependencies TypeScript npm i angular-oauth2-oidc 2. Configure OIDC Settings Create auth.config.ts: TypeScript import { AuthConfig } from 'angular-oauth2-oidc'; export const authConfig: AuthConfig = { issuer: 'https://YOUR_ISSUER', // e.g. https://idp.example.com/realms/myrealm clientId: 'YOUR_CLIENT_ID', redirectUri: window.location.origin + '/auth/callback', postLogoutRedirectUri: window.location.origin + '/', responseType: 'code', scope: 'openid profile email', strictDiscoveryDocumentValidation: false, // set true if issuer matches exactly and uses https in prod showDebugInformation: true, // disable in production requireHttps: false, // only for local dev on http }; 3. 
Create an Authentication Service Now create auth.service.ts: TypeScript import { Injectable } from '@angular/core'; import { OAuthService } from 'angular-oauth2-oidc'; import { authConfig } from './auth.config'; @Injectable({ providedIn: 'root' }) export class AuthService { constructor(private oauthService: OAuthService) {} async init(): Promise<void> { this.oauthService.configure(authConfig); this.oauthService.setupAutomaticSilentRefresh(); // Loads discovery document and tries login via redirect callback await this.oauthService.loadDiscoveryDocumentAndTryLogin(); } login(): void { this.oauthService.initCodeFlow(); } logout(): void { this.oauthService.logOut(); } get isLoggedIn(): boolean { return this.oauthService.hasValidAccessToken(); } get accessToken(): string { return this.oauthService.getAccessToken(); } get idTokenClaims(): object | null { return this.oauthService.getIdentityClaims() || null; } } 4) Initialize Auth on App Startup In app.module.ts: TypeScript import { APP_INITIALIZER, NgModule } from '@angular/core'; import { BrowserModule } from '@angular/platform-browser'; import { OAuthModule } from 'angular-oauth2-oidc'; import { AppComponent } from './app.component'; import { AuthService } from './auth/auth.service'; export function initAuth(auth: AuthService) { return () => auth.init(); } @NgModule({ declarations: [AppComponent], imports: [ BrowserModule, OAuthModule.forRoot({ resourceServer: { allowedUrls: ['http://localhost:8080/api'], // your backend API sendAccessToken: true, }, }), ], providers: [ { provide: APP_INITIALIZER, useFactory: initAuth, deps: [AuthService], multi: true }, ], bootstrap: [AppComponent], }) export class AppModule {} 5) Add a Callback Route Create a route that your redirect URI points to: TypeScript // app-routing.module.ts const routes: Routes = [ { path: 'auth/callback', component: EmptyCallbackComponent }, // ... ]; EmptyCallbackComponent can simply show a spinner. 6) Protect Routes with a Guard TypeScript import { Injectable } from '@angular/core'; import { CanActivate, Router } from '@angular/router'; import { AuthService } from './auth.service'; @Injectable({ providedIn: 'root' }) export class AuthGuard implements CanActivate { constructor(private auth: AuthService, private router: Router) {} canActivate(): boolean { if (!this.auth.isLoggedIn) { this.auth.login(); return false; } return true; } } Use it: TypeScript { path: 'dashboard', canActivate: [AuthGuard], component: DashboardComponent } Part 2: OpenID Connect in React Recommended Library react-oidc-context Built on oidc-client-ts, lightweight and idiomatic for React. 
1) Install Dependencies TypeScript npm i react-oidc-context 2) Wrap Your App with AuthProvider TypeScript // main.tsx or index.tsx import React from "react"; import ReactDOM from "react-dom/client"; import { AuthProvider } from "react-oidc-context"; import App from "./App"; const oidcConfig = { authority: "https://YOUR_ISSUER", client_id: "YOUR_CLIENT_ID", redirect_uri: window.location.origin + "/auth/callback", post_logout_redirect_uri: window.location.origin + "/", response_type: "code", scope: "openid profile email", }; ReactDOM.createRoot(document.getElementById("root")!).render( <React.StrictMode> <AuthProvider {...oidcConfig}> <App /> </AuthProvider> </React.StrictMode> ); 3) Add a Callback Page TypeScript // AuthCallback.tsx import { useEffect } from "react"; import { useAuth } from "react-oidc-context"; export default function AuthCallback() { const auth = useAuth(); useEffect(() => { // react-oidc-context automatically processes the callback on this route }, []); if (auth.isLoading) return <div>Signing you in…</div>; if (auth.error) return <div>Error: {auth.error.message}</div>; // Once user is loaded, route away (your router logic here) return <div>Signed in. You can close this page.</div>; } 4) Create a Simple Route Guard Pattern TypeScript import { useAuth } from "react-oidc-context"; export function RequireAuth({ children }: { children: React.ReactNode }) { const auth = useAuth(); if (auth.isLoading) return <div>Loading…</div>; if (!auth.isAuthenticated) { auth.signinRedirect(); return null; } return <>{children}</>; } Usage: XML <RequireAuth> <Dashboard /> </RequireAuth> 5) Call APIs with the Access Token TypeScript import { useAuth } from "react-oidc-context"; export function useApi() { const auth = useAuth(); return async (url: string) => { const token = auth.user?.access_token; const res = await fetch(url, { headers: token ? { Authorization: `Bearer ${token}` } : {}, }); return res.json(); }; } Token Storage: What to Do (And What to Avoid) For SPAs, storing tokens in localStorage is convenient but increases risk if an XSS bug exists. Preferred options: In-memory storage (default in many libs)Use httpOnly secure cookies (requires a backend “BFF” pattern) If your app is high-security (healthcare/finance), strongly consider the BFF pattern. Common Pitfalls (And Quick Fixes) “Invalid redirect_uri” The redirect URI must match exactly what you configured in the IdP. CORS issues calling your API Your API must allow your SPA origin and accept Authorization header. Logout doesn’t fully log out Some IdPs require an id_token_hint or specific end-session endpoint (the OIDC library usually handles this if discovery is correct). Silent refresh fails Check allowed iframe origins / third-party cookies, and consider refresh-token rotation or short sessions. Production Checklist Use https in production (no exceptions)Turn off debug loggingValidate issuer and discovery docUse least-privilege scopesProtect routes + secure APIs with JWT validationConsider BFF for sensitive apps

By Renjith Kathalikkattil Ravindran
Hadoop on AmpereOne Reference Architecture

Ampere processors with Arm architecture deliver superior power efficiency and cost advantages compared to traditional x86 architecture. Hadoop, with its core components and broader ecosystem, is fully compatible with Arm-based platforms. Ampere Computing has previously published a comprehensive reference architecture demonstrating Hadoop deployments on Ampere® Altra® M processors. This paper builds on that foundation and extends the analysis by highlighting Hadoop performance on the next generation of AmpereOne® M processor. Scope and Audience The scope of this document includes setting up, tuning, and evaluating the performance of Hadoop on a testbed with AmpereOne® M processors. This document also compares the performance benefits of the 12-Channel AmpereOne® M processors with the previous generation of Ampere Altra — the 128C (Ampere Altra 128 Cores) processors. In addition, the document evaluates the use of 64k page-size kernels and highlights the resulting performance improvements compared to traditional 4KB page-size kernels. The document provides step-by-step guidance for installing and tuning Hadoop on single- and multi-node clusters. These recommendations serve as general guidelines for cluster configuration, and parameters can be further optimized for particular workloads and use cases. This document is intended for a diverse audience, including sales engineers, IT and cloud architects, IT and cloud managers, and customers seeking to leverage the performance and power efficiency benefits of Ampere Arm servers in their data centers. It aims to provide valuable insights and technical guidance to these professionals who are interested in implementing Arm-based Hadoop solutions and optimizing their infrastructure. AmpereOne® M Processors AmpereOne® M processor is part of the AmpereOne® family of high-performance server-class processors, engineered to deliver exceptional performance for AI Compute and a broad spectrum of mainstream data center workloads. Data-intensive applications such as Hadoop and Apache Spark benefit directly from the processor’s 12 DDR5 memory channels, which provide the bandwidth required for large-scale data processing. AmpereOne® M processors introduce a new platform architecture featuring higher core counts and additional memory channels, distinguishing it from Ampere’s previous platforms while preserving Ampere’s Cloud Native design principles. AmpereOne® M was designed from the ground up for cloud efficiency and predictable scaling. Each vCPU maps one-to-one with a physical core, ensuring consistent performance without resource contention. With up to 192 single-threaded cores and twelve DDR5 channels delivering 5600 MT/s, AmpereOne® M sustains the throughput required for demanding workloads, ranging from large language model (LLM) inference to real-time analytics. In addition, AmpereOne® M delivers exceptional performance-per-watt, reducing operational costs, energy consumption, and cooling requirements, making it well-suited for sustainable, high-density data center deployments. Hadoop on Ampere Processors There has been a significant shift towards the adoption of Arm-based processors in data centers over the past several years. Arm-based processors are increasingly used for distributed computing and offer compelling advantages for Hadoop deployments, a few of which are discussed in this paper. The Hadoop ecosystem is written in Java and runs seamlessly on Arm processors. 
Most of the Linux distributions, file systems, and open-source tools commonly used with Hadoop provide native Arm support. As a result, migrating existing Hadoop clusters (brownfield deployments) or deploying new clusters (greenfield deployments) on Arm-based infrastructure can be accomplished with little to no disruption. Running Hadoop’s distributed processing framework on energy-efficient Ampere processors represents an important evolution in big data infrastructure. This approach enables more sustainable, power-efficient, and cost-effective Hadoop deployments while maintaining performance and scalability.
Big Data Architecture
The scale, complexity, and unstructured nature of modern data generation exceed the capabilities of traditional software systems. Big data applications are purpose-built to manage and analyze these complex datasets. Big data is defined not only by Volume but also by the Velocity at which the data is generated and processed, the Variety of formats it spans (from structured numerical data to unstructured text, images, and video), its Veracity (the quality and accuracy of the data), and the Value it delivers. Together, these characteristics create both significant challenges and unprecedented opportunities for insight and innovation. Big data includes structured, semi-structured, and unstructured data that is analyzed using advanced analytics. Typical big data deployments operate at a terabyte and petabyte scale, with data continuously created and collected over time. The big data domain includes data ingestion, processing, and analysis of datasets that are too large, fast-moving, or complex for traditional data processing systems. Sources of big data are limitless, and include Internet of Things (IoT) sensors, social media activity, e-commerce transactions, satellite imagery, scientific instruments, web logs, and more. The true power of big data is realized by extracting meaningful insights from this diverse and often unstructured information. By applying advanced analytics like artificial intelligence (AI) and machine learning (ML), organizations can predict trends, gain a deeper understanding of customer behavior and market dynamics, and identify operational inefficiencies at scale. Big data solutions involve the following types of workloads:
• Batch processing of big data sources at rest
• Real-time processing of big data in motion
• Interactive exploration of big data
• Predictive analytics and machine learning
Hadoop Ecosystem
The Apache Hadoop software library facilitates scalable, fault-tolerant, distributed computing by providing a framework for processing large volumes of data across commodity hardware clusters. Designed to scale from single-node deployments to thousands of machines, Hadoop distributes both storage through Hadoop Distributed File System (HDFS) and computation via MapReduce and YARN. Hadoop incorporates built-in fault tolerance to handle common node failures in large clusters. Through resilient software techniques such as data replication, the platform maintains high availability and ensures continuous data processing, even during infrastructure failure. By leveraging distributed computing and a resilient data management framework, Hadoop enables efficient processing and analysis of massive datasets.
The platform supports a wide spectrum of data-intensive workloads, including data analytics, data mining, and machine learning, providing organizations with the scalability, reliability, and performance required for complex data processing at scale. The four main elements of the ecosystem are Hadoop Distributed File System (HDFS), MapReduce, Yet Another Resource Negotiator (YARN), and Hadoop Common.
Hadoop Distributed File System (HDFS)
As the primary storage layer of Hadoop, HDFS manages datasets across distributed nodes. Its architecture ensures high scalability and fault tolerance through data replication and redundancy. HDFS divides data into fixed-size blocks and distributes them across the cluster, optimizing the system for parallel processing and high-throughput data access.
MapReduce
MapReduce is a programming model and processing framework for distributed data processing within the Hadoop ecosystem. It enables parallel execution by dividing workloads into smaller tasks that are distributed across cluster nodes. The Map phase processes data in parallel, and the Reduce phase aggregates and summarizes the results. MapReduce is commonly used for batch processing and large-scale data analytics workloads.
Yet Another Resource Negotiator (YARN)
YARN is the cluster resource management layer of the Hadoop ecosystem. It is responsible for resource allocation, scheduling, and workload coordination across the cluster. YARN enables multiple processing frameworks, such as MapReduce, Apache Spark, and Apache Flink, to run concurrently on the same infrastructure, allowing diverse workloads to efficiently share cluster resources.
Hadoop Common
Hadoop Common is a foundational component of the Hadoop ecosystem, providing the shared libraries and utilities that all Hadoop modules need to operate. It delivers core services including authentication, security protocols, and file system interfaces, ensuring consistency and interoperability across the ecosystem’s components. Hadoop Common has officially supported Arm-based platforms since version 3.3.0, including native libraries optimized for the Arm architecture. This support enables seamless deployment and operation of Hadoop on modern Arm-based infrastructure.
Figure 1
Hadoop Test Bed
A 3-node cluster of AmpereOne® M processors was set up for performance benchmarking.
Equipment Under Test
• Cluster Nodes: 3
• CPU: AmpereOne® M
• Sockets/Node: 1
• Cores/Socket: 192
• Threads/Socket: 192
• CPU Speed: 3200 MHz
• Memory Channels: 12
• Memory/Node: 768 GB
• Network Card/Node: 1 x Mellanox ConnectX-6
• Storage/Node: 4 x Micron 7450 Gen 4 NVMe
• Kernel Version: 6.8.0-85 (Ubuntu 24.04.3)
• Hadoop Version: 3.3.6
• JDK Version: JDK 11
Hadoop Installation and Cluster Setup
OS Install. The majority of modern open-source and enterprise-supported Linux distributions offer full support for the AArch64 architecture. To install your chosen operating system, use the server’s remote KVM console to map or attach the OS installation media, and then follow the standard installation procedure.
Networking Setup. Set up a public network on one of the available interfaces for client communication. This can be used to log in to any of the servers where client communication is needed. Set up a private network for communication between the cluster nodes.
Storage Setup. Choose a drive for the OS installation, clear any old partitions, reformat it, and select it as the installation target.
A Samsung 960 GB drive (M.2) was chosen for the OS installation in this setup. Add additional high-speed NVMe drives for the HDFS file system.
Create Hadoop User. Create a user named “hadoop” as part of the OS installation and grant the user the necessary sudo privileges.
Post-Install Steps. Perform the following post-installation steps on all the nodes after the OS installation:
• Run yum or apt update on the nodes.
• Install packages like dstat, net-tools, nvme-cli, lm-sensors, linux-tools-generic, python, and sysstat for your monitoring needs.
• Set up ssh trust between all the nodes.
• Update the /etc/sudoers file for nopasswd for the hadoop user.
• Update /etc/security/limits.conf per the Appendix.
• Update /etc/sysctl.conf per the Appendix.
• Update the scaling governor to performance and disable transparent hugepages per the Appendix.
• If necessary, make changes to /etc/rc.d to keep the above changes permanent after every reboot.
• Set up the NVMe disks as an XFS file system for HDFS (see the shell sketch further below):
  • Zap and format the NVMe disks.
  • Create a single partition on each of the NVMe disks with fdisk or parted.
  • Create a file system on each of the created partitions with mkfs.xfs -f /dev/nvme[0-n]n1p1.
  • Create directories for the mount points: mkdir -p /root/nvme[0-n]1p1.
  • Update /etc/fstab with entries to mount the file systems. The UUID of each partition for the fstab update can be extracted with the blkid command.
  • Change ownership of these directories to the ‘hadoop’ user created earlier.
Hadoop Install
Download Hadoop 3.3.6 from the Apache website and JDK 11 for Arm/AArch64. Extract the tarballs under the Hadoop home directory. Update the Hadoop configuration files in ~/hadoop/etc/hadoop/ and the environment parameters in .bashrc per the Appendix. Depending on the hardware specifications of cores, memory, and disk capacities, these parameters may have to be altered. Update the workers file to include the set of data nodes. Run the following commands:
Shell
hdfs namenode -format
scp -r ~/hadoop <datanodes>:~/hadoop
~/hadoop/sbin/start-all.sh
This should start the NodeManager, ResourceManager, NameNode, and DataNode processes on the nodes. Please note that the NameNode and ResourceManager are started only on the master node.
Verification of the setup:
• Run the jps command on each node to check the status of the Hadoop daemons.
• Verify that -ls, -put, -du, and -mkdir commands can be run on the cluster.
Performance Tuning
Hadoop is a complex framework where many components interact across multiple systems. Overall performance is influenced by several distinct factors:
• Platform settings: This includes configurations at the hardware and operating system levels, such as BIOS settings, specific OS parameters, and the performance of the network and disk subsystems.
• Hadoop configuration: The configuration of the Hadoop software stack itself also plays a critical role in efficiency. Optimizing these settings typically requires prior experience with Hadoop.
Performance tuning is an iterative process, and the parameters provided in the Appendix are reference values obtained through a few iterations.
Linux: Occasionally, conflicts between different subcomponents of a Linux system, such as the networking and disk subsystems, can arise and negatively impact overall performance. The primary objective is to optimize the entire system to achieve optimal disk and network throughput by identifying and resolving any bottlenecks that may emerge during operation.
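As a companion to the NVMe setup steps listed above, here is a minimal shell sketch of the partition, file system, and fstab commands; the device name (/dev/nvme0n1) and mount point (/data/data1) are placeholders, so repeat and adjust per disk and per node.
Shell
# Hedged sketch of the NVMe preparation steps described above (placeholder device and mount point).
# Zap old metadata and create a single partition spanning the disk.
sudo wipefs -a /dev/nvme0n1
sudo parted -s /dev/nvme0n1 mklabel gpt mkpart primary 0% 100%

# Create an XFS file system on the new partition.
sudo mkfs.xfs -f /dev/nvme0n1p1

# Create the mount point and add an fstab entry keyed by UUID (noatime, as recommended in the disk tuning notes below).
sudo mkdir -p /data/data1
UUID=$(sudo blkid -s UUID -o value /dev/nvme0n1p1)
echo "UUID=${UUID} /data/data1 xfs defaults,noatime 0 0" | sudo tee -a /etc/fstab
sudo mount /data/data1

# Hand the directory over to the hadoop user created during the OS install.
sudo chown -R hadoop:hadoop /data/data1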
Network: To evaluate the underlying network infrastructure, the iperf utility can be used to conduct stress tests. Performance optimization involves adjusting specific driver parameters, such as the Transmit (TX) and Receive (RX) ring buffers and the number of interrupt queues, to align them with the CPU cores on the Non-Uniform Memory Access (NUMA) node where the Network Interface Card (NIC) resides. However, if the system's BIOS is already configured in monolithic mode, these specific kernel-level modifications related to NUMA alignment may not be necessary.
Disks: When optimizing performance in a Hadoop environment, administrators should focus on specific disk subsystem parameters:
• Aligned partitions: Partitions should be aligned with the storage's physical block boundaries to maximize I/O efficiency. Utilities like parted can be used to create aligned partitions.
• I/O queue settings: Parameters such as the queue depth and nr_requests (number of requests) can be fine-tuned via the /sys/block/<device>/queue/ directory paths to control how many I/O operations the kernel schedules for a storage device.
• Filesystem mount options: Utilizing the noatime option in the /etc/fstab file is critical for Hadoop, as it prevents unnecessary disk writes by disabling the recording of file access timestamps.
The fio (flexible I/O tester) tool is highly effective for benchmarking and validating the performance of the disk subsystem after these changes are implemented.
HDFS, YARN, and MapReduce
HDFS
In HDFS, the primary parameters to consider for data management and resilience are the block size and replication factor. By default, the HDFS block size is 128 MB. Files are divided into chunks matching this size, which are then distributed across different data nodes. In certain high-performance environments or test beds, a larger block size, such as 512 MB, might be used to optimize throughput for large files. The test bed with the AmpereOne® M processor was also set up with a 512 MB block size. The replication factor (defaulting to 3) determines data redundancy. When an application writes data once, HDFS replicates those blocks across the cluster based on this factor, ensuring three identical copies are available for high availability and fault tolerance. Consequently, the total storage space required is directly proportional to the replication factor used (a factor of 3 means you need 3x the raw data size in storage capacity). HDFS 3.x introduced Erasure Coding (EC) as an alternative to traditional replication. EC significantly reduces storage overhead; for example, a 6+3 EC configuration provides data redundancy comparable to a 3x replication factor but uses substantially less physical storage space. It is important to note, however, that while EC saves storage, it introduces additional computational and network load compared to simple replication. In the described test bed environment, a replication factor of 1 was employed.
YARN
YARN (Yet Another Resource Negotiator) is the resource management framework within the Hadoop ecosystem. It offers two main scheduler options: the Fair scheduler and the Capacity scheduler. The Fair scheduler (the default configuration) distributes available cluster resources evenly and dynamically among all running applications or jobs over time. The Capacity scheduler allocates a guaranteed, fixed capacity to each queue, user, or job.
By default, the behavior of standard configurations is that if a queue does not fully utilize its reserved capacity, the excess may remain unused or may be conditionally shared, depending on specific configuration parameters. Key configuration settings for either scheduler involve defining the limits for resource allocation, specifically the minimum allocation, maximum allocation, and incremental "stepping" values for both memory and virtual CPU cores (vcores). We used the default configuration in the testing environment.
MapReduce
In the MapReduce framework, a job is broken down into numerous smaller tasks, where each task is designed to have a small memory footprint and to use one or only a few virtual cores (vcores). Resource allocation within YARN is determined by these task requirements, considering the total memory available to the YARN NodeManager and the total number of vcores it manages. These configurations can be directly adjusted within the yarn-site.xml file. Reference parameters used in this test bed are provided in the Appendix for guidance.
Benchmark Tools
We used the HiBench benchmarking tool. HiBench is a popular benchmarking suite specifically designed for evaluating the performance of Big Data frameworks, such as Apache Hadoop and Apache Spark. It consists of a set of workload-specific benchmarks that simulate real-world Big Data processing scenarios. For additional information, you can refer to this link. By running HiBench on the cluster, you can assess and compare its performance in handling various Big Data workloads. The benchmark results can provide insights into factors such as data processing speed, scalability, and resource utilization for each cluster.
Steps to run HiBench on the cluster:
• Download the HiBench software from the link above.
• Update the hibench.conf file with settings such as the scale profile, parallelism parameters, and the list of master and slave nodes (an illustrative sketch of these settings appears after the Appendix).
• Run ~HiBench/bin/workloads/micro/terasort/prepare/prepare.sh.
• Run ~HiBench/bin/workloads/micro/terasort/Hadoop/run.sh.
The above will generate a hibench.report file under the report directory. Further, a bench.log file provides details of the run. The cluster was using a data set of 3 TB. We measured the total power consumed, CPU power, CPU utilization, and other parameters like disk and network utilization using Grafana and IPMI tools. Throughput from the HiBench run was calculated for TeraSort in the following scenarios:
• Hadoop running on a single node on AmpereOne® M, to compare with the previous generation of Ampere Altra – 128C.
• Hadoop running on a single node on AmpereOne® M, to compare with a 3-node cluster of AmpereOne® M and measure scalability.
• Hadoop running on a 3-node cluster with a 64k page size on AmpereOne® M, to compare with a 4k page size on the same processor.
Performance Tests on AmpereOne® M Cluster
TeraSort Performance
Figures 2 and 3
Using the HiBench tool as mentioned above, we ran Hadoop TeraSort tests on one, two, and three nodes with AmpereOne® M processors and compared the results with the values we obtained earlier on Ampere Altra – 128C. From Figure 2, it is evident that there is a 40% benefit of AmpereOne® M over Ampere Altra – 128C while running Hadoop TeraSort. This increase in performance can be attributed to a newer microarchitecture design, an increase in core count (from 128 to 192), and the 12-channel DDR5 design on AmpereOne® M. Near-linear scalability was observed when running TeraSort.
The output for the three-node configuration was found to be very close to three times the output of a single node.
64k Page Size
Figure 4
We observed a significant performance increase, approximately 30%, with a 64k page size on the Arm architecture while running the Hadoop TeraSort benchmark. Most modern Linux distributions support largemem kernels natively. For other systems, building a custom 64k page-size kernel is a straightforward procedure, and switching to it requires only a standard reboot (a minimal sketch follows the Conclusions). We have not observed any issues while running Hadoop TeraSort benchmarks on largemem kernels.
Performance per Watt on AmpereOne® M
Figure 5
To evaluate the energy efficiency of the cluster, we computed the Performance-per-Watt (Perf/Watt) ratio. This metric is derived by dividing the cluster's measured throughput (megabytes per second) by its total power consumption (watts) during the benchmarking interval. In these assessments, we observed AmpereOne® M performing 30% better than its predecessor on the Hadoop TeraSort benchmark.
OS Metrics While Running the Benchmark
Figure 6
The above image is a snapshot from the Grafana dashboard captured while running the benchmark. The systems achieved maximum CPU utilization while running the TeraSort benchmark using HiBench. We observed disk read/write activity of approximately 10 GB/s and network throughput of 30 GB/s. Since both the observed I/O and network throughput were significantly below the cluster's scalable limits, the results confirm that the benchmark successfully pushed the CPUs to their maximum capacity. We observed from the above graphs that AmpereOne® M not only drove disk and network I/O higher than Ampere Altra – 128C, but also completed tasks considerably faster.
Power Consumption
Figure 7
The graph illustrates the power consumption of the cluster nodes, the platform, and the CPU. The power was measured with the IPMI tool during the benchmark run. The data reveals that the AmpereOne® M cluster consumed more absolute power than the Ampere Altra – 128C. However, this increased power usage correlated with a higher TeraSort throughput on the AmpereOne® M system, and the AmpereOne® M cluster delivers better performance per watt (Figure 5).
Conclusions
This paper presents a reference architecture for deploying Hadoop on a multi-node cluster powered by AmpereOne® M processors and compares the results against a prior deployment on Ampere Altra – 128C processors. The latest TeraSort benchmark results validate the findings of earlier studies, demonstrating that Arm-based processors provide a compelling, high-performance alternative to traditional x86 systems for big-data workloads. Building on this foundation, the evaluation of the 12‑channel DDR5 AmpereOne® M platform shows measurable improvements not only in raw throughput but also in performance per watt compared to previous-generation processors. The improvements confirm that the AmpereOne® M is a purpose-built platform designed for modern data centers and enterprises that prioritize both performance and energy efficiency. AmpereOne® M addresses the core requirements of today’s organizations: performance, efficiency, and scalability. Big Data workloads demand significant compute capacity and persistent storage, and by deploying these applications on Ampere processors, organizations benefit from both scale-up and scale-out architectures. This approach enables a higher density per rack, reduces power consumption, and delivers consistent throughput at scale.
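Relating to the 64k page-size results above, here is a minimal sketch of enabling a 64k kernel on an Ubuntu arm64 node; the package name assumes Ubuntu's prebuilt 64k kernel flavor and is an assumption that may not apply to other distributions, where a custom kernel built with CONFIG_ARM64_64K_PAGES=y would be needed instead.
Shell
# Hedged sketch: switch an Ubuntu arm64 node to a 64k page-size kernel.
# Assumption: the distribution ships a prebuilt 64k flavor (package name may vary).
sudo apt install -y linux-image-generic-64k
sudo reboot

# After the reboot, confirm the running kernel and the page size.
uname -r            # expect a *-generic-64k kernel
getconf PAGESIZE    # expect 65536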
To learn more about our developer efforts and find best practices, visit Ampere’s Developer Center and join the conversation in the Ampere Developer Community. Appendix /etc/sysctl.conf Shell kernel.pid_max = 4194303 fs.aio-max-nr = 1048576 net.ipv4.conf.default.rp_filter=1 net.ipv4.tcp_timestamps=0 net.ipv4.tcp_sack = 1 net.core.netdev_max_backlog = 25000 net.core.rmem_max = 2147483647 net.core.wmem_max = 2147483647 net.core.rmem_default = 33554431 net.core.wmem_default = 33554432 net.core.optmem_max = 40960 net.ipv4.tcp_rmem =8192 33554432 2147483647 net.ipv4.tcp_wmem =8192 33554432 2147483647 net.ipv4.tcp_low_latency=1 net.ipv4.tcp_adv_win_scale=1 net.ipv6.conf.all.disable_ipv6 = 1 net.ipv6.conf.default.disable_ipv6 = 1 net.ipv4.conf.all.arp_filter=1 net.ipv4.tcp_retries2=5 net.ipv6.conf.lo.disable_ipv6 = 1 net.core.somaxconn = 65535 #memory cache settings vm.swappiness=1 vm.overcommit_memory=0 vm.dirty_background_ratio=2 /etc/security/limits.conf Shell * soft nofile 65536 * hard nofile 65536 * soft nproc 65536 * hard nproc 65536 Miscellaneous Kernel changes Shell #Disable Transparent Huge Page defrag echo never> /sys/kernel/mm/transparent_hugepage/defrag echo never > /sys/kernel/mm/transparent_hugepage/enabled #MTU 9000 for 100Gb Private interface and CPU governor on performance mode ifconfig enP6p1s0np0 mtu 9000 up cpupower frequency-set --governor performance .bashrc file Shell export JAVA_HOME=/home/hadoop/jdk export JRE_HOME=$JAVA_HOME/jre export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$classpath export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin #HADOOP_HOME export HADOOP_HOME=/home/hadoop/hadoop export HADOOP_INSTALL=$HADOOP_HOME export HADOOP_MAPRED_HOME=$HADOOP_HOME export HADOOP_COMMON_HOME=$HADOOP_HOME export HADOOP_HDFS_HOME=$HADOOP_HOME export YARN_HOME=$HADOOP_HOME export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH export PATH=$PATH:/home/hadoop/.local/bin core-site.xml XML <configuration> <property> <name>fs.defaultFS</name> <value>hdfs://<server1>:9000</value> </property> <property> <name>hadoop.tmp.dir</name> <value>/data/data1/hadoop, /data/data2/hadoop, /data/data3/hadoop, /data/data4/hadoop </value> </property> <property> <name>io.native.lib.available</name> <value>true</value> </property> <property> <name>io.compression.codecs</name> <value>org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec, com.hadoop.compression.lzo.LzoCodec, com.hadoop.compression.lzo.LzopCodec, org.apache.hadoop.io.compress.SnappyCodec</value> </property> <property> <name>io.compression.codec.snappy.class</name> <value>org.apache.hadoop.io.compress.SnappyCodec</value> </property> </configuration> hdfs-site.xml XML <configuration> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>dfs.blocksize</name> <value>536870912</value> </property> <property> <name>dfs.namenode.name.dir</name> <value>file:/home/hadoop/hadoop_store/hdfs/namenode</value> </property> <property> <name>dfs.datanode.data.dir</name> <value>/data/data1/hadoop, /data/data2/hadoop, /data/data3/hadoop, /data/data4/hadoop </value> </property> <property> <name>dfs.client.read.shortcircuit</name> <value>true</value> </property> <property> <name>dfs.domain.socket.path</name> <value>/var/lib/hadoop-hdfs/dn_socket</value> </property> </configuration> yarn-site.xml XML <configuration> <!-- Site specific YARN configuration properties --> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> 
<property> <name>yarn.resourcemanager.hostname</name> <value><server1></value> </property> <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>1024</value> </property> <property> <name>yarn.scheduler.maximum-allocation-mb</name> <value>81920</value> </property> <property> <name>yarn.scheduler.minimum-allocation-vcores</name> <value>1</value> </property> <property> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>186</value> </property> <property> <name>yarn.nodemanager.vmem-pmem-ratio</name> <value>4</value> </property> <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>737280</value> </property> <property> <name>yarn.nodemanager.resource.cpu-vcores</name> <value>186</value> </property> <property> <name>yarn.log-aggregation-enable</name> <value>true</value> </property> </configuration> mapred-site.xml XML <configuration> <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property> <property> <name>yarn.app.mapreduce.am.env</name> <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value> </property> <property> <name>mapreduce.map.env</name> <value>HADOOP_MAPRED_HOME=$HADOOP_HOME, LD_LIBRARY_PATH=$LD_LIBRARY_PATH </value> </property> <property> <name>mapreduce.reduce.env</name> <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value> </property> <property> <name>mapreduce.application.classpath</name> <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib-examples/*, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/sources/*, $HADOOP_MAPRED_HOME/share/hadoop/common/*, $HADOOP_MAPRED_HOME/share/hadoop/common/lib/*, $HADOOP_MAPRED_HOME/share/hadoop/yarn/*, $HADOOP_MAPRED_HOME/share/hadoop/yarn/lib/*, $HADOOP_MAPRED_HOME/share/hadoop/hdfs/*, $HADOOP_MAPRED_HOME/share/hadoop/hdfs/lib/*</value> </property> <property> <name>mapreduce.jobhistory.address</name> <value><server1>:10020</value> </property> <property> <name>mapreduce.jobhistory.webapp.address</name> <value><server1>:19888</value> </property> <property> <name>mapreduce.map.memory.mb</name> <value>2048</value> </property> <property> <name>mapreduce.map.cpu.vcore</name> <value>1</value> </property> <property> <name>mapreduce.reduce.memory.mb</name> <value>4096</value> </property> <property> <name>mapreduce.reduce.cpu.vcore</name> <value>1</value> </property> <property> <name>mapreduce.map.java.opts</name> <value> -Djava.net.preferIPv4Stack=true -Xmx2g -XX:+UseParallelGC -XX:ParallelGCThreads=32 -Xlog:gc*:stdout</value> </property> <property> <name>mapreduce.reduce.java.opts</name> <value> -Djava.net.preferIPv4Stack=true -Xmx3g -XX:+UseParallelGC -XX:ParallelGCThreads=32 -Xlog:gc*:stdout</value> </property> <property> <name>mapreduce.task.timeout</name> <value>6000000</value> </property> <property> <name>mapreduce.map.output.compress</name> <value>true</value> </property> <property> <name>mapreduce.map.output.compress.codec</name> <value>org.apache.hadoop.io.compress.SnappyCodec</value> </property> <property> <name>mapreduce.output.fileoutputformat.compress</name> <value>true</value> </property> <property> <name>mapreduce.output.fileoutputformat.compress.type</name> <value>BLOCK</value> </property> <property> <name>mapreduce.output.fileoutputformat.compress.codec</name> <value>org.apache.hadoop.io.compress.SnappyCodec</value> </property> <property> <name>mapreduce.reduce.shuffle.parallelcopies</name> <value>32</value> </property> <property> <name>mapred.reduce.parallel.copies</name> <value>32</value> </property> </configuration> Check out the full Ampere 
article collection here.
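As a supplement to the HiBench steps described in the benchmarking section, here is an illustrative sketch of the hibench.conf settings that were referenced; the values are placeholders rather than the exact settings used on this test bed, and a few of the properties (hibench.hadoop.home, hibench.hdfs.master, hibench.hadoop.release) live in conf/hadoop.conf in the HiBench distribution.
Shell
# Illustrative HiBench configuration entries (placeholder values).
# conf/hibench.conf
hibench.scale.profile                 bigdata        # tiny | small | large | huge | gigantic | bigdata
hibench.default.map.parallelism       1024
hibench.default.shuffle.parallelism   1024
hibench.masters.hostnames             master-node
hibench.slaves.hostnames              worker-node1 worker-node2

# conf/hadoop.conf
hibench.hadoop.home                   /home/hadoop/hadoop
hibench.hdfs.master                   hdfs://master-node:9000
hibench.hadoop.release                apache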

By RamaKrishna Nishtala
Chat with Your Oracle Database: SQLcl MCP + GitHub Copilot

Ask questions in plain English inside VS Code. Get SQL results back instantly — no copy-pasting, no switching tools.
The Problem: Too Many Switches
If you work with Oracle databases, you know the drill: write SQL in a text editor, copy it to SQL Developer or SQLcl, run it, then copy results back. Add an AI assistant into the mix and you get another window — one that can write SQL but has no way to actually run it against your database. What if you could stay in VS Code, ask GitHub Copilot something like "Show me the top 10 customers by total order value", and see real results right in the chat? You can. Oracle’s SQLcl includes a built-in Model Context Protocol (MCP) server, and GitHub Copilot in VS Code can talk to it. This article walks you through the setup and shows you how to query your Oracle database using natural language.
What You Need to Know
MCP in 30 Seconds
Model Context Protocol (MCP) is an open standard from Anthropic that lets AI assistants call external tools in a structured way. Think of it as a plug-and-play layer: the AI sends a request like "run this SQL", and a tool on your machine executes it and returns the result. The AI never talks to the database directly — it talks to the tool, and the tool talks to the database.
Plain Text
You (natural language) → Copilot (understands, writes SQL) → MCP → SQLcl → Oracle → results back
SQLcl’s Role
SQLcl is Oracle’s modern command-line interface for Oracle Database. It runs on your machine and can connect to Oracle Autonomous Database (or any Oracle DB) via JDBC. Starting with SQLcl 24.3, it ships with an embedded MCP server. When you run sql mcp, SQLcl listens on standard input/output — no separate server process, no ports to open.
Prerequisites
• Oracle Cloud Free account — oracle.com/cloud/free
• Oracle Autonomous Database (Always Free tier)
• SQLcl installed (brew install sqlcl on macOS)
• SQL Developer (for saving a named connection once)
• VS Code with GitHub Copilot (MCP support is generally available as of VS Code 1.102)
Step 1: Create and Configure Oracle Autonomous Database
1. In the OCI Console, go to Oracle Database → Autonomous Database and click Create Autonomous Database.
2. Set Always Free to ON, choose a database name (e.g. myappdb), and set an admin password.
3. After creation, go to Database connection → Download wallet. Extract the zip to a folder (e.g. ~/Downloads/Wallet_myappdb/).
4. Fix the wallet path in sqlnet.ora (OCI uses ? as a placeholder):
Plain Text
sed -i '' 's|DIRECTORY="?/network/admin"|DIRECTORY="/path/to/your/wallet"|' \
  /path/to/your/wallet/sqlnet.ora
5. Whitelist your IP if needed: Network → Access Control List → Add your public IP (e.g. 203.0.113.10/32).
Step 2: Install SQLcl and Verify the Connection
Plain Text
# macOS (Homebrew)
brew install sqlcl
# Verify
sql -v
Test the connection (replace paths and credentials):
Plain Text
TNS_ADMIN=~/Downloads/Wallet_myappdb \
  sql admin/<password>@myappdb_high
You should see Connected to: and a SQL> prompt. Type exit to leave.
Step 3: Save the Connection in SQL Developer
SQLcl and SQL Developer share a connection registry. Copilot will use connections by name, so you need to save it once:
• Open SQL Developer.
• Create a New Connection → Cloud Wallet.
• Point to your wallet folder, enter username/password, choose the service (e.g. myappdb_high).
• Name the connection (e.g. myappdb) and Test → Save.
The connection is stored in ~/.sqldeveloper/system*/o.jdeveloper.db.connection*/connections.xml. Copilot will look it up by that name — no credentials in chat.
Step 4: Add SQLcl MCP to VS Code
VS Code (and Copilot) use MCP servers defined in settings. Add SQLcl:
Option A: User settings
• Open Settings (Cmd+, / Ctrl+,).
• Search for MCP or go to GitHub Copilot → MCP servers.
• Add the SQLcl server, or edit settings.json directly:
Plain Text
{ "mcp": { "servers": { "sqlcl": { "type": "stdio", "command": "sql", "args": ["mcp"] } } } }
Option B: Workspace config (.vscode/mcp.json)
For team projects, create or edit .vscode/mcp.json:
Plain Text
{ "mcpServers": { "sqlcl": { "command": "sql", "args": ["mcp"] } } }
Restart VS Code so the MCP server is loaded.
Step 5: Chat with Your Database
• Open GitHub Copilot Chat (Cmd+Shift+I / Ctrl+Shift+I).
• Ask in plain English, for example:
  • "Connect to myappdb and show me all tables"
  • "What columns does the ORDERS table have?"
  • "Find the top 10 customers by total order value"
  • "How many orders per month in the last year?"
Copilot will use the SQLcl MCP tools to connect, run SQL, and display results.
What Happens Under the Hood
When you ask a question:
1. Copilot understands the intent and decides which MCP tool to use (run-sql, schema-information, etc.).
2. Copilot generates the SQL.
3. MCP sends a JSON-RPC message to SQLcl over stdio: { "tool": "run-sql", "params": { "connection": "myappdb", "sql": "..." } }.
4. SQLcl looks up the saved connection, opens a JDBC session, and runs the SQL against Oracle.
5. Oracle returns rows; SQLcl formats them as JSON and sends them back through MCP.
6. Copilot shows the results in the chat.
Plain Text
┌─────────────────────────────────────────────────────┐
│            VS Code + GitHub Copilot Chat            │
└──────────────────────────┬──────────────────────────┘
                           │ MCP (JSON-RPC over stdio)
                           ▼
┌─────────────────────────────────────────────────────┐
│            SQLcl MCP Server (sql mcp)               │
│  Tools: connect, run-sql, schema-information, etc.  │
└──────────────────────────┬──────────────────────────┘
                           │ JDBC / TLS
                           ▼
┌─────────────────────────────────────────────────────┐
│            Oracle Autonomous Database               │
└─────────────────────────────────────────────────────┘
Example Prompts You Can Try
Goal — Example prompt
• Explore schema — "List all tables in myappdb"
• Inspect table — "Describe the CUSTOMER table"
• Aggregate data — "Sum of order totals by customer"
• Filter and sort — "Top 5 orders by price"
• Time series — "Monthly order counts for 2024"
• DDL — "Generate CREATE TABLE for ORDERS"
Large Results: Use run-sql-async
For queries that return many rows (e.g. full table scans on large tables), the standard run-sql tool can hit Copilot’s context limits. SQLcl provides run-sql-async, which submits the query, polls for completion, and fetches only the final result. Copilot will use this pattern when appropriate; you don’t need to change anything — just be aware that for big result sets, the flow may take a bit longer.
Security in a Nutshell
Concern — How it’s handled
• Credentials — Stored in SQL Developer’s connection store, not in chat or config
• Network — Wallet provides mutual TLS; traffic is encrypted
• Access control — ADB ACL restricts which IPs can connect
• Destructive operations — Copilot typically asks for confirmation before DDL/DML
Troubleshooting
• "SQLcl not found" — Ensure sql is on your PATH (which sql). Add SQLcl’s bin directory to PATH if needed.
• "Connection refused" / ORA-12506 — Add your IP to the ADB Access Control List.
• "Wallet / SSL error" — Fix the DIRECTORY path in sqlnet.ora as in Step 1.
• MCP server not showing — Restart VS Code after editing MCP config. Ensure Copilot has permission to use MCP servers (some org policies disable it).
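To make the flow concrete, this is roughly the kind of SQL Copilot might generate for the opening prompt about the top 10 customers by total order value; the CUSTOMERS and ORDERS tables and their columns are hypothetical and will differ in your schema.
SQL
-- Hypothetical schema: CUSTOMERS(customer_id, customer_name), ORDERS(customer_id, order_total).
SELECT c.customer_id,
       c.customer_name,
       SUM(o.order_total) AS total_order_value
FROM   customers c
JOIN   orders o ON o.customer_id = c.customer_id
GROUP  BY c.customer_id, c.customer_name
ORDER  BY total_order_value DESC
FETCH  FIRST 10 ROWS ONLY;
SQLcl executes a statement like this through the run-sql tool, and the rows come back to Copilot as JSON, which is what you see rendered in the chat.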
Summary
• MCP gives Copilot a standard way to call tools (like SQLcl) without custom integrations.
• SQLcl runs on your machine and talks to Oracle; sql mcp exposes an MCP server over stdio.
• Copilot understands natural language, generates SQL, and uses MCP to run it and show results.
Once you’ve set up the wallet, saved the connection in SQL Developer, and added SQLcl to VS Code’s MCP config, you can query your Oracle database in plain English from Copilot Chat — no more juggling windows.
Further Reading
• Oracle SQLcl — Download and documentation
• Model Context Protocol — MCP specification
• GitHub Copilot + MCP in VS Code — Official setup guide

By Sanjay Mishra

The Latest Coding Topics

NeMo Agent Toolkit With Docker Model Runner
Agent observability is often missing in the rush to build AI agents. NeMo adds observability to AI agents, helping trace, evaluate, and debug multi-agent workflows.
April 15, 2026
by Siri Varma Vegiraju DZone Core CORE
· 95 Views
C/C++ Is Where Vulnerability Programs Go to Guess
Most security tools skip C and C++ entirely, or pretend they don't. Read on to learn what it actually takes to see what's there.
April 15, 2026
by Lexi Selldorff
· 85 Views
The Hidden Engineering Cost of XML in Enterprise Development Workflows
XML is still common in enterprise systems, but manual creation and schema changes create avoidable errors and debugging overhead.
April 14, 2026
by Moeez Ayub
· 324 Views
Building an AI-Powered SRE Incident Response Workflow With AWS Strands Agents
Learn how to automate CloudWatch alerts, Kubernetes remediation, and incident reporting using multi-agent AI workflows with the AWS Strands Agents SDK.
April 14, 2026
by Ayush Raj Jha
· 416 Views
Faster Releases With DevOps: Java Microservices and Angular UI in CI/CD
Jenkins automates build, containerizes, and deploys to AWS on every Git commit across Java microservices and Angular apps.
April 14, 2026
by Kavitha Thiyagarajan
· 476 Views
How to Test a GET API Request Using REST-Assured Java
Learn about testing GET API requests with REST Assured Java, send requests with headers and params, validate response body, time, and extract data.
April 13, 2026
by Faisal Khatri DZone Core CORE
· 624 Views · 2 Likes
Pushdown-First Modernization: Engineering Execution-Plan Stability in SAP HANA Migrations
In this article, you'll understand the behind-the-scenes of how the HANA SQL and view work during high and low record volume.
April 13, 2026
by Rajaganapathi Rangdale Srinivasa Rao
· 417 Views
Swift Concurrency, Part 3: Bridging Legacy APIs With Continuations
Swift Continuations: the essential bridge between legacy callback-based APIs and modern async/await. Wrap completion handlers and delegates into clean, linear code.
April 13, 2026
by Nikita Vasilev
· 452 Views
Apache Spark 3 to Apache Spark 4 Migration: What Breaks, What Improves, What's Mandatory
A comprehensive guide to migrating from Apache Spark 3.x to Spark 4.0, covering breaking changes, new features, and mandatory updates for smooth transition.
April 10, 2026
by Rambabu Bandam
· 1,617 Views
Enhancing SQL Server Performance with Query Store and Intelligent Query Processing
This article explores how to boost SQL Server performance using Query Store and Intelligent Query Processing to effectively resolve query regressions.
April 10, 2026
by arvind toorpu DZone Core CORE
· 1,211 Views · 2 Likes
Quality Crisis in Robotics and How Video Annotation Fixes It
Let’s uncover how robots learn from annotated video demonstrations — and how partnering with a reliable outsourcing provider enables scalable supervision.
April 10, 2026
by bunty yadav
· 914 Views
Swift Concurrency, Part 2: Parent/Child Relationship, Automatic Cancellation, Task Groups
Learn all about structured concurrency in Swift: parent/child relationship, automatic cancellation, task groups, and more.
April 10, 2026
by Nikita Vasilev
· 1,131 Views · 2 Likes
Clean Code: Functions and Error Handling in Go — From Chaos to Clarity, Part 1
After 1,000+ Go PR reviews — single responsibility, early returns, fmt.Errorf("%w"), and defer are the four rules that eliminate 80% of bad Go code.
April 10, 2026
by Vladimir Yakovlev
· 887 Views
Applying Oracle 19c Release Update (RU): A Practical Guide from My DBA Experience
Learn how to upgrade an Oracle 19c standalone database from 19.3 to 19.20 with a step-by-step guide covering preparation, patching, and validation.
April 9, 2026
by arvind toorpu DZone Core CORE
· 1,530 Views
Intent-Driven AI Frontends: AI Assistance to Enterprise Angular Architecture
An Angular application assisted by AI can convert natural language requests into data queries while maintaining complete control over execution and governance.
April 9, 2026
by Lavi Kumar
· 1,570 Views · 1 Like
Using Java for Developing Agentic AI Applications: The Enterprise-Ready Stack in 2026
Learn how to build reliable, production-ready agentic AI in Java using LangChain4j, Quarkus, MCP, and OpenTelemetry for scalable enterprise apps.
April 9, 2026
by Bhaskar Reddy Kollu
· 1,994 Views · 5 Likes
Translating OData Queries to MongoDB in Java With Jamolingo
If you want to support dynamic API queries using OData in a Java application backed by MongoDB, Jamolingo provides a lightweight and framework-agnostic solution.
April 9, 2026
by Szymon Tarnowski DZone Core CORE
· 1,168 Views · 1 Like
Tracking Dependencies Beyond the Build Stage
Many developers are familiar with dependency scanning at build time, but can we go further? And why is it worth doing so?
April 9, 2026
by Rumen Dimov
· 1,434 Views
Designing Docling Studio: Key Architecture Decisions
Architecture decisions behind Docling Studio: a visual inspection layer for Docling. From dual-engine design to Docker packaging.
April 9, 2026
by Pier-Jean MALANDRINO DZone Core CORE
· 1,184 Views · 1 Like
Why Queues Don’t Fix Scaling Problems
Queues absorb spikes but not sustained overload. Without backpressure, limits, and monitoring, backlogs grow until systems fail.
April 8, 2026
by David Iyanu Jonathan
· 1,372 Views · 2 Likes