TOPOLOGY-AWARE MULTI-GPU VM PLACEMENT

Architecting AI Infrastructure Series - Part 11

A multi-GPU VM isn’t only asking for multiple devices. It’s asking for a specific communication geometry. This distinction matters. When a platform team provisions a VM for LLM inference or fine-tuning, they’re not simply allocating two units of compute. They’re allocating two GPUs that can communicate at hundreds of gigabytes per second over NVLink. Two GPUs on the same server that must communicate via PCIe won’t deliver the same result.
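The "communication geometry" idea can be sketched in a few lines. The snippet below picks the best-connected GPU pair on a host using link labels in the style of `nvidia-smi topo -m` output; the 4-GPU topology matrix and the preference ranking are illustrative assumptions, not real hardware data.

```python
# Rough preference order: NVLink > same PCIe switch > same CPU > cross-socket.
LINK_RANK = {"NV": 0, "PIX": 1, "PXB": 2, "PHB": 3, "NODE": 4, "SYS": 5}

# Hypothetical 4-GPU host: two NVLink islands, everything else crosses sockets.
TOPO = {
    (0, 1): "NV12",  # NVLink between GPU0 and GPU1
    (0, 2): "SYS",   # traffic crosses the inter-socket link
    (0, 3): "SYS",
    (1, 2): "SYS",
    (1, 3): "SYS",
    (2, 3): "NV12",  # NVLink between GPU2 and GPU3
}

def rank(link: str) -> int:
    # NVLink entries appear as NV<n>; collapse them into one "NV" class.
    key = "NV" if link.startswith("NV") else link
    return LINK_RANK[key]

def best_pair(topo: dict) -> tuple:
    # Lowest rank wins: an NVLink pair beats any PCIe- or SYS-connected pair.
    return min(topo, key=lambda pair: rank(topo[pair]))

print(best_pair(TOPO))  # (0, 1)
```

A placement engine that ignores this ranking could hand out GPUs 1 and 2 here: two devices in the same server, but with every byte between them crossing the socket interconnect.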

UNDERSTANDING MULTI-GPU TOPOLOGIES WITHIN A SINGLE HOST

Architecting AI Infrastructure Series - Part 10

Part 9 covered why it’s important to understand the topology when using multiple GPUs. When a model runs across several GPUs, communication between them becomes part of the execution path. Not all GPUs in a server communicate at the same speed, and these differences can have a significant impact on performance. Many AI teams prefer to run their workloads on a single server, which reduces network complexity and simplifies deployment. Still, there are several ways to set up multiple GPUs in a single server.
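To get a feel for why link speed inside one host matters, the sketch below estimates the time to move a fixed payload over different interconnect classes. The bandwidth figures are ballpark, generation-dependent assumptions for illustration, not vendor specifications.

```python
BANDWIDTH_GBPS = {         # unidirectional, approximate assumptions
    "nvlink": 450.0,       # a modern NVLink generation
    "pcie_gen5_x16": 63.0, # a full PCIe Gen5 x16 slot
    "cross_socket": 32.0,  # traffic bouncing through the CPU interconnect
}

def transfer_ms(payload_gb: float, link: str) -> float:
    """Milliseconds to move payload_gb over the given link class."""
    return payload_gb / BANDWIDTH_GBPS[link] * 1000.0

payload = 8.0  # GB per step, e.g. activations or gradient shards
for link in BANDWIDTH_GBPS:
    print(f"{link:>15}: {transfer_ms(payload, link):7.1f} ms")
```

The same 8 GB exchange that takes tens of milliseconds over NVLink takes several times longer over PCIe and longer still across sockets, and that delta is paid on every communication step.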

UNDERSTANDING UNIFIED MEMORY ON DGX SPARK RUNNING NEMOCLAW AND NEMOTRON

NemoClaw became the talk of GTC 2026 within hours of its announcement. It wraps OpenClaw in NVIDIA’s OpenShell runtime, adds guardrails, and gives you an always-on AI agent with a single install. Jensen Huang called OpenClaw the operating system for personal AI. NemoClaw is what makes that usable. This is part 4 of the AI Memory series and focuses on how memory behaves on real systems. I installed NemoClaw on a DGX Spark and ran Nemotron models locally to understand what actually happens in memory. The most important takeaway is simple. Unified memory breaks the usual GPU mental model.
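One way to see how unified memory breaks the usual mental model is the "will it fit?" check. On a discrete GPU, weights plus KV cache must fit in VRAM; on a unified-memory system like DGX Spark, everything draws from one pool shared with the OS and CPU-side processes. The sizes below are illustrative assumptions, not measurements from a specific model.

```python
def fits(model_gb: float, kv_cache_gb: float, pool_gb: float,
         reserved_gb: float = 0.0) -> bool:
    """True if weights + KV cache fit in the usable memory pool."""
    return model_gb + kv_cache_gb <= pool_gb - reserved_gb

model_gb = 35.0  # e.g. a ~70B-parameter model quantized to 4-bit (assumption)
kv_gb = 6.0      # illustrative KV-cache budget

# Discrete 24 GB GPU: the weights alone exceed VRAM.
print(fits(model_gb, kv_gb, pool_gb=24.0))                     # False
# 128 GB unified pool, ~16 GB held back for the OS and CPU processes.
print(fits(model_gb, kv_gb, pool_gb=128.0, reserved_gb=16.0))  # True
```

The catch on unified memory is the `reserved_gb` term: the GPU budget is not a fixed number printed on a spec sheet but whatever the rest of the system leaves available.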

WHY MULTI GPU REQUIRES TOPOLOGY AWARENESS

Architecting AI Infrastructure Series - Part 9

The AI Memory series has been showing how AI workloads use GPU memory in different ways. The Dynamic World of LLM Runtime Memory explains how the KV cache grows with each new token and becomes a main user of GPU resources. Understanding Activation Memory in Mixture of Experts Models looks at the hardware pressure that happens when activation memory spikes during the prefill phase. The series also covers how agentic systems keep memory active to stay on track during complex tasks, as discussed in Durable Agentic AI Sessions in GPU Memory.

DURABLE AGENTIC AI SESSIONS IN GPU MEMORY

The durable memory of agentic systems

When a user asks a question in a chat interface and the model responds, the interaction is a single prompt completion. A prompt goes in, tokens come out. From an infrastructure perspective this is a predictable transaction. As described in The Dynamic World of LLM Runtime Memory, the KV cache grows with the prompt, peaks during generation, and is released when the session ends. The memory footprint is bounded and relatively easy to plan for.
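The bounded footprint of a single prompt completion can be put in numbers. Per token, each transformer layer stores one K and one V vector per KV head; the model shape below is a hypothetical Llama-like configuration chosen for illustration.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # Factor 2 covers the K and the V tensor per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_cache_gb(tokens: int, **shape) -> float:
    return tokens * kv_bytes_per_token(**shape) / 1024**3

# Assumed shape: 32 layers, 8 KV heads (GQA), head_dim 128, FP16.
shape = dict(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
for tokens in (1_000, 8_000, 32_000):
    print(f"{tokens:>6} tokens -> {kv_cache_gb(tokens, **shape):.2f} GB")
```

For this shape, every token costs 128 KiB of cache, so a chat session peaks at a predictable size and then frees it. A durable agentic session that never releases its context is exactly this curve without the release at the end.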

MIG PARTITIONING, PLACEMENT GEOMETRY, AND STRANDED CAPACITY

Architecting AI Infrastructure — Part 8

Previous articles in this series explained how time-sliced GPU sharing works in both same-size and mixed-size environments. They showed that choices like profiles and the order in which workloads start can directly affect GPU utilization and whether workloads are placed successfully. In this part, we look at MIG and the design choices that affect placement success and overall resource utilization. MIG takes a different approach to GPU sharing. Instead of multiplexing compute resources between workloads, MIG splits the GPU into hardware instances. Each instance gets its own dedicated compute and memory slices.
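The slice accounting behind MIG can be sketched for a 7-slice GPU (A100-style). The profile names follow the common 1g/2g/3g/4g/7g scheme; note that this only counts slices, while real placement also depends on instance geometry, which is exactly where stranded capacity comes from.

```python
SLICES = {"1g": 1, "2g": 2, "3g": 3, "4g": 4, "7g": 7}
TOTAL_COMPUTE_SLICES = 7

def fits_on_gpu(profiles: list) -> bool:
    """True if the requested profile mix fits the compute-slice budget."""
    used = sum(SLICES[p] for p in profiles)
    return used <= TOTAL_COMPUTE_SLICES

print(fits_on_gpu(["3g", "3g", "1g"]))  # True: exactly 7 of 7 slices
print(fits_on_gpu(["4g", "4g"]))        # False: 8 slices requested
```

A mix can pass this simple count yet still fail on real hardware, because each instance must start at a valid position in the slice layout; that geometry constraint is the placement problem this article examines.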

SAME SIZE VS MIXED SIZE PLACEMENT AT CLUSTER SCALE

Architecting AI Infrastructure — Part 7

The Silo Capacity Visualizer from Part 6 shows how profile selection and placement-ID alignment affect memory layout inside a single GPU. While that’s helpful for understanding the basics, real capacity planning happens at the cluster level. This article introduces the Same-size vs Mixed-size Placement simulator, the second tool in the Cluster Profile Strategy Toolset. It lets you simulate vGPU placement across an entire cluster using both same-size and mixed-size policies simultaneously, with the same workload sequence for both. This way, you can directly compare their results.
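A miniature version of the comparison the simulator performs: one workload sequence placed under a same-size policy (a GPU is locked to the first profile size it hosts) and a mixed-size policy (any size fits while capacity remains). The two-GPU host and profile sizes in GB are illustrative, and the sketch ignores placement-ID alignment, which the real mixed-size mode must honor.

```python
GPU_GB = 48
NUM_GPUS = 2

def place(sequence, same_size: bool) -> int:
    """Return how many workloads from the sequence were placed."""
    gpus = [{"free": GPU_GB, "locked": None} for _ in range(NUM_GPUS)]
    placed = 0
    for size in sequence:
        for g in gpus:
            if same_size and g["locked"] not in (None, size):
                continue                      # GPU is locked to another size
            if g["free"] >= size:
                g["free"] -= size
                g["locked"] = g["locked"] or size
                placed += 1
                break
    return placed

seq = [24, 12, 12, 24, 24]
print("same-size :", place(seq, same_size=True))   # 4 placed
print("mixed-size:", place(seq, same_size=False))  # 5 placed
```

With this sequence, the same-size policy strands 24 GB on the GPU locked to 12 GB profiles and rejects the last workload, while the mixed-size policy packs everything; other sequences can flip the outcome, which is why simulating both with identical input is useful.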

MIXED SIZE VGPU MODE IN PRACTICE

Architecting AI Infrastructure - Part 6

Last time, I looked at how Same Size vGPU mode works with different assignment policies and how right-sizing profiles can make placement more flexible. The main point was that both profile variety and assignment choices have a big impact on how much GPU capacity you can actually use over time.

Understanding Placement IDs and Siloed Capacity

This article focuses on Mixed Size mode. Unlike locking a GPU to one profile after the first placement, Mixed Size lets you use different profile sizes on the same device. This might seem like an easy fix for fragmentation, but it brings a new challenge: placement IDs. These are fixed memory spots on the GPU where a profile can begin, so even if memory appears free, you can’t always use it unless it aligns with a valid placement spot. For more details on how placement IDs work, see Part 4.
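A toy model makes the placement-ID constraint concrete: a profile may only start at fixed offsets, so free memory that does not align with a valid offset is unusable. The offset table and sizes below are invented for illustration, not a real vGPU profile table.

```python
GPU_GB = 48
VALID_STARTS = {12: [0, 12, 24, 36], 24: [0, 24]}  # allowed offsets per size

def can_place(size: int, occupied: list) -> bool:
    """occupied holds (start, size) regions already placed on the GPU."""
    for start in VALID_STARTS[size]:
        end = start + size
        # The candidate region must not overlap any occupied region.
        if all(end <= s or start >= s + sz for s, sz in occupied):
            return True
    return False

# Two 12GB profiles sit at offsets 12 and 24.
occupied = [(12, 12), (24, 12)]
free_gb = GPU_GB - sum(sz for _, sz in occupied)
print(free_gb)                  # 24 GB still free...
print(can_place(24, occupied))  # ...but both 24GB placement IDs overlap: False
```

Half the GPU is free, yet a 24 GB profile cannot land because its only valid starts (0 and 24) each collide with an existing placement: free memory and usable memory are not the same thing.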

HOW SAME SIZE VGPU MODE AND RIGHT-SIZING SHAPE GPU PLACEMENT EFFICIENCY

Architecting AI Infrastructure - Part 5

In the previous article, we looked at how GPUs are placed within an ESXi host and how GPU modes and assignment policies determine which physical GPU a workload uses. These decisions impact more than just the initial placement of workloads. They also shape how GPU capacity changes over time, affecting fragmentation, consolidation, and how easily new workloads can be scheduled. In this article, we will look at workloads that use fractional GPU profiles and how their sizing choices impact overall platform efficiency.

HOW VSPHERE GPU MODES AND ASSIGNMENT POLICIES DETERMINE HOST LEVEL PLACEMENT

Architecting AI Infrastructure - Part 4

In the last article, we tracked a GPU-backed VM from resource configuration to host selection. DRS evaluated the cluster, Assignable Hardware filtered hosts for GPU compatibility, DRS ran its Goodness calculation, and picked a destination host. Now, the host is selected. But the placement is not finished. Inside the host, another set of decisions determines which physical GPU gets the workload and what types of workloads that GPU will handle from then on. These host-level choices are less visible than DRS decisions. They do not show up in dashboards or trigger alerts. However, their effects add up over time, and they play a key role in keeping a shared AI platform healthy or letting it decline.
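The two-stage flow described above, filter then score, can be sketched in miniature: a compatibility filter in the spirit of Assignable Hardware, followed by picking the best survivor in the spirit of the DRS Goodness calculation. The host inventory and the scoring formula are invented for illustration; the real DRS algorithm weighs far more inputs.

```python
HOSTS = [
    {"name": "esx-01", "gpu_profile": "a100-40c", "free_gpus": 0, "cpu_free": 0.60},
    {"name": "esx-02", "gpu_profile": "a100-40c", "free_gpus": 2, "cpu_free": 0.35},
    {"name": "esx-03", "gpu_profile": "l40-48c",  "free_gpus": 4, "cpu_free": 0.80},
]

def select_host(hosts, profile: str):
    # Stage 1: hard compatibility filter (the Assignable Hardware role).
    candidates = [h for h in hosts
                  if h["gpu_profile"] == profile and h["free_gpus"] > 0]
    if not candidates:
        return None
    # Stage 2: rank the survivors with a toy goodness score.
    return max(candidates, key=lambda h: h["free_gpus"] + h["cpu_free"])

print(select_host(HOSTS, "a100-40c")["name"])  # esx-02
```

Note what the cluster-level stage cannot see: once esx-02 is chosen, the host-level GPU modes and assignment policies this article covers decide which of its physical GPUs actually takes the workload.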