CUDA (Compute Unified Device Architecture) is a GPU computing platform and programming model from NVIDIA that exposes hardware-level parallel execution capabilities to software.
- Provides direct access to GPU cores for general-purpose parallel computation.
- Executes kernels using thousands of concurrent threads organized into blocks and grids.
- Offloads data-parallel workloads from the CPU to the GPU for higher throughput.
- Commonly used in AI training, deep learning inference, and high-performance computing workloads.
Introduction
This section explains the physical limits of CPUs that led to the rise of GPUs and helps you set up a free development environment in the cloud.
- Sequential Processing & The Need for Parallelism
- Introduction to GPU Computing
- Introduction to CUDA
- Setting Up Google Colab for CUDA
- CUDA Installation and Setup in VS Code
- Compiling CUDA Programs (NVCC)
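Once the environment is set up, a minimal program like the one below can verify the toolchain end to end. The filename and compile command in the comments are illustrative, not prescribed by this outline.

```cuda
// hello.cu -- a minimal sanity check for the CUDA toolchain.
// Compile and run (illustrative filename):
//   nvcc hello.cu -o hello && ./hello
#include <cstdio>

// __global__ marks a function that runs on the GPU
// and is launched from host code.
__global__ void hello_kernel() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    // Launch 2 blocks of 4 threads each (8 GPU threads total).
    hello_kernel<<<2, 4>>>();
    // Kernel launches are asynchronous; wait for the GPU to finish
    // so the device-side printf output is flushed before exit.
    cudaDeviceSynchronize();
    return 0;
}
```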
Basics & Syntax
This section defines the core CUDA C++ execution model and language-level constructs used to declare device code and launch GPU kernels from host programs.
- Writing First CUDA Program
- Function Specifiers: __global__, __device__ & __host__
- Launching a Kernel
- Passing Parameters & Error Handling
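The topics above can be sketched in one short program: a kernel that takes parameters, a launch configuration rounded up to cover every element, and the standard two-step error check. The kernel name and sizes are illustrative.

```cuda
// Sketch: passing parameters to a kernel and checking for errors.
#include <cstdio>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // guard against out-of-range threads
}

int main() {
    const int n = 1000;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Round the grid size up so every element gets a thread.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, 2.0f, n);

    // cudaGetLastError catches launch-configuration errors;
    // cudaDeviceSynchronize surfaces errors raised during execution.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```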
Threads & Memory Management
This section describes CUDA's thread hierarchy and device memory model, focusing on how work is indexed, distributed, and mapped to GPU hardware resources.
- Thread Hierarchy: Threads, Blocks & Grids
- Calculating Global Thread IDs
- The Hardware Model: SMs vs. SPs
- Host vs. Device Memory
- Unified Memory (Managed)
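A vector addition ties these topics together: the standard global-thread-ID formula plus unified (managed) memory, which removes explicit host-to-device copies. This is a minimal sketch; sizes and names are illustrative.

```cuda
// Sketch: global thread IDs with unified (managed) memory.
#include <cstdio>

__global__ void add(const float *a, const float *b, float *c, int n) {
    // Global thread ID: block offset plus position within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Managed memory is accessible from both host and device;
    // the driver migrates pages on demand.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();        // wait before the host reads c
    printf("c[0] = %f\n", c[0]);    // expect 3.0

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```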
Performance Optimization
This section covers memory-access patterns, on-chip memory usage, and transfer strategies required to maximize kernel throughput and minimize latency bottlenecks.
- Memory Hierarchy Overview
- Shared Memory & Bank Conflicts
- Memory Coalescing
- Pinned Memory (Page-Locked)
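Shared memory and coalescing interact in a classic pattern: stage a chunk in on-chip shared memory so that all global-memory traffic stays coalesced. The sketch below reverses each block's chunk under the simplifying assumption that n is a multiple of the block size.

```cuda
// Sketch: staging data in shared memory. Each block reverses its own
// 256-element chunk. Consecutive threads read and write consecutive
// global addresses, so both transfers are coalesced; consecutive
// threads also hit consecutive shared-memory words, i.e. distinct
// banks, so there are no bank conflicts.
#define BLOCK 256

__global__ void reverse_chunks(float *data, int n) {
    __shared__ float tile[BLOCK];       // on-chip, one copy per block
    int i = blockIdx.x * BLOCK + threadIdx.x;
    // Assumes n is a multiple of BLOCK so the whole tile is valid.
    tile[threadIdx.x] = data[i];
    __syncthreads();                    // tile fully loaded before reuse
    data[i] = tile[BLOCK - 1 - threadIdx.x];
}
```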
Synchronization & Atomics
This section shows how to coordinate thousands of concurrent threads safely, preventing races when many threads read and write the same data in parallel.
- Thread Safety & __syncthreads
- Atomic Operations
- CUDA Streams & Concurrency
- Case Study: Parallel Reduction
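The case study combines both primitives: a shared-memory tree reduction synchronized with __syncthreads, with one atomicAdd per block to merge partial sums. A minimal sketch, assuming blockDim.x is a power of two and *out is zero-initialized before launch.

```cuda
// Sketch: block-level parallel reduction (sum) in shared memory.
__global__ void reduce_sum(const float *in, float *out, int n) {
    extern __shared__ float sdata[];    // sized at launch time
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();    // all writes visible before the next step
    }
    // One atomic per block is cheap compared to one per element.
    if (tid == 0) atomicAdd(out, sdata[0]);
}
// Launch sketch (dynamic shared memory is the third launch argument):
//   reduce_sum<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
```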
Advanced Techniques & Libraries
This section introduces profiling tools, advanced execution features, and optimized CUDA libraries used for production-grade GPU applications.
- Profiling with Nsight Systems
- Dynamic Parallelism
- Essential Libraries: Thrust & cuBLAS
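Libraries like Thrust hide kernel configuration entirely. The same reduction from the previous section collapses to a few lines; the vector size here is illustrative.

```cuda
// Sketch: a device-side reduction via Thrust, no explicit kernel needed.
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdio>

int main() {
    thrust::device_vector<float> v(1000, 1.0f);  // 1000 ones on the GPU
    float sum = thrust::reduce(v.begin(), v.end(), 0.0f);
    printf("sum = %f\n", sum);                   // expect 1000.0
    return 0;
}
```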
Deep Learning & PyTorch Extensions
This section explains how CUDA integrates with deep learning frameworks and how custom GPU kernels are exposed to Python via C++ extensions.
- PyTorch Internals: How torch.cuda Works
- Writing a C++ Extension for PyTorch
- Project: Custom ReLU Kernel
- Project: MNIST Inference Engine
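The custom-ReLU project reduces to three parts: a kernel, a C++ launcher taking torch::Tensor arguments, and a pybind11 binding. A sketch of the CUDA side is below; the filename and Python-facing module name are illustrative, and it assumes a contiguous float32 CUDA tensor built with torch.utils.cpp_extension.

```cuda
// relu_ext.cu (illustrative filename) -- CUDA side of a PyTorch extension.
#include <torch/extension.h>

__global__ void relu_kernel(const float *in, float *out, int64_t n) {
    int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] > 0.0f ? in[i] : 0.0f;
}

torch::Tensor relu_forward(torch::Tensor x) {
    auto y = torch::empty_like(x);          // output on the same device
    int64_t n = x.numel();
    int threads = 256;
    int blocks = (int)((n + threads - 1) / threads);
    relu_kernel<<<blocks, threads>>>(
        x.data_ptr<float>(), y.data_ptr<float>(), n);
    return y;
}

// Expose the launcher to Python, e.g. as my_ext.relu.
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("relu", &relu_forward, "ReLU forward (CUDA)");
}
```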
Projects
This section applies CUDA concepts to end-to-end implementations that demonstrate real parallel workload design and optimization.
- Matrix Multiplication
- Image Processing
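The matrix multiplication project typically starts from the naive 2D-grid kernel below, where one thread computes one output element; the tiled shared-memory version from the optimization section is the usual next step. Names and the 16x16 block shape are illustrative.

```cuda
// Sketch: naive matrix multiplication C = A * B for square N x N
// matrices stored in row-major order.
__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
// Launch sketch:
//   dim3 block(16, 16);
//   dim3 grid((N + 15) / 16, (N + 15) / 16);
//   matmul<<<grid, block>>>(d_A, d_B, d_C, N);
```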