CUDA Tutorial

Last Updated : 27 Feb, 2026

CUDA (Compute Unified Device Architecture) is a GPU computing platform and programming model from NVIDIA that exposes hardware-level parallel execution capabilities to software.

  • Provides direct access to GPU cores for general-purpose parallel computation.
  • Executes kernels using thousands of concurrent threads organized into blocks and grids.
  • Offloads data-parallel workloads from the CPU to the GPU for higher throughput.
  • Commonly used in AI training, deep learning inference, and high-performance computing (HPC) workloads.
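To make the bullet points above concrete, here is a minimal vector-add sketch of the CUDA workflow: a kernel runs once per thread, the grid is sized to cover the data, and the host synchronizes before reading results. Names like `vecAdd` are illustrative, and unified (managed) memory is used only to keep the sketch short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);   // unified memory: visible to CPU and GPU
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // grid sized to cover all n elements
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();        // kernel launches are asynchronous

    printf("c[0] = %f\n", c[0]);    // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```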

Introduction

This section explains the physical limits of CPUs that led to the rise of GPUs and helps you set up a free development environment in the cloud.

Basics & Syntax

This section defines the core CUDA C++ execution model and language-level constructs used to declare device code and launch GPU kernels from host programs.
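The core language-level constructs mentioned above can be sketched as follows: `__global__` declares a kernel launchable from the host, `__device__` declares a GPU-only helper, and the triple-chevron execution configuration sets the grid and block dimensions. The function names here are illustrative.

```cuda
// __global__ marks a kernel: called from the host, executed on the device.
__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// __device__ functions run on the GPU and are callable only from device code.
__device__ float square(float v) { return v * v; }

// Host side: the <<<numBlocks, threadsPerBlock>>> execution configuration
// tells the runtime how many threads to launch, e.g.:
//   scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
```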

Threads & Memory Management

This section describes CUDA's thread hierarchy and device memory model, focusing on how work is indexed, distributed, and mapped to GPU hardware resources.

  • Thread Hierarchy: Threads, Blocks & Grids
  • Calculating Global Thread IDs
  • The Hardware Model: SMs vs. SPs
  • Host vs. Device Memory
  • Unified Memory (Managed)

Performance Optimization

This section covers memory-access patterns, on-chip memory usage, and transfer strategies required to maximize kernel throughput and minimize latency bottlenecks.

  • Memory Hierarchy Overview
  • Shared Memory & Bank Conflicts
  • Memory Coalescing
  • Pinned Memory (Page-Locked)

Synchronization & Atomics

This section explains how to coordinate thousands of concurrent threads so they do not race on shared data, covering barriers, atomic operations, and stream-level concurrency.

  • Thread Safety & __syncthreads
  • Atomic Operations
  • CUDA Streams & Concurrency
  • Case Study: Parallel Reduction
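The parallel-reduction case study combines the first two bullets above: `__syncthreads()` acts as a barrier between phases of a shared-memory tree sum, and `atomicAdd` safely merges the per-block results. A sketch that assumes it is launched with 256 threads per block (a power of two):

```cuda
__global__ void sumReduce(const float *in, float *total, int n) {
    __shared__ float partial[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();   // all loads visible before the tree reduction starts

    // Halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // One atomic per block merges block sums without a second kernel.
    if (tid == 0) atomicAdd(total, partial[0]);
}
```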

Advanced Techniques & Libraries

This section introduces profiling tools, advanced execution features, and optimized CUDA libraries used for production-grade GPU applications.

  • Profiling with Nsight Systems
  • Dynamic Parallelism
  • Essential Libraries: Thrust & cuBLAS
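As a taste of the library topics above, Thrust offers STL-style containers and algorithms that run on the GPU, so a sort plus a reduction needs no hand-written kernels:

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main() {
    int h[] = {5, 2, 8, 1, 9};
    thrust::device_vector<int> d(h, h + 5);  // copy host data to the GPU

    thrust::sort(d.begin(), d.end());               // parallel sort on device
    int sum = thrust::reduce(d.begin(), d.end());   // parallel reduction

    printf("sum = %d\n", sum);   // 5+2+8+1+9 = 25
    return 0;
}
```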

Deep Learning & PyTorch Extensions

This section explains how CUDA integrates with deep learning frameworks and how custom GPU kernels are exposed to Python via C++ extensions.

  • PyTorch Internals: How torch.cuda Works
  • Writing a C++ Extension for PyTorch
  • Project: Custom ReLU Kernel
  • Project: MNIST Inference Engine
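The extension workflow above can be sketched as a single file: a CUDA kernel, a host wrapper that operates on `torch::Tensor`, and a pybind11 module that makes it callable from Python. This is a minimal float-only illustration (no dtype or device checks), not a production extension.

```cuda
#include <torch/extension.h>

// Device kernel: elementwise ReLU.
__global__ void relu_kernel(const float *in, float *out, int64_t n) {
    int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] > 0.0f ? in[i] : 0.0f;
}

// Host wrapper: allocates the output and launches the kernel.
torch::Tensor relu(torch::Tensor x) {
    auto out = torch::empty_like(x);
    int64_t n = x.numel();
    int threads = 256;
    int blocks = (int)((n + threads - 1) / threads);
    relu_kernel<<<blocks, threads>>>(
        x.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}

// Exposed to Python; after building, call it as my_ext.relu(tensor).
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("relu", &relu, "Custom CUDA ReLU");
}
```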

Projects

This section applies CUDA concepts to end-to-end implementations that demonstrate real parallel workload design and optimization.

  • Matrix Multiplication
  • Image Processing
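The matrix multiplication project typically starts from the naive kernel below, where one thread computes one output element, and then layers on the optimizations from earlier sections (shared-memory tiling, coalescing). Kernel and variable names are illustrative.

```cuda
// Naive matrix multiply: one thread per element of C = A * B (n x n, row-major).
__global__ void matMul(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

// Launch with a 2D grid covering the output matrix, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((n + 15) / 16, (n + 15) / 16);
//   matMul<<<grid, block>>>(dA, dB, dC, n);
```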