CUDA Programming Guide
The official, comprehensive resource for developers to learn the CUDA programming model and how to write high-performance code that executes on NVIDIA GPUs. This guide covers the platform architecture, programming interface, advanced hardware features, and technical specifications.
Lessons
This lesson introduces the fundamental shift from latency-optimized CPU architectures to throughput-oriented GPU computing. Students will learn to distinguish between these processing models and understand how the CUDA programming platform enables massive parallel execution for data-intensive tasks.
This lesson introduces the fundamentals of CUDA kernel development, focusing on the SIMT execution model and the use of the __global__ specifier to declare kernels, which the host launches to run in parallel on the GPU's Streaming Multiprocessors. Students will learn how to manage asynchronous kernel execution, handle device memory, and structure code to ensure effective hardware utilization.
This lesson explores the fundamental differences between von Neumann and Harvard architectures, focusing on how memory access pathways impact computational performance. Students will learn to identify the von Neumann bottleneck, understand the benefits of split-cache Harvard designs, and analyze how modern systems utilize a Modified Harvard Architecture to balance throughput with programming flexibility.
AI021: Optimization, Graphs, and Hardware Accelerators (Lesson 4) explores the shift from CPU-bottlenecked stream execution to GPU-autonomous workflows. Students will learn to utilize modern primitives like CUDA Graphs, lazy loading, and asynchronous memory prefetching to minimize host-side overhead and maximize hardware efficiency.
This lesson explores the technical reference and language extensions in CUDA, focusing on the relationship between virtual architectures (PTX) and real hardware (SASS). Students will learn to manage compute capabilities, utilize architecture-specific macros, and navigate language constraints to ensure code portability and performance.
Course Overview
📚 Content Summary
Master the art of parallel computing with the industry-standard guide to NVIDIA CUDA.
Author: NVIDIA Corporation
Acknowledgments: Copyright © 2007-2024 NVIDIA Corporation & affiliates. All rights reserved.
🎯 Learning Objectives
- Define the roles of the host (CPU) and device (GPU) within a heterogeneous system.
- Explain the SIMT programming model and the hierarchical organization of threads, blocks, and grids.
- Differentiate between PTX (Parallel Thread Execution) and binary code (cubins) and explain how Just-in-Time (JIT) compilation facilitates compatibility.
- Develop and Compile CUDA Kernels: Write __global__ functions, configure execution with triple-chevron (<<<...>>>) notation, and manage the NVCC compilation workflow.
- Optimize Memory and Data Movement: Distinguish between Unified, Explicit, and Mapped memory models, and implement page-locked host memory for efficient transfers.
- Manage Parallel Execution: Utilize CUDA Streams, Events, and Cooperative Groups to manage asynchronous tasks and synchronize CPU-GPU operations.
- Perform complex pointer arithmetic and identify architectural bottlenecks (von Neumann vs. Harvard).
- Implement advanced CUDA execution patterns, including Programmatic Dependent Kernel Launches and Heterogeneous Batched Memory Transfers.
- Utilize hardware-specific features like Thread Scopes, Asynchronous Proxies, and Pipelines to maximize concurrency.
- Configure and tune Unified Memory performance using prefetching, usage hints, and page size management.
Lessons
Overview: This lesson introduces the CUDA parallel computing platform and its underlying hardware architecture. It explores how heterogeneous systems utilize both CPUs and GPUs, the SIMT (Single Instruction, Multiple Threads) programming model, and the hierarchy of threads, blocks, and grids. Additionally, it covers the CUDA compilation workflow, including the roles of PTX, cubins, and fatbins in ensuring binary and forward compatibility.
Learning Outcomes:
- Define the roles of the host (CPU) and device (GPU) within a heterogeneous system.
- Explain the SIMT programming model and the hierarchical organization of threads, blocks, and grids.
- Differentiate between PTX (Parallel Thread Execution) and binary code (cubins) and explain how Just-in-Time (JIT) compilation facilitates compatibility.
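The difference between cubin and PTX compatibility in the last outcome can be modelled in a few lines of host-only C++. This is a simplified sketch, not NVIDIA's implementation: the `ComputeCapability` struct and both function names are hypothetical. It assumes the rules as they are usually stated: a cubin runs on devices with the same major version and an equal or higher minor version, while PTX can be JIT-compiled for any device of equal or higher compute capability. Architecture-specific targets, which do not follow the general forward-compatibility rule, are deliberately left out.

```cpp
// Hypothetical helper type: a compute capability such as 8.6.
struct ComputeCapability {
    int major_version;
    int minor_version;
};

// Binary (cubin) compatibility: same major version, and the device's
// minor version is at least the one the cubin was built for.
constexpr bool cubin_runs_on(ComputeCapability built_for, ComputeCapability device) {
    return built_for.major_version == device.major_version &&
           device.minor_version >= built_for.minor_version;
}

// PTX compatibility: the driver can JIT-compile PTX for any device whose
// compute capability is greater than or equal to the PTX target.
constexpr bool ptx_can_jit_for(ComputeCapability built_for, ComputeCapability device) {
    return device.major_version > built_for.major_version ||
           (device.major_version == built_for.major_version &&
            device.minor_version >= built_for.minor_version);
}

// A cubin built for 8.0 runs on an 8.6 device but not on a 9.0 device.
static_assert(cubin_runs_on({8, 0}, {8, 6}), "same major, newer minor");
static_assert(!cubin_runs_on({8, 0}, {9, 0}), "different major version");
// PTX built for 8.0 can still be JIT-compiled for the 9.0 device.
static_assert(ptx_can_jit_for({8, 0}, {9, 0}), "PTX is forward compatible");
```

Under this model, shipping PTX alongside cubins in a fatbin is what lets an application run on GPUs released after it was built.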
Overview: This lesson covers the fundamental and advanced aspects of GPU programming using CUDA C++. It transitions from basic kernel specification and the NVCC compilation workflow to complex execution management topics, including SIMT kernel design, shared memory bank conflicts, and asynchronous execution using streams and events. Students will learn to balance memory models (Unified vs. Explicit) and optimize hardware occupancy for high-performance computing.
Learning Outcomes:
- Develop and Compile CUDA Kernels: Write __global__ functions, configure execution with triple-chevron (<<<...>>>) notation, and manage the NVCC compilation workflow.
- Optimize Memory and Data Movement: Distinguish between Unified, Explicit, and Mapped memory models, and implement page-locked host memory for efficient transfers.
- Manage Parallel Execution: Utilize CUDA Streams, Events, and Cooperative Groups to manage asynchronous tasks and synchronize CPU-GPU operations.
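The arithmetic behind a triple-chevron launch configuration can be checked on the host before any kernel is written. The sketch below is a host-only model with hypothetical function names (`blocks_for`, `global_index`, `in_bounds`). It mirrors the common one-dimensional pattern in which each thread computes `blockIdx.x * blockDim.x + threadIdx.x` and skips work when that index falls past the end of the data.

```cpp
#include <cstddef>

// Number of blocks needed so that blocks * threads_per_block >= n
// (ceiling division). Assumes threads_per_block > 0.
constexpr std::size_t blocks_for(std::size_t n, std::size_t threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Host-side model of the index a thread computes inside a 1D kernel:
// blockIdx.x * blockDim.x + threadIdx.x
constexpr std::size_t global_index(std::size_t block_idx,
                                   std::size_t block_dim,
                                   std::size_t thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// The bounds guard a kernel needs when n is not a multiple of the block size.
constexpr bool in_bounds(std::size_t index, std::size_t n) {
    return index < n;
}

// 1000 elements with 256 threads per block need 4 blocks (1024 threads).
static_assert(blocks_for(1000, 256) == 4, "ceiling division");
// Thread 231 of block 3 handles the last element, index 999.
static_assert(global_index(3, 256, 231) == 999, "last valid index");
// Thread 232 of block 3 maps to index 1000 and must do nothing.
static_assert(!in_bounds(global_index(3, 256, 232), 1000), "guarded thread");
```

In an actual launch, the first value would typically appear as the grid size and the second as the block size inside `<<<...>>>`, with the guard written as an `if` at the top of the kernel body.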
Overview: This lesson explores the transition from fundamental memory architectures and pointer logic to advanced GPU acceleration techniques. It covers the hardware-level execution models (SIMT, Independent Thread Scheduling), sophisticated synchronization mechanisms (Asynchronous Barriers, Scoped Atomics), and the orchestration of multi-GPU systems using both Runtime and Driver APIs.
Learning Outcomes:
- Perform complex pointer arithmetic and identify architectural bottlenecks (von Neumann vs. Harvard).
- Implement advanced CUDA execution patterns, including Programmatic Dependent Kernel Launches and Heterogeneous Batched Memory Transfers.
- Utilize hardware-specific features like Thread Scopes, Asynchronous Proxies, and Pipelines to maximize concurrency.
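As a small illustration of the pointer arithmetic named in the first outcome, the sketch below indexes a row-major 2D array through a flat pointer. The helper names `element_at` and `byte_offset` are hypothetical and chosen only for this example. The point is that adding an integer to a `float*` advances by whole elements, not bytes.

```cpp
#include <cstddef>

// Address of element (row, col) in a row-major matrix with `cols` columns,
// stored contiguously starting at `base`.
inline float* element_at(float* base, std::size_t cols,
                         std::size_t row, std::size_t col) {
    return base + row * cols + col;  // pointer + n advances by n elements
}

// Distance between two pointers measured in bytes rather than elements.
inline std::ptrdiff_t byte_offset(const float* from, const float* to) {
    return reinterpret_cast<const char*>(to) -
           reinterpret_cast<const char*>(from);
}
```

For a 3x4 matrix, `element_at(m, 4, 2, 1)` sits 9 elements past `m`, which is `9 * sizeof(float)` bytes. The same flat, row-major layout is commonly used for buffers passed to kernels.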
Overview: This lesson covers high-performance CUDA programming techniques, focusing on optimizing data movement and execution flow. It explores the transition from stream-based execution to persistent CUDA Graphs, the granular control of Unified Memory through prefetching and hints, and the utilization of hardware-specific accelerators like the Tensor Memory Accelerator (TMA) and L2 Cache persistence. Additionally, it details advanced synchronization patterns, resource partitioning via Green Contexts, and cross-API interoperability for modern heterogeneous computing.
Learning Outcomes:
- Configure and tune Unified Memory performance using prefetching, usage hints, and page size management.
- Construct, update, and execute CUDA Graphs, including the use of memory nodes and device-side launches.
- Implement advanced synchronization using Asynchronous Barriers and the Producer-Consumer pattern.
Overview: This lesson provides a deep technical dive into the CUDA programming model's reference specifications and C++ language extensions. It covers the hardware-software interface via compute capabilities, environment variables for runtime control, and the specific syntax requirements for writing high-performance device code using modern C++ standards, cooperative groups, and specialized hardware intrinsics.
Learning Outcomes:
- Identify hardware constraints and feature sets based on GPU Compute Capability versions.
- Configure the CUDA execution environment and JIT compilation using system-level environment variables.
- Apply C++ language extensions (annotations, lambdas, and templates) while adhering to device-side restrictions.
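Architecture-specific code paths are usually gated on the `__CUDA_ARCH__` macro, whose value during device compilation encodes the target compute capability as major * 100 + minor * 10. The host-only sketch below reproduces that encoding so the comparison used in such guards can be reasoned about without a GPU. The function names `cuda_arch_value` and `arch_at_least` are hypothetical.

```cpp
// Value __CUDA_ARCH__ takes when compiling device code for the given
// compute capability, for example 860 for compute capability 8.6.
constexpr int cuda_arch_value(int major_version, int minor_version) {
    return major_version * 100 + minor_version * 10;
}

// Host-side model of a guard such as `#if __CUDA_ARCH__ >= 800`.
constexpr bool arch_at_least(int arch_value, int major_version, int minor_version) {
    return arch_value >= cuda_arch_value(major_version, minor_version);
}

static_assert(cuda_arch_value(8, 6) == 860, "compute capability 8.6");
static_assert(cuda_arch_value(9, 0) == 900, "compute capability 9.0");
// Code compiled for 8.6 passes a ">= 8.0" guard but not a ">= 9.0" guard.
static_assert(arch_at_least(860, 8, 0), "8.6 satisfies >= 8.0");
static_assert(!arch_at_least(860, 9, 0), "8.6 does not satisfy >= 9.0");
```

Because `__CUDA_ARCH__` is undefined while the host pass is compiled, real guards generally also need a fallback branch for host code.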