AI021 Professional

CUDA Programming Guide

The official, comprehensive resource for developers to learn the CUDA programming model and how to write high-performance code that executes on NVIDIA GPUs. This guide covers the platform architecture, programming interface, advanced hardware features, and technical specifications.

5.0
30.0h
1762 students
Artificial Intelligence

Lessons

This lesson introduces the fundamental shift from latency-optimized CPU architectures to throughput-oriented GPU computing. Students will learn to distinguish between these processing models and understand how the CUDA programming platform enables massive parallel execution for data-intensive tasks.

This lesson introduces the fundamentals of CUDA kernel development, focusing on the SIMT execution model and the use of the __global__ specifier to declare kernels, which the host launches to run in parallel on GPU Streaming Multiprocessors. Students will learn how to manage asynchronous kernel execution, handle device memory, and structure code to ensure effective hardware utilization.

This lesson explores the fundamental differences between von Neumann and Harvard architectures, focusing on how memory access pathways impact computational performance. Students will learn to identify the von Neumann bottleneck, understand the benefits of split-cache Harvard designs, and analyze how modern systems utilize a Modified Harvard Architecture to balance throughput with programming flexibility.

Lesson 4, Optimization, Graphs, and Hardware Accelerators, explores the shift from CPU-bottlenecked stream execution to GPU-autonomous workflows. Students will learn to use modern primitives such as CUDA Graphs, lazy loading, and asynchronous memory prefetching to minimize host-side overhead and maximize hardware efficiency.

This lesson explores the technical reference and language extensions in CUDA, focusing on the relationship between virtual architectures (PTX) and real hardware (SASS). Students will learn to manage compute capabilities, utilize architecture-specific macros, and navigate language constraints to ensure code portability and performance.
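As a taste of what the kernel-development lesson covers, here is a minimal sketch of a kernel declared with __global__ and launched with triple-chevron notation. The names `vecAdd`, `check`, and `addOnGpu`, and the block size of 256, are illustrative assumptions and do not come from the course material. Error handling is simplified: device buffers are not freed if a call fails.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

// Kernel: every thread handles one element of the arrays.
__global__ void vecAdd(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard the partial last block
        out[i] = a[i] + b[i];
    }
}

// Hypothetical helper: turn a CUDA error code into a C++ exception.
static void check(cudaError_t err) {
    if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
}

// Hypothetical host wrapper: copy inputs to the device, launch, copy back.
std::vector<float> addOnGpu(const std::vector<float>& a,
                            const std::vector<float>& b) {
    if (a.size() != b.size()) throw std::invalid_argument("size mismatch");
    const int n = static_cast<int>(a.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dOut = nullptr;
    check(cudaMalloc(&dA, bytes));
    check(cudaMalloc(&dB, bytes));
    check(cudaMalloc(&dOut, bytes));
    check(cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice));
    check(cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;                         // assumed block size
    const int blocks = (n + threads - 1) / threads;  // round up to cover n
    vecAdd<<<blocks, threads>>>(dA, dB, dOut, n);
    check(cudaGetLastError());       // reports an invalid launch configuration
    check(cudaDeviceSynchronize());  // the launch is asynchronous, so wait here

    check(cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost));
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dOut);
    return out;
}
```

Calling `addOnGpu({1, 2, 3}, {10, 20, 30})` from a host `main` and building the file with nvcc should return the element-wise sums. The explicit `cudaDeviceSynchronize()` is there to make the asynchronous launch visible; the blocking `cudaMemcpy` that follows would also wait for the kernel.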

Course Overview

📚 Content Summary

Master the art of parallel computing with the industry-standard guide to NVIDIA CUDA.

Author: NVIDIA Corporation

Acknowledgments: Copyright © 2007-2024 NVIDIA Corporation & affiliates. All rights reserved.

🎯 Learning Objectives

  1. Define the roles of the host (CPU) and device (GPU) within a heterogeneous system.
  2. Explain the SIMT programming model and the hierarchical organization of threads, blocks, and grids.
  3. Differentiate between PTX (Parallel Thread Execution) and binary code (cubins) and explain how Just-in-Time (JIT) compilation facilitates compatibility.
  4. Develop and Compile CUDA Kernels: Write __global__ functions, configure execution with triple-chevron notation, and manage the nvcc compilation workflow.
  5. Optimize Memory and Data Movement: Distinguish between Unified, Explicit, and Mapped memory models, and implement page-locked host memory for efficient transfers.
  6. Manage Parallel Execution: Utilize CUDA Streams, Events, and Cooperative Groups to manage asynchronous tasks and synchronize CPU-GPU operations.
  7. Perform complex pointer arithmetic and identify architectural bottlenecks (von Neumann vs. Harvard).
  8. Implement advanced CUDA execution patterns, including Programmatic Dependent Kernel Launches and Heterogeneous Batched Memory Transfers.
  9. Utilize hardware-specific features like Thread Scopes, Asynchronous Proxies, and Pipelines to maximize concurrency.
  10. Configure and tune Unified Memory performance using prefetching, usage hints, and page size management.
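Objectives 5 and 6 can be pictured together in one short sketch: page-locked (pinned) host memory, asynchronous copies queued on a stream, and an event used to synchronize the CPU with the GPU. The names `scaleKernel`, `cudaCheck`, and `scaleOnStream` are hypothetical, the block size of 256 is an assumption, and resources are not released if a call fails.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <stdexcept>
#include <vector>

// Kernel: multiply each element in place by a constant factor.
__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: turn a CUDA error code into a C++ exception.
static void cudaCheck(cudaError_t err) {
    if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
}

// Hypothetical wrapper: stage data in pinned memory, run copy-in, kernel,
// and copy-out on one stream, then wait on an event.
std::vector<float> scaleOnStream(const std::vector<float>& in, float factor) {
    const int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float* hBuf = nullptr;  // page-locked host staging buffer
    float* dBuf = nullptr;  // device buffer
    cudaCheck(cudaMallocHost(&hBuf, bytes));
    cudaCheck(cudaMalloc(&dBuf, bytes));
    std::copy(in.begin(), in.end(), hBuf);

    cudaStream_t stream;
    cudaEvent_t done;
    cudaCheck(cudaStreamCreate(&stream));
    cudaCheck(cudaEventCreate(&done));

    // All three operations are queued on the stream and return immediately.
    cudaCheck(cudaMemcpyAsync(dBuf, hBuf, bytes, cudaMemcpyHostToDevice, stream));
    const int threads = 256;  // assumed block size
    const int blocks = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads, 0, stream>>>(dBuf, factor, n);
    cudaCheck(cudaGetLastError());
    cudaCheck(cudaMemcpyAsync(hBuf, dBuf, bytes, cudaMemcpyDeviceToHost, stream));
    cudaCheck(cudaEventRecord(done, stream));

    // The host is free to do unrelated work here before it waits.
    cudaCheck(cudaEventSynchronize(done));

    std::copy(hBuf, hBuf + n, out.begin());
    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    cudaFree(dBuf);
    cudaFreeHost(hBuf);
    return out;
}
```

Pinned memory matters here because `cudaMemcpyAsync` can only overlap with host work when the host buffer is page-locked. Recording an event after the last queued operation lets the host wait for this stream's work alone rather than synchronizing the whole device.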

Lessons