Programming Massively Parallel Processors: A Hands-on Approach
This course provides a comprehensive introduction to GPU computing and parallel programming using the CUDA C environment. It covers GPU architectures, data parallelism, thread management, memory optimization, and advanced performance considerations, illustrated through real-world case studies like MRI reconstruction and molecular visualization.
Lessons
This lesson explores the evolution of parallel computing, highlighting the "Great Divergence" in which GPUs pulled ahead of CPUs in peak throughput by favoring throughput-oriented design over low single-thread latency. Students will learn to differentiate between these processing models, understand the impact of the "Power Wall" on CPU design, and analyze how GPU transistor budgeting enables massive parallel computation.
This lesson explores the evolution of GPU architecture, focusing on the "real-time imperative" that necessitated a shift from serial CPU processing to parallel hardware acceleration. Students will learn how early innovations like SLI and the "wide and slow" design philosophy enabled the high-throughput performance required to meet strict frame-time budgets in modern computing.
This lesson explores the CUDA execution model, focusing on the architectural differences between the latency-optimized CPU (Host) and the throughput-optimized GPU (Device). Students will learn how to manage the lifecycle of a CUDA kernel, implement memory allocation using cudaMalloc and cudaMemcpy, and organize threads into grids and blocks to perform parallel computations.
This lesson explores the fundamentals of CUDA kernel execution, focusing on the transition from CPU-based iteration to data-centric GPU parallelism. Students will learn to implement the global indexing formula, manage execution configurations for transparent scalability, and apply boundary guards to ensure safe memory access across multidimensional data.
This lesson explores the "Memory Wall" in GPU computing, where computational throughput outpaces memory bandwidth, creating a significant performance bottleneck. Students will learn to mitigate these constraints by implementing shared memory tiling strategies, optimizing data reuse, and managing hardware resource limits to maximize occupancy.
This lesson explores the SIMT execution model, focusing on how hardware organizes threads into 32-thread warps and linearizes them for efficient scheduling. Students will learn to evaluate performance through warp partitioning, branch divergence analysis, and memory access patterns to optimize GPU kernel utilization.
This lesson explores how Excess Encoding (biased representation) enables high-speed hardware sorting by ensuring that bit patterns maintain a monotonic relationship with their numerical values. By replacing the sign-bit discontinuity of Two's Complement with this biased format, engineers can utilize simple, efficient unsigned comparators to perform rapid operations like Z-buffering in parallel processors.
This lesson explores the computational challenges of non-Cartesian MRI reconstruction, where spiral trajectories require iterative solvers or gridding instead of standard Fast Fourier Transforms. Students will learn how to overcome these bottlenecks by leveraging GPU-based massive parallelism, specifically focusing on voxel-to-thread mapping to optimize reconstruction speed for time-sensitive clinical applications like Sodium MRI.
This lesson explores the use of Direct Coulomb Summation (DCS) and GPU acceleration to generate electrostatic potential maps for molecular visualization. Students will learn to optimize rendering pipelines through techniques like loop unrolling and constant memory broadcasting to efficiently handle large-scale atomic data.
This lesson explores the transition from sequential processing to parallel computing, emphasizing how computational thinking helps overcome the power wall and frequency limits. Students will learn to evaluate parallel algorithm performance, manage the trade-offs between numerical precision and execution speed, and apply problem decomposition to optimize distributed systems.
This lesson introduces the OpenCL framework as a solution for managing heterogeneous computing environments, where a host CPU orchestrates tasks across diverse accelerators like GPUs and FPGAs. Students will learn to utilize the OpenCL platform layer for hardware discovery, understand the device model's hierarchy, and implement portable, efficient kernels that adapt to different architectural requirements.
This lesson explores the evolution of GPU architecture from graphics-focused designs to the compute-first Fermi generation, which introduced unified memory hierarchies and IEEE 754-2008 compliance. Students will learn how these advancements, including hardware-managed caching and improved thread scheduling, enable complex scientific computing and general-purpose programming beyond traditional 2D grid tasks.
Course Overview
📚 Content Summary
Master the art of high-performance parallel computing with a practical, hands-on guide to CUDA and GPU architectures.
Authors: David B. Kirk, Wen-mei W. Hwu
Acknowledgments: Ian Buck, John Nickolls, NVIDIA DevTech team, Jensen Huang, David Luebke, Bill Bean, Simon Green, Mark Harris, Manju Hegde, Nadeem Mohammad, Brent Oster, Peter Shirley, Eric Young, and Cyril Zeller.
🎯 Learning Objectives
- Distinguish between the design philosophies and performance trajectories of multicore CPUs and many-core GPUs.
- Identify the key components of a modern GPU architecture, including Streaming Multiprocessors (SMs) and memory structures.
- Apply Amdahl's Law to calculate theoretical speedup and identify the impact of sequential bottlenecks.
- Contrast the architectural differences between fixed-function pipelines and programmable unified processor arrays.
- Explain the role of "GPGPU" as an intermediate step and the restrictions of early shader programming models.
- Analyze how hardware features like atomic operations, barrier synchronization, and double-precision support enabled the transition to scalable general-purpose computing.
- Identify and exploit data parallelism within matrix-matrix multiplication algorithms.
- Implement device memory management including allocation, data transfer between host and device, and deallocation.
- Construct and launch CUDA kernels using appropriate thread indexing and grid/block configurations.
- Design multidimensional thread hierarchies (Grids and Blocks) to map complex data structures to GPU hardware.
Lessons
Overview: This lesson explores the fundamental shift from sequential to parallel computing, driven by the diverging design philosophies of CPUs and GPUs. Students will examine the "Multicore" versus "Many-core" trajectories, understand the hardware architecture that allows GPUs to achieve massive throughput, and learn the mathematical constraints of speedup via Amdahl's Law.
Learning Outcomes:
- Distinguish between the design philosophies and performance trajectories of multicore CPUs and many-core GPUs.
- Identify the key components of a modern GPU architecture, including Streaming Multiprocessors (SMs) and memory structures.
- Apply Amdahl's Law to calculate theoretical speedup and identify the impact of sequential bottlenecks.
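The Amdahl's Law outcome above can be made concrete with a small calculation. The following is a minimal C sketch, not course material: the function name `amdahl_speedup` and its parameter names are chosen here for illustration. It assumes that a fraction of the original run time is sped up by a given factor while the remainder stays sequential.

```c
#include <assert.h>

/* Amdahl's Law: overall speedup when `parallel_fraction` of the original
   run time is accelerated by `factor` and the rest remains sequential.
   Both names are illustrative, not taken from the course. */
double amdahl_speedup(double parallel_fraction, double factor)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(factor > 0.0);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / factor);
}
```

For example, `amdahl_speedup(0.95, 100.0)` evaluates to roughly 16.8: even with a 100x faster parallel portion, the 5% sequential part caps the overall gain.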
Overview: This lesson traces the architectural journey of the Graphics Processing Unit (GPU) from its origins as specialized fixed-function hardware for rendering triangles to its current state as a powerful, unified, general-purpose parallel processor. Students will explore the shift from rigid graphics pipelines to programmable shaders, the emergence of the GPGPU movement, and the modern scalable architectures that drive current scientific and engineering simulations.
Learning Outcomes:
- Contrast the architectural differences between fixed-function pipelines and programmable unified processor arrays.
- Explain the role of "GPGPU" as an intermediate step and the restrictions of early shader programming models.
- Analyze how hardware features like atomic operations, barrier synchronization, and double-precision support enabled the transition to scalable general-purpose computing.
Overview: This lesson covers the fundamental architecture of a CUDA program, emphasizing the distinction between Host (CPU) and Device (GPU) execution. Students will learn to identify data parallelism in matrix operations, manage separate memory spaces using the CUDA API, and organize parallel execution through a hierarchy of grids, blocks, and threads using the Single-Program, Multiple-Data (SPMD) style.
Learning Outcomes:
- Identify and exploit data parallelism within matrix-matrix multiplication algorithms.
- Implement device memory management including allocation, data transfer between host and device, and deallocation.
- Construct and launch CUDA kernels using appropriate thread indexing and grid/block configurations.
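To illustrate thread indexing and grid/block configuration without requiring a GPU, the sketch below emulates a one-dimensional vector-add launch in plain C on the host. It is an assumption-laden stand-in rather than real CUDA code: the function names are hypothetical, the loops replace the hardware's parallel threads, and the index arguments play the role of `blockIdx.x`, `blockDim.x`, and `threadIdx.x`. A real program would also allocate device buffers with cudaMalloc and copy data with cudaMemcpy, which this sketch omits.

```c
#include <assert.h>

/* Ceiling division: the smallest number of blocks with
   blocks * block_dim >= n (assumes n + block_dim does not overflow). */
unsigned int blocks_for(unsigned int n, unsigned int block_dim)
{
    assert(block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}

/* Work done by one thread of a vector-add kernel. */
static void vec_add_thread(unsigned int block_idx, unsigned int block_dim,
                           unsigned int thread_idx,
                           const float *a, const float *b, float *c,
                           unsigned int n)
{
    unsigned int i = block_idx * block_dim + thread_idx;  /* global index */
    if (i < n) {                                          /* boundary guard */
        c[i] = a[i] + b[i];
    }
}

/* Host-side emulation of a 1D launch: the two loops visit every
   (block, thread) pair that a grid of blocks_for(n, block_dim) blocks
   would create. */
void vec_add_emulated(const float *a, const float *b, float *c,
                      unsigned int n, unsigned int block_dim)
{
    unsigned int grid_dim = blocks_for(n, block_dim);
    for (unsigned int blk = 0; blk < grid_dim; ++blk) {
        for (unsigned int t = 0; t < block_dim; ++t) {
            vec_add_thread(blk, block_dim, t, a, b, c, n);
        }
    }
}
```

Because the grid is rounded up to a whole number of blocks, the last block usually contains threads whose index is at or beyond `n`; the guard is what keeps those threads from touching memory outside the arrays.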
Overview: This lesson explores the hierarchical organization of threads in CUDA, focusing on how multidimensional indexing maps to physical data and hardware resources. It details the mechanisms of barrier synchronization and transparent scalability, concluding with the architectural principles of thread assignment and warp-based scheduling used to achieve latency tolerance in high-performance computing.
Learning Outcomes:
- Design multidimensional thread hierarchies (Grids and Blocks) to map complex data structures to GPU hardware.
- Implement precise data indexing using built-in CUDA variables (blockIdx, threadIdx, blockDim).
- Apply barrier synchronization to ensure data integrity while maintaining transparent scalability across different GPU architectures.
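The mapping from a two-dimensional thread position to a location in row-major data can be sketched in plain C as follows. The struct `idx2` and the function `element_offset` are hypothetical names introduced for this illustration; the struct merely mimics the `.x` and `.y` fields of CUDA's built-in index variables.

```c
#include <assert.h>

/* Hypothetical stand-in for the .x/.y fields of blockIdx, blockDim,
   and threadIdx. */
typedef struct { unsigned int x, y; } idx2;

/* Returns the row-major offset of the element handled by one thread,
   or -1 when the thread falls outside the height x width matrix. */
long long element_offset(idx2 block_idx, idx2 block_dim, idx2 thread_idx,
                         unsigned int width, unsigned int height)
{
    assert(block_dim.x > 0 && block_dim.y > 0);
    unsigned int col = block_idx.x * block_dim.x + thread_idx.x;
    unsigned int row = block_idx.y * block_dim.y + thread_idx.y;
    if (row >= height || col >= width) {
        return -1;                      /* thread has no element to process */
    }
    return (long long)row * width + col;
}
```

With 16x16 blocks on a matrix 100 elements wide and 64 high, the thread at position (3, 5) in block (1, 2) lands on column 19, row 37, which is offset 3719. A thread in block (6, 0) at position (4, 0) lands on column 100, outside the matrix, so it is filtered out.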
Overview: This lesson explores how memory bandwidth and resource constraints act as primary bottlenecks in parallel computing. It details the use of "tiling" to reduce global memory traffic and explains the critical role of synchronization barriers (__syncthreads()) and the strategic choice between registers and shared memory to optimize performance.
Learning Outcomes:
- Analyze how register and shared memory limits determine the level of parallelism (occupancy) in a kernel.
- Quantify the reduction in global memory bandwidth consumption achieved through tiling techniques.
- Identify the necessity of synchronization functions to maintain data integrity during shared memory access.
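One way to see the bandwidth saving from tiling is to count loads. The sketch below emulates a tiled square matrix multiply in plain C on the host and counts every read from the input arrays, which stand in for global memory. The tile width of 4, the function name `tiled_matmul`, and the local arrays that play the role of shared memory are all assumptions made for this illustration, and the matrix size must be a multiple of the tile width.

```c
#include <assert.h>

#define TILE 4  /* assumed tile width for this sketch */

/* Emulates a tiled n x n multiply C = A * B and returns the number of
   element loads from the "global" arrays A and B. */
unsigned long tiled_matmul(const float *A, const float *B, float *C, int n)
{
    unsigned long global_loads = 0;
    assert(n > 0 && n % TILE == 0);

    for (int by = 0; by < n; by += TILE) {        /* one "block" per output tile */
        for (int bx = 0; bx < n; bx += TILE) {
            float acc[TILE][TILE] = {{0}};        /* per-thread accumulators */

            for (int ph = 0; ph < n; ph += TILE) {
                float As[TILE][TILE], Bs[TILE][TILE];  /* shared-memory stand-ins */

                /* Load phase: each "thread" (ty, tx) fetches one element of
                   A and one of B. A GPU kernel would call __syncthreads()
                   after this loop nest. */
                for (int ty = 0; ty < TILE; ++ty) {
                    for (int tx = 0; tx < TILE; ++tx) {
                        As[ty][tx] = A[(by + ty) * n + (ph + tx)];
                        Bs[ty][tx] = B[(ph + ty) * n + (bx + tx)];
                        global_loads += 2;
                    }
                }

                /* Compute phase: all reads come from the tiles. A second
                   __syncthreads() would follow before the next load phase. */
                for (int ty = 0; ty < TILE; ++ty)
                    for (int tx = 0; tx < TILE; ++tx)
                        for (int k = 0; k < TILE; ++k)
                            acc[ty][tx] += As[ty][k] * Bs[k][tx];
            }

            for (int ty = 0; ty < TILE; ++ty)
                for (int tx = 0; tx < TILE; ++tx)
                    C[(by + ty) * n + (bx + tx)] = acc[ty][tx];
        }
    }
    return global_loads;
}
```

The returned count is 2n³/TILE. An untiled kernel in which each of the n² threads reads a full row of A and a full column of B would issue 2n³ loads, so under this simple model (which ignores caches) the traffic shrinks by a factor equal to the tile width.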
Overview: This lesson explores the architectural and algorithmic considerations essential for optimizing CUDA kernels. It transitions from basic execution models—specifically the Single-Instruction, Multiple-Thread (SIMT) unit and warp partitioning—to advanced performance tuning techniques including memory coalescing, tiled matrix multiplication, and the dynamic partitioning of Streaming Multiprocessor (SM) resources.
Learning Outcomes:
- Analyze the mapping of multi-dimensional thread blocks to the hardware’s linear warp execution order.
- Evaluate and minimize control flow divergence in parallel reduction algorithms.
- Optimize global memory bandwidth by implementing memory coalescing and tiled data access patterns.
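The mapping of a multi-dimensional block onto linear warp order can be worked through with a few lines of C. This is a sketch under stated assumptions: a warp size of 32, threads linearized with x varying fastest, then y, then z, and hypothetical helper names (`dim3_like`, `linear_thread_id`, `warp_of`, `lane_of`, `warps_per_block`) that are not part of the CUDA API.

```c
#include <assert.h>

#define WARP_SIZE 32u

/* Hypothetical stand-in for CUDA's dim3 / threadIdx triple. */
typedef struct { unsigned int x, y, z; } dim3_like;

/* Linearized thread index inside a block: x varies fastest, then y, then z. */
unsigned int linear_thread_id(dim3_like thread_idx, dim3_like block_dim)
{
    assert(thread_idx.x < block_dim.x);
    assert(thread_idx.y < block_dim.y);
    assert(thread_idx.z < block_dim.z);
    return thread_idx.x
         + thread_idx.y * block_dim.x
         + thread_idx.z * block_dim.x * block_dim.y;
}

/* Warp that a thread belongs to within its block. */
unsigned int warp_of(dim3_like thread_idx, dim3_like block_dim)
{
    return linear_thread_id(thread_idx, block_dim) / WARP_SIZE;
}

/* Position of a thread within its warp. */
unsigned int lane_of(dim3_like thread_idx, dim3_like block_dim)
{
    return linear_thread_id(thread_idx, block_dim) % WARP_SIZE;
}

/* Number of warps needed to cover a block; the last one may be partly idle. */
unsigned int warps_per_block(dim3_like block_dim)
{
    unsigned int threads = block_dim.x * block_dim.y * block_dim.z;
    return (threads + WARP_SIZE - 1) / WARP_SIZE;
}
```

A 16x16 block yields 8 warps, each covering two consecutive rows of 16 threads. A 10x10 block has 100 threads and still occupies 4 warps, with only 4 threads active in the last one, which is one reason block dimensions are usually chosen so the thread count is a multiple of 32.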
Overview: This lesson covers the fundamental architecture of floating-point numbers, focusing on the IEEE 754 standard components: sign, excess-encoded exponent, and normalized mantissa. Students will explore how these bit patterns map to a discrete number line and how the limitations of this representation affect the accuracy of complex algorithms like large-scale summations.
Learning Outcomes:
- Deconstruct the floating-point format to calculate numeric values from bit patterns using normalized representation and excess encoding.
- Visualize the distribution of representable numbers on a number line and explain the impact of bit allocation between exponent and mantissa.
- Quantify numerical inaccuracy using ULP and identify how different rounding modes contribute to error.
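The decomposition of a bit pattern into sign, excess-encoded exponent, and mantissa can be sketched in C for the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits with a bias of 127, 23 mantissa bits). The type and function names below are illustrative choices, and the value function handles only normalized patterns; denormals, infinities, and NaNs are deliberately excluded.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned int sign;      /* 0 for positive, 1 for negative */
    int          exponent;  /* true exponent, with the excess-127 bias removed */
    uint32_t     mantissa;  /* 23 stored fraction bits */
} fp32_fields;

/* Splits a 32-bit pattern into its three IEEE 754 single-precision fields. */
fp32_fields fp32_split(uint32_t bits)
{
    fp32_fields f;
    f.sign     = bits >> 31;
    f.exponent = (int)((bits >> 23) & 0xFFu) - 127;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}

/* Value of a normalized pattern: (-1)^S * 1.M * 2^(E - 127). */
double fp32_normalized_value(uint32_t bits)
{
    fp32_fields f = fp32_split(bits);
    /* Stored exponents 0 and 255 are reserved (denormals/zero, Inf/NaN). */
    assert(f.exponent > -127 && f.exponent < 128);

    double v = 1.0 + (double)f.mantissa / 8388608.0;  /* implied 1, then M / 2^23 */
    for (int e = f.exponent; e > 0; --e) v *= 2.0;
    for (int e = f.exponent; e < 0; ++e) v /= 2.0;
    return f.sign ? -v : v;
}
```

For instance, the pattern `0x40400000` has a stored exponent of 128 (true exponent 1) and a mantissa of one half, so it decodes to 1.5 × 2 = 3.0. Because the exponent is stored in excess form, larger positive values have larger bit patterns when compared as unsigned integers.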
Overview: This lesson explores the parallelization of advanced Magnetic Resonance Imaging (MRI) reconstruction on GPUs. It focuses on the iterative reconstruction process for non-Cartesian trajectories, specifically optimizing the computationally intensive F^H d kernel through loop transformations, constant memory management, data layout reorganization, and the use of hardware-accelerated trigonometric functions.
Learning Outcomes:
- Understand the transition from Cartesian FFT-based reconstruction to iterative linear-solver-based algorithms for non-Cartesian k-space data.
- Apply loop fission and loop interchange to transform sequential C code into a structure suitable for massive CUDA thread mapping.
- Optimize memory throughput using constant memory chunking and Array-of-Structs (AoS) data layouts.
Overview: This lesson explores the practical application of GPU computing to molecular visualization, specifically using the Direct Coulomb Summation (DCS) method to calculate electrostatic potential maps. Students will progress from a basic kernel implementation to highly optimized versions that leverage instruction unrolling, memory coalescing, and padding.
Learning Outcomes:
- Implement a Direct Coulomb Summation (DCS) kernel using CUDA constant memory and global memory latency-hiding techniques.
- Optimize kernel performance through instruction unrolling and the reuse of common coordinate calculations.
- Apply memory coalescing and padding strategies to align GPU global memory accesses for maximum bandwidth.
Overview: This lesson explores the transition from sequential thinking to parallel problem-solving by focusing on the goals of parallel programming and the strategic selection of algorithms. Students will learn to decompose problems into parallelizable units, apply computational thinking to bridge the gap between domain science and hardware architecture, and evaluate algorithm performance.
Learning Outcomes:
- Identify the primary goals of parallel programming and calculate theoretical speedup using Amdahl’s Law.
- Differentiate between task-level and data-level decomposition and apply atom-centric (scatter) versus grid-centric (gather) strategies.
- Evaluate and select parallel algorithms based on criteria such as memory bandwidth, computational complexity, and architectural constraints.
Overview: This lesson introduces OpenCL as a framework for heterogeneous parallel computing, focusing on its data parallelism model and hierarchical hardware abstraction. Students will learn to map OpenCL’s NDRange and memory structures to CUDA equivalents and master the host-side management of devices through a dynamic compilation model.
Learning Outcomes:
- Map OpenCL parallelism and memory hierarchies to CUDA-specific architectures (e.g., mapping Work-groups to Blocks and Local Memory to Shared Memory).
- Implement OpenCL kernel functions and manage the host-side execution environment using Contexts and Command Queues.
- Execute the dynamic compilation workflow to build kernels from source code at runtime.
Overview: This lesson explores the architectural and functional evolution of GPUs, focusing on the transition toward sophisticated memory management, enhanced kernel execution capabilities, and increased core performance. Students will examine how features like Unified Device Memory Space and kernel-level function calls transition the GPU into a general-purpose processor.
Learning Outcomes:
- Explain the significance of Memory Architecture Evolution and the move toward a 64-bit Unified Device Memory Space.
- Analyze how Enhanced Atomic Operations and kernel-level Function Calls enable the implementation of complex data structures and algorithms.
- Evaluate the performance impacts of Simultaneous Kernel Execution, Double-Precision Speed improvements, and Control Flow Efficiency in modern GPU environments.