Introduction to ROCm and HIP Programming: A Practical Tutorial
A practical, modern guide to AMD GPU programming with ROCm and HIP. It covers the full software stack, installation, build workflows, kernel programming, memory management, performance engineering, library usage, CUDA porting, and production debugging practices.
Lessons
This lesson introduces the ROCm platform and the HIP programming model as a bridge for porting CUDA applications to AMD hardware. Students will learn how to use automated tools like hipify to migrate code while understanding the importance of architecture-aware tuning to achieve optimal performance.
This lesson covers the essential steps for installing and configuring the ROCm software stack, including dependency management, environment variable setup, and user permission requirements. Students will learn how to verify their system environment and ensure successful hardware-software communication through diagnostic tools and proper configuration.
This lesson explores the distinction between source portability and binary performance in the ROCm ecosystem, emphasizing that while HIP code is functionally portable, achieving peak throughput requires architecture-specific compilation. Students will learn to utilize the hipcc toolchain and CMake to manage build configurations that optimize code for specific hardware instruction sets.
This lesson introduces the HIP programming model, focusing on the transition from sequential CPU iteration to spatial GPU parallelism using the Parallel Pivot approach. Students will learn to map independent data tasks to thread grids, manage memory, and implement kernel execution with proper boundary checks and error handling.
This lesson (AI024, Lesson 5: Memory Management and Data Patterns) explores the memory-centric nature of GPU performance, focusing on the Roofline Model and the critical importance of minimizing data movement between host and device. Students will learn to distinguish between memory-bound and compute-bound kernels while mastering strategies to optimize data residence and bandwidth utilization.
This lesson explores the transition from synchronous to asynchronous GPU execution, focusing on how to use HIP streams to decouple CPU and GPU tasks. Students will learn to optimize performance by implementing non-blocking memory transfers and kernel launches to maximize hardware utilization and eliminate execution bottlenecks.
This lesson introduces a systematic, data-driven approach to performance engineering on AMD GPUs, emphasizing the use of tools like rocprofv3 to identify bottlenecks rather than relying on intuition. Students will learn to follow a six-step scientific workflow to optimize memory access, instruction throughput, and hardware utilization while avoiding common performance "superstitions."
This lesson introduces the Library-First Engineering Principle, which emphasizes using optimized ROCm libraries like rocBLAS and rocFFT to reduce technical debt and ensure hardware portability. Students will learn to prioritize these vendor-tuned solutions over custom kernel development to achieve better performance and easier maintenance across evolving GPU architectures.
This lesson (AI024, Lesson 9: Porting CUDA Applications to HIP) covers the systematic, incremental migration of CUDA code to the HIP platform using tools like hipify-clang and hipify-perl. Students will learn to distinguish between mechanical API translations and architectural optimizations, such as adjusting for warp-size differences, to ensure functional and performance parity on AMD ROCm hardware.
This lesson explores the GPU Developer’s Creed, which prioritizes functional correctness and architectural isolation over raw performance when working with ROCm and HIP. Students will learn to implement systematic debugging, testing, and CI/CD practices to ensure stable, reproducible, and accurate GPU kernel deployments.
Course Overview
📚 Content Summary
Master AMD GPU programming and CUDA-to-HIP portability with this technical deep dive.
Author: EvoClass
Acknowledgments: Based on AMD's official ROCm and HIP documentation, including the ROCm, HIP, and ROCm LLVM projects.
🎯 Learning Objectives
- Define HIP and its role within the ROCm ecosystem in a single concise sentence.
- Distinguish between ROCm (platform), HIP (interface), and ROCm libraries (building blocks).
- Identify the hierarchical layers of the ROCm architecture from hardware to application frameworks.
- Define the relationship between the HIP SDK and the ROCm platform across different operating systems.
- Execute a systematic installation workflow, including support matrix verification and post-installation path configuration.
- Compile and run a minimal verification program to troubleshoot common driver and environment access issues.
- Understand why a robust build strategy is essential for reconciling source portability with architecture-specific performance.
- Implement portable kernel launches using the hipLaunchKernelGGL macro as an alternative to CUDA's triple-angle-bracket syntax.
- Configure production-grade CMake projects that target specific ROCm architectures and manage external library dependencies.
- Define the anatomy of a HIP kernel and apply the basic execution formula for thread indexing.
Lessons
Overview: This lesson provides a foundational overview of the ROCm platform and the HIP programming language. It clarifies the relationship between the full ROCm stack, the HIP interface, and high-level libraries, while establishing realistic expectations for CUDA-to-AMD portability and performance engineering.
Learning Outcomes:
- Define HIP and its role within the ROCm ecosystem in a single concise sentence.
- Distinguish between ROCm (platform), HIP (interface), and ROCm libraries (building blocks).
- Identify the hierarchical layers of the ROCm architecture from hardware to application frameworks.
Overview: This lesson guides GPU developers and HPC engineers through the essential strategies for setting up a HIP-ready environment on both Linux and Windows platforms. It emphasizes a "platform reality" approach where developers must verify hardware/software compatibility before proceeding with a structured installation workflow and final verification using the hipcc compiler.
Learning Outcomes:
- Define the relationship between the HIP SDK and the ROCm platform across different operating systems.
- Execute a systematic installation workflow, including support matrix verification and post-installation path configuration.
- Compile and run a minimal verification program to troubleshoot common driver and environment access issues.
Overview: This lesson explores the essential toolchain and organizational strategies for developing HIP applications on AMD hardware. It transitions the developer from simple command-line builds using the hipcc driver to professional, production-ready project configurations using CMake. Key focus areas include portable kernel launch macros, architecture-specific optimization, and the critical distinction between source-level portability and binary performance.
Learning Outcomes:
- Understand why a robust build strategy is essential for reconciling source portability with architecture-specific performance.
- Implement portable kernel launches using the hipLaunchKernelGGL macro as an alternative to CUDA's triple-angle-bracket syntax.
- Configure production-grade CMake projects that target specific ROCm architectures and manage external library dependencies.
Overview: This lesson explores the fundamental architecture of HIP kernels, focusing on how work is mapped from logical problems to hardware execution through grids and blocks. It provides a blueprint for robust GPU programming, covering the essential execution formula, performance bottlenecks (memory vs. compute), and the mandatory implementation of error-checking and synchronization for production-ready code.
Learning Outcomes:
- Define the anatomy of a HIP kernel and apply the basic execution formula for thread indexing.
- Configure grid and block sizes effectively and implement benchmarking to find optimal throughput.
- Implement robust error-handling macros and apply synchronization semantics to manage device-host interaction.
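The execution formula and boundary check from this lesson can be illustrated without a GPU. The following is a minimal host-side sketch in plain C++, not actual HIP device code: `LaunchPoint`, `global_index`, `blocks_for`, and `emulate_scale` are hypothetical names chosen for illustration, and the nested loops only stand in for the threads a real launch would run in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-ins for the HIP built-ins blockIdx.x, blockDim.x, threadIdx.x.
struct LaunchPoint {
    unsigned block_idx;
    unsigned block_dim;
    unsigned thread_idx;
};

// Basic execution formula: each thread derives one global element index.
inline std::size_t global_index(const LaunchPoint& p) {
    return static_cast<std::size_t>(p.block_idx) * p.block_dim + p.thread_idx;
}

// Ceiling division so that the grid covers all n elements.
inline unsigned blocks_for(std::size_t n, unsigned block_dim) {
    assert(block_dim > 0);
    return static_cast<unsigned>((n + block_dim - 1) / block_dim);
}

// Sequentially emulates a "scale every element" kernel launch.
inline void emulate_scale(std::vector<float>& data, float factor, unsigned block_dim) {
    const std::size_t n = data.size();
    const unsigned grid_dim = blocks_for(n, block_dim);
    for (unsigned b = 0; b < grid_dim; ++b) {
        for (unsigned t = 0; t < block_dim; ++t) {
            const std::size_t i = global_index({b, block_dim, t});
            if (i < n) {  // boundary check: the last block may overshoot n
                data[i] *= factor;
            }
        }
    }
}
```

With 1000 elements and a block size of 256, `blocks_for` yields 4 blocks (1024 threads), so the last 24 threads fall outside the array and are filtered out by the boundary check.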
Overview: This lesson focuses on the central pillar of GPU programming: memory management. It covers the categorization of memory types (Pageable, Pinned, Device, and Managed), the performance implications of data transfer mechanisms, and the critical role of memory access patterns—specifically coalescing—in achieving peak performance. Students will learn to balance the ease of use provided by managed memory with the explicit control required for high-performance HPC applications.
Learning Outcomes:
- Differentiate between pageable and pinned host memory and identify when to use each for optimal transfer speed.
- Implement device memory allocation and unified/managed memory using HIP APIs (hipMalloc, hipHostMalloc, hipMallocManaged).
- Analyze memory access patterns to ensure coalesced access and avoid performance bottlenecks like strided access.
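To make the difference between coalesced and strided access concrete, the sketch below counts how many distinct memory segments one group of threads touches for a given stride. It is a simplified host-side model in plain C++, not a description of any particular GPU: `segments_touched` is a hypothetical helper, and the lane count and segment size are parameters supplied by the caller, because the real values depend on the architecture.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts the distinct memory segments touched when `lanes` consecutive threads
// each read one element located at index lane * stride_elems.
// Fewer segments means fewer memory transactions for the same useful data.
inline std::size_t segments_touched(std::size_t stride_elems,
                                    std::size_t elem_bytes,
                                    std::size_t lanes,
                                    std::size_t segment_bytes) {
    assert(segment_bytes > 0);
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < lanes; ++lane) {
        const std::size_t byte_addr = lane * stride_elems * elem_bytes;
        segments.insert(byte_addr / segment_bytes);
    }
    return segments.size();
}
```

Assuming 64 lanes, 4-byte elements, and 128-byte segments (illustrative values only), a unit stride touches 2 segments while a stride of 32 elements touches 64, one per lane. In this model the strided pattern moves 32 times more segments to deliver the same 256 bytes of useful data.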
Overview: This lesson transitions developers from a synchronous programming model to a concurrent mindset, focusing on how to maximize GPU utilization through HIP streams and events. It covers the mechanics of overlapping data transfers with kernel execution via chunked pipelines and introduces the trade-offs between stream capture and explicit graph construction. Additionally, it highlights critical production considerations, including the use of graph-safe libraries and high-precision timing on the GPU.
Learning Outcomes:
- Identify the performance benefits of asynchronous execution and concurrent streams over synchronous execution.
- Implement chunked pipelines to overlap host-to-device communication with kernel computation.
- Differentiate between stream capture and explicit graph construction for reducing launch overhead.
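A chunked pipeline starts with a plan that splits the buffer into pieces and assigns each piece to a stream. The sketch below shows only that planning step in plain host C++; `Chunk` and `plan_chunks` are hypothetical names, and the round-robin assignment is one simple policy among several, not a prescribed one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Chunk {
    std::size_t offset;  // index of the first element in the chunk
    std::size_t count;   // number of elements in the chunk
    std::size_t stream;  // index of the stream that would process the chunk
};

// Splits n elements into chunks of at most chunk_elems and assigns them to
// streams round-robin. The final chunk may be shorter than the others.
inline std::vector<Chunk> plan_chunks(std::size_t n,
                                      std::size_t chunk_elems,
                                      std::size_t num_streams) {
    assert(chunk_elems > 0 && num_streams > 0);
    std::vector<Chunk> plan;
    std::size_t index = 0;
    for (std::size_t offset = 0; offset < n; offset += chunk_elems, ++index) {
        plan.push_back({offset, std::min(chunk_elems, n - offset), index % num_streams});
    }
    return plan;
}
```

In a HIP program, each planned chunk would typically be issued on its stream as an asynchronous host-to-device copy (for example with hipMemcpyAsync), a kernel launch, and an asynchronous copy back, so that the copy for one chunk can overlap with the kernel for another. Overlap generally also requires pinned host memory.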
Overview: This lesson establishes a scientific framework for optimizing software on AMD hardware, moving beyond guesswork to a systematic, measurement-driven approach. It covers the architectural relationship between Compute Units, wavefronts, and register pressure, while providing practical methodologies for profiling with rocprofv3 and implementing robust benchmarking skeletons.
Learning Outcomes:
- Implement the 6-step HIP optimization workflow to identify and resolve performance bottlenecks.
- Analyze the trade-off between register pressure and occupancy to maximize hardware utilization.
- Execute accurate performance measurements using hardware events and multi-iteration benchmarking best practices.
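A benchmarking skeleton built on multi-iteration measurement might look like the sketch below. It is an assumption-laden stand-in: it times a callable on the host with `std::chrono` rather than with GPU hardware events, and `median_ms` is a hypothetical helper name. The structure (warm-up runs, several timed runs, a robust statistic) is the part that carries over to GPU code.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs `work` a few times untimed, then times it repeatedly and returns the
// middle sample in milliseconds (the upper median when the count is even).
template <typename Fn>
double median_ms(Fn&& work, int warmup_iters, int timed_iters) {
    assert(timed_iters > 0);
    for (int i = 0; i < warmup_iters; ++i) {
        work();  // warm-up: absorbs one-time costs such as lazy initialization
    }
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(timed_iters));
    for (int i = 0; i < timed_iters; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

When the callable launches a kernel, host timers alone are misleading because launches are asynchronous. The timed region would then be bracketed with events (hipEventRecord, hipEventSynchronize, hipEventElapsedTime) or an explicit device synchronization.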
Overview: This lesson introduces the "Library-first" engineering philosophy, prioritizing high-performance, pre-built ROCm libraries over custom kernel development. It covers the categorization of the ROCm library stack (Math, FFT, Primitives, and ML/AI) and provides a decision framework for choosing between portable hip* interfaces and AMD-native roc* implementations. Additionally, learners will explore the critical requirements for "graph safety" when integrating libraries into HIP graph-captured workflows.
Learning Outcomes:
- Apply the "Library-first" engineering principle to justify the use of pre-tested primitives over custom kernels.
- Distinguish between hip* and roc* libraries based on portability requirements and performance needs.
- Categorize ROCm libraries into their respective functional domains (Math, FFT, Primitives, ML/AI).
Overview: This lesson covers the systematic transition of CUDA source code to the portable HIP C++ framework. Students will learn to execute an incremental porting workflow using automated tools like hipify-perl and hipify-clang, identify critical portability traps such as hardware-specific warpSize assumptions, and implement a rigorous validation process to compare performance and correctness post-migration.
Learning Outcomes:
- Execute the 6-step incremental porting workflow to minimize debugging overhead.
- Select and apply the appropriate automated translation tool (hipify-perl vs. hipify-clang) based on source code complexity.
- Identify and resolve architecture-sensitive "portability traps," specifically those involving warpSize and mechanical translation errors.
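The warpSize trap can be shown with a host-side emulation of a shuffle-down tree reduction. This is a simplified model in plain C++ rather than device code, and `emulate_wave_sum` is a hypothetical helper. It assumes a 64-lane wavefront, which applies to some AMD GPUs but not all, and compares a reduction that starts at half the actual width with one that starts at 16, the value that results from hard-coding a 32-lane warp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates a shuffle-down tree reduction across one wavefront on the host.
// `start_offset` is the first shuffle distance; a width-aware kernel would
// derive it from the actual wavefront width (width / 2) instead of a constant.
inline int emulate_wave_sum(std::vector<int> lanes, std::size_t start_offset) {
    assert(!lanes.empty());
    for (std::size_t offset = start_offset; offset > 0; offset /= 2) {
        std::vector<int> next = lanes;  // all lanes read before any lane writes
        for (std::size_t lane = 0; lane + offset < lanes.size(); ++lane) {
            next[lane] = lanes[lane] + lanes[lane + offset];
        }
        lanes = next;
    }
    return lanes[0];  // lane 0 holds the reduction result
}
```

For 64 lanes that each hold the value 1, starting at offset 32 returns 64, while starting at the hard-coded offset 16 returns 32: the upper half of the wavefront is silently ignored. Mechanical translation does not catch this kind of error, because the code compiles and runs without complaint.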
Overview: This lesson covers the essential tools and methodologies for moving GPU kernels from development to production on the ROCm platform. It details the use of ROCgdb and AddressSanitizer for error detection, establishes a rigorous four-layer testing strategy, and provides a production checklist to ensure kernel correctness and performance stability.
Learning Outcomes:
- Use ROCgdb, ltrace, and AddressSanitizer to identify source-level bugs and memory access errors in GPU code.
- Implement a four-layer testing strategy to validate helpers, kernel correctness, edge cases, and performance regressions.
- Apply production code patterns and checklists to manage kernel interfaces, documentation, and environment-driven debugging.
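For the kernel-correctness layer, a common pattern is to compare GPU output against a trusted CPU reference within a tolerance instead of testing for exact equality. The sketch below is one way to write such a comparison in plain C++; `all_close` is a hypothetical helper, and suitable tolerance values depend on the kernel and data type.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true when the sizes match and every element satisfies
// |actual - expected| <= atol + rtol * |expected|.
inline bool all_close(const std::vector<float>& actual,
                      const std::vector<float>& expected,
                      float rtol, float atol) {
    assert(rtol >= 0.0f && atol >= 0.0f);
    if (actual.size() != expected.size()) {
        return false;
    }
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const float diff = std::fabs(actual[i] - expected[i]);
        // Written as a negated <= so that a NaN difference also fails.
        if (!(diff <= atol + rtol * std::fabs(expected[i]))) {
            return false;
        }
    }
    return true;
}
```

A test would copy the kernel result back to the host, compute the same result with a straightforward CPU loop, and assert `all_close` on the two vectors. Edge-case tests can reuse the helper with empty inputs and with sizes that are not a multiple of the block size.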