Reinforcement Learning: An Introduction
A comprehensive foundational textbook on reinforcement learning, covering key algorithms such as Q-learning, Sarsa, and TD-learning, while bridging the gap between tabular methods and function approximation.
Lessons
Lesson
This lesson introduces reinforcement learning as a goal-directed, trial-and-error approach to learning through interaction with an environment. Students will learn to distinguish this selectional paradigm from supervised learning and identify the core components of an RL system, including the agent-environment interface, rewards, and the exploration-exploitation trade-off.
This lesson introduces the multi-armed bandit problem as a fundamental framework for understanding the exploration-exploitation trade-off in reinforcement learning. Students will learn to implement action-value estimation methods, such as sample averages and epsilon-greedy strategies, to optimize decision-making under uncertainty.
AI029: Finite Markov Decision Processes (Lesson 3) This lesson introduces the agent-environment interface and the Finite Markov Decision Process (MDP) framework, which serves as the mathematical foundation for reinforcement learning. Students will learn to define environment dynamics using the four-argument probability function and explore how the Markov property enables agents to make optimal decisions based on current states.
This lesson introduces Dynamic Programming as a foundational method for solving Markov Decision Processes by leveraging environment models to perform full backups and bootstrapping. Students will learn to implement iterative policy evaluation and value iteration, focusing on how value information propagates through state spaces to achieve optimal solutions.
This lesson introduces Monte Carlo methods as a model-free alternative to dynamic programming, focusing on learning from sampled experience rather than transition probabilities. Students will learn to estimate state-value and action-value functions through first-visit and every-visit methods, while exploring policy iteration and the importance of avoiding bootstrapping to achieve optimal control.
This lesson introduces Temporal-Difference (TD) learning, a model-free approach that bridges the gap between Monte Carlo methods and Dynamic Programming by using bootstrapping to update value estimates incrementally. Students will learn the mechanics of the TD(0) update rule and explore how TD methods enable online learning in continuing tasks by leveraging immediate transitions rather than waiting for final outcomes.
This lesson explores n-step bootstrapping as a method to bridge the gap between 1-step TD and Monte Carlo methods, balancing the bias-variance tradeoff to improve learning efficiency. Students will learn to implement TD(lambda) and understand how the lambda parameter creates a composite return that optimizes information propagation in reinforcement learning tasks.
This lesson introduces the Dyna architecture, a unified framework that integrates direct reinforcement learning with model-based planning by using a shared internal model to simulate experiences. Students will learn how to differentiate between model-based and model-free methods and understand how the same backup algorithms can be applied to both real and simulated data to improve sample efficiency.
This lesson explores the transition from tabular reinforcement learning to function approximation, explaining how parameterized models overcome the curse of dimensionality by generalizing across state spaces. It also introduces the Mean Squared Value Error objective, which uses the on-policy distribution to prioritize learning accuracy in frequently visited states.
This lesson explores the mathematical parallels between biological learning processes and reinforcement learning, specifically focusing on how animal conditioning models and dopaminergic pathways in the brain inform modern temporal-difference (TD) learning. Students will analyze the isomorphism between biological reward prediction errors and computational algorithms to understand how these principles enhance AI agent efficiency and decision-making.
Course Overview
📚 Content Summary
A comprehensive foundational textbook on reinforcement learning, covering key algorithms such as Q-learning, Sarsa, and TD-learning, while bridging the gap between tabular methods and function approximation.
Master the definitive science of goal-directed learning from interaction.
Author: Richard S. Sutton and Andrew G. Barto
Acknowledgments: Supported by the Air Force Office of Scientific Research, the National Science Foundation, and GTE Laboratories.
🎯 Learning Objectives
- Define Reinforcement Learning and distinguish between immediate rewards and long-term value functions.
- Identify and describe the four sub-elements of a reinforcement learning system.
- Apply the Temporal-Difference (TD) update rule to a state-value lookup table.
- Implement incremental update rules for both stationary and nonstationary reward distributions.
- Evaluate the effectiveness of Optimistic Initial Values and Upper-Confidence-Bound (UCB) action selection in promoting exploration.
- Explain the mechanics of Gradient Bandit algorithms using numerical preferences and softmax distributions.
- Define the agent–environment interface and model complex tasks using the MDP framework.
- Calculate expected returns for both episodic and continuing tasks using discounting and unified notation.
- Distinguish between state-value (v_\pi) and action-value (q_\pi) functions and derive them using Bellman equations.
- Perform Policy Evaluation to compute state-value functions for arbitrary policies using iterative methods.
Lessons
Overview: This lesson defines Reinforcement Learning (RL) as a computational approach to learning from interaction, focusing on goal-directed agents. It details the four essential elements of RL systems—policy, reward, value function, and model—and demonstrates these concepts through a Tic-Tac-Toe example using the Temporal-Difference (TD) learning update rule. Finally, it traces the historical roots of RL through the convergence of optimal control (Dynamic Programming) and the psychology of trial-and-error learning (The Law of Effect).
Learning Outcomes:
- Define Reinforcement Learning and distinguish between immediate rewards and long-term value functions.
- Identify and describe the four sub-elements of a reinforcement learning system.
- Apply the Temporal-Difference (TD) update rule to a state-value lookup table.
Overview: This lesson covers advanced strategies for solving the multi-arm bandit problem, focusing on how agents can efficiently estimate action values and balance exploration with exploitation. We transition from simple sample-average methods to incremental implementations that can track nonstationary problems, and explore sophisticated selection mechanisms like UCB, Optimistic Initial Values, and Gradient Bandits. Finally, we introduce the concept of Associative Search (Contextual Bandits) as a bridge toward full Reinforcement Learning.
Learning Outcomes:
- Implement incremental update rules for both stationary and nonstationary reward distributions.
- Evaluate the effectiveness of Optimistic Initial Values and Upper-Confidence-Bound (UCB) action selection in promoting exploration.
- Explain the mechanics of Gradient Bandit algorithms using numerical preferences and softmax distributions.
Overview: This lesson introduces the formal mathematical framework of Finite Markov Decision Processes (MDPs), the standard way to describe Reinforcement Learning problems. It explores the interaction between an agent and its environment, defines the reward hypothesis, and establishes the fundamental Bellman equations used to calculate value functions and identify optimal policies.
Learning Outcomes:
- Define the agent–environment interface and model complex tasks using the MDP framework.
- Calculate expected returns for both episodic and continuing tasks using discounting and unified notation.
- Distinguish between state-value (v_\pi) and action-value (q_\pi) functions and derive them using Bellman equations.
Overview: This lesson covers Dynamic Programming (DP) as a collection of algorithms used to compute optimal policies given a perfect model of the environment as a Markov Decision Process (MDP). It explores the iterative processes of Policy Evaluation and Improvement, the unification of these processes into Policy and Value Iteration, and the flexible framework of Generalized Policy Iteration (GPI), including the computational efficiency and asynchronous variations of these methods.
Learning Outcomes:
- Perform Policy Evaluation to compute state-value functions for arbitrary policies using iterative methods.
- Implement and differentiate between Policy Iteration and Value Iteration to find optimal policies and value functions.
- Understand the mechanics of Asynchronous Dynamic Programming and the conceptual framework of Generalized Policy Iteration (GPI).
Overview: This lesson focuses on Monte Carlo (MC) methods for estimating value functions and discovering optimal policies through sampled experience. We explore the transition from policy evaluation to control, comparing on-policy methods (including epsilon-greedy strategies) and off-policy methods that utilize importance sampling to learn about a target policy from a different behavior policy.
Learning Outcomes:
- Design and implement Monte Carlo Control algorithms with and without Exploring Starts.
- Contrast on-policy and off-policy learning architectures, specifically the role of the "coverage" assumption.
- Calculate and apply Ordinary and Weighted Importance Sampling ratios to adjust for policy divergence.
Overview: This lesson covers Temporal-Difference (TD) learning, a central and novel idea in reinforcement learning. TD learning combines Monte Carlo (MC) methods, which learn from experience, with Dynamic Programming (DP) methods, which bootstrap (update estimates based on other learned estimates). We will explore TD prediction, its optimality in batch settings, and its application to control via Sarsa (on-policy), Q-learning (off-policy), and Expected Sarsa.
Learning Outcomes:
- Define TD(0) prediction and explain the concept of "bootstrapping" in the context of the TD target and error.
- Compare and contrast the optimality of TD(0) and the concept of Certainty Equivalence in batch training.
- Implement and differentiate between Sarsa (on-policy) and Q-learning (off-policy) control algorithms.
Overview: This lesson explores the bridge between n-step TD methods and Monte Carlo techniques through the framework of eligibility traces. It introduces the lambda-return as a theoretical bridge (the Forward View) and Eligibility Traces as a mechanistic implementation (the Backward View), culminating in the study of different trace update rules such as Accumulating, Replacing, and Dutch traces.
Learning Outcomes:
- Define the lambda-return and explain how it averages n-step returns using exponential weighting.
- Contrast the Forward View (theoretical/look-ahead) with the Backward View (mechanistic/look-back) of TD(lambda).
- Implement and compare different eligibility trace mechanisms: Accumulating, Replacing, and Dutch traces.
Overview: This lesson explores the integration of model-based reinforcement learning (planning) and model-free reinforcement learning (learning). It details how agents can use a model of the environment to generate simulated experience for value updates, primarily through the Dyna-Q architecture. Furthermore, the lesson compares different backup strategies (full vs. sample) and examines specialized search techniques such as Prioritized Sweeping, Trajectory Sampling, and Monte Carlo Tree Search to optimize computational efficiency.
Learning Outcomes:
- Differentiate between direct reinforcement learning and model-based planning within the Dyna architecture.
- Evaluate the computational trade-offs between full backups and sample backups based on branching factors.
- Explain how Prioritized Sweeping and Trajectory Sampling focus computation on the most informative state-action updates.
Overview: This lesson explores how Reinforcement Learning scales to large or continuous state spaces by transitioning from tabular representations to Function Approximation. We cover value prediction using gradient-descent methods, various linear feature construction techniques (like Tile Coding and RBFs), and extend these concepts to control via Actor–Critic architectures and the Average-Reward (R-Learning) setting.
Learning Outcomes:
- Define the objective of value prediction using function approximation and the role of the on-policy distribution.
- Compare and implement linear feature construction methods, including Tile Coding, Radial Basis Functions (RBFs), and Kanerva Coding.
- Design and analyze Actor–Critic architectures utilizing eligibility traces for policy and value updates.
Overview: This lesson explores the interdisciplinary convergence of Reinforcement Learning (RL), bridging the gap between biological intelligence (psychology and neuroscience) and complex engineering applications. Students will analyze how temporal-difference (TD) methods scale from classic games like Backgammon to high-dimensional industrial challenges like elevator dispatching, telecommunications, and job-shop scheduling, concluding with a vision for the future of the field.
Learning Outcomes:
- Identify the parallels between RL algorithms and biological learning mechanisms in psychology and neuroscience.
- Analyze the architecture and representation strategies used in classic RL successes such as TD-Gammon and the Acrobot task.
- Formulate real-world stochastic control problems using Semi-Markov Decision Processes (SMDP) in continuous-time environments.