AI029 Postgraduate

Reinforcement Learning: An Introduction

A comprehensive foundational textbook on reinforcement learning, covering key algorithms such as Q-learning, Sarsa, and TD-learning, while bridging the gap between tabular methods and function approximation.

4.7

30.0h

1179 students

Artificial Intelligence

Lessons

1 Lesson 1

This lesson introduces reinforcement learning as a goal-directed, trial-and-error approach to learning through interaction with an environment. Students will learn to distinguish this selectional paradigm from supervised learning and identify the core components of an RL system, including the agent-environment interface, rewards, and the exploration-exploitation trade-off.

2 Lesson 2

This lesson introduces the multi-armed bandit problem as a fundamental framework for understanding the exploration-exploitation trade-off in reinforcement learning. Students will learn to implement action-value estimation methods, such as sample averages and epsilon-greedy strategies, to optimize decision-making under uncertainty.
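
As a rough illustration of the sample-average and epsilon-greedy ideas named above, here is a minimal sketch. The function name `run_bandit`, the Gaussian reward noise with unit variance, and the default parameter values are assumptions made for the example and are not taken from the course material.

```python
import random


def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy action selection with sample-average value estimates.

    true_means is a hypothetical list of mean rewards, one per arm.
    Returns the estimated values and the pull counts for each arm.
    """
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k  # estimated value of each arm
    n = [0] * k    # number of times each arm was pulled
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                   # explore: uniform random arm
        else:
            a = max(range(k), key=lambda i: q[i])  # exploit: greedy arm (ties go to lowest index)
        reward = rng.gauss(true_means[a], 1.0)     # assumed noise model: unit-variance Gaussian
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]             # incremental form of the sample average
    return q, n
```

Calling `run_bandit([0.2, 1.0, -0.5])` returns the value estimates and pull counts; with enough steps the estimate for a frequently pulled arm should approach its true mean.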

3 Lesson 3

This lesson introduces the agent-environment interface and the finite Markov decision process (MDP) framework, which serves as the mathematical foundation for reinforcement learning. Students will learn to define environment dynamics using the four-argument probability function and explore how the Markov property lets an agent base its decisions on the current state alone.
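
For reference, the four-argument dynamics function is commonly written as below. The symbols follow the usual textbook convention and are an assumption here; the lesson itself may use slightly different notation.

```latex
p(s', r \mid s, a) \doteq \Pr\{S_t = s',\, R_t = r \mid S_{t-1} = s,\, A_{t-1} = a\},
\qquad
\sum_{s'} \sum_{r} p(s', r \mid s, a) = 1 \quad \text{for all } s, a .
```

The Markov property is the statement that this distribution depends only on the preceding state and action, not on earlier history.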

4 Lesson 4

This lesson introduces Dynamic Programming as a foundational method for solving Markov Decision Processes by leveraging environment models to perform full backups and bootstrapping. Students will learn to implement iterative policy evaluation and value iteration, focusing on how value information propagates through state spaces to achieve optimal solutions.
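
A minimal sketch of value iteration on a tabular model may help make the "full backup" idea concrete. The model layout (`P[state][action]` as a list of `(probability, next_state, reward)` triples, with terminal states mapped to an empty dict) is an assumption chosen for the example, not a format prescribed by the course.

```python
def value_iteration(P, gamma=0.9, theta=1e-8):
    """In-place value iteration over a known tabular model.

    P[s][a] is a list of (probability, next_state, reward) triples.
    A state with no actions (empty dict) is treated as terminal.
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:
                continue  # terminal state keeps value 0
            # Full backup: expectation over all outcomes, maximized over actions.
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V
```

On a hypothetical two-step chain `{"A": {"go": [(1.0, "B", 0.0)]}, "B": {"go": [(1.0, "T", 1.0)]}, "T": {}}`, the reward at the end propagates backward one state per sweep. Iterative policy evaluation has the same shape, with the maximum over actions replaced by an expectation under the policy being evaluated.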

5 Lesson 5

This lesson introduces Monte Carlo methods as a model-free alternative to dynamic programming, focusing on learning from sampled experience rather than from known transition probabilities. Students will learn to estimate state-value and action-value functions with first-visit and every-visit methods, and to apply generalized policy iteration for Monte Carlo control, noting that these methods do not bootstrap: each estimate is built from complete sampled returns rather than from other estimates.
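
The first-visit idea can be sketched in a few lines. The episode format used here, a list of `(state, reward)` pairs where the reward is the one received on leaving that state, is an assumption for illustration only.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo prediction of state values.

    episodes: list of episodes, each a list of (state, reward) pairs,
    where reward is received on the transition out of that state.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Record the first time step at which each state appears.
        first_index = {}
        for t, (s, _) in enumerate(episode):
            first_index.setdefault(s, t)
        G = 0.0
        # Walk backward so the return G accumulates incrementally.
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = r + gamma * G
            if first_index[s] == t:  # only the first visit contributes
                returns_sum[s] += G
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

An every-visit variant would drop the `first_index` check and average the return from every occurrence of a state.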

6 Lesson 6

This lesson introduces Temporal-Difference (TD) learning, a model-free approach that bridges the gap between Monte Carlo methods and Dynamic Programming by using bootstrapping to update value estimates incrementally. Students will learn the mechanics of the TD(0) update rule and explore how TD methods enable online learning in continuing tasks by leveraging immediate transitions rather than waiting for final outcomes.
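
The TD(0) update mentioned above can be sketched as a single function over a lookup table. The name `td0_update`, the dictionary representation of the value table, and the `terminal` flag are assumptions made for this example.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """Apply one TD(0) update to the value table V in place.

    The target bootstraps from the current estimate of the next state,
    so the update can be made right after a single transition.
    Returns the TD error for inspection.
    """
    next_value = 0.0 if terminal else V[s_next]
    td_error = r + gamma * next_value - V[s]
    V[s] += alpha * td_error
    return td_error
```

For instance, with `V = {"A": 0.5, "B": 1.0}`, a transition from `A` to `B` with reward 0, `alpha=0.1`, and `gamma=0.9` gives a target of 0.9, so the estimate for `A` moves from 0.5 toward it by one tenth of the gap.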

7 Lesson 7

This lesson explores n-step bootstrapping as a method to bridge the gap between 1-step TD and Monte Carlo methods, balancing the bias-variance trade-off to improve learning efficiency. Students will learn to implement TD(λ) and understand how the λ parameter weights n-step returns into a single composite return that speeds the propagation of value information in reinforcement learning tasks.
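
For orientation, the n-step return and the composite λ-return are usually written as follows. The notation is the common textbook convention and is assumed here rather than quoted from the lesson.

```latex
G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n}),
\qquad
G_t^{\lambda} \doteq (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}.
```

Setting n = 1 recovers the 1-step TD target, while letting n run to the end of an episode recovers the Monte Carlo return; λ interpolates between the two by weighting each n-step return geometrically.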

8 Lesson 8

This lesson introduces the Dyna architecture, a unified framework that integrates direct reinforcement learning with model-based planning by using a shared internal model to simulate experiences. Students will learn how to differentiate between model-based and model-free methods and understand how the same backup algorithms can be applied to both real and simulated data to improve sample efficiency.
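
One way to picture the shared-backup idea is a tabular Dyna-Q style step, sketched below. The function name, the deterministic model stored as a dictionary, the dictionary-based Q-table, and the default parameters are all assumptions for illustration; the lesson may present the architecture differently.

```python
import random


def dyna_q_step(Q, model, s, a, r, s_next, actions,
                alpha=0.1, gamma=0.95, planning_steps=5, rng=random):
    """One Dyna-Q style step: direct RL, model learning, then planning.

    Q maps (state, action) to a value; unseen pairs default to 0.
    model maps (state, action) to the last observed (reward, next_state),
    which assumes a deterministic environment.
    """
    def backup(bs, ba, br, bs_next):
        # The same Q-learning backup is used for real and simulated experience.
        best_next = max(Q.get((bs_next, b), 0.0) for b in actions)
        old = Q.get((bs, ba), 0.0)
        Q[(bs, ba)] = old + alpha * (br + gamma * best_next - old)

    backup(s, a, r, s_next)       # direct reinforcement learning from the real transition
    model[(s, a)] = (r, s_next)   # model learning
    for _ in range(planning_steps):
        ps, pa = rng.choice(list(model))  # sample a previously seen state-action pair
        pr, ps_next = model[(ps, pa)]
        backup(ps, pa, pr, ps_next)       # planning from simulated experience
```

Because planning reuses stored transitions, each real interaction can drive several value updates, which is the sense in which sample efficiency improves.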

9 Lesson 9

This lesson explores the transition from tabular reinforcement learning to function approximation, explaining how parameterized models overcome the curse of dimensionality by generalizing across state spaces. It also introduces the Mean Squared Value Error objective, which uses the on-policy distribution to prioritize learning accuracy in frequently visited states.
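
The Mean Squared Value Error objective is typically stated as below, where μ is the on-policy state distribution and w is the parameter vector of the approximate value function. The symbols are the standard ones and are assumed here.

```latex
\overline{\mathrm{VE}}(\mathbf{w}) \doteq \sum_{s \in \mathcal{S}} \mu(s) \left[ v_\pi(s) - \hat{v}(s, \mathbf{w}) \right]^2
```

States that the policy visits often receive a larger weight μ(s), so errors there count more toward the objective.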

10 Lesson 10

This lesson explores the mathematical parallels between biological learning processes and reinforcement learning, focusing on how animal conditioning models and dopaminergic pathways in the brain relate to modern temporal-difference (TD) learning. Students will analyze the correspondence between biological reward prediction errors and the TD error to understand how these principles inform agent design and decision-making.

Course Overview

📚 Content Summary

Master the definitive science of goal-directed learning from interaction.

Author: Richard S. Sutton and Andrew G. Barto

Acknowledgments: Supported by the Air Force Office of Scientific Research, the National Science Foundation, and GTE Laboratories.

🎯 Learning Objectives

Define Reinforcement Learning and distinguish between immediate rewards and long-term value functions.
Identify and describe the four sub-elements of a reinforcement learning system.
Apply the Temporal-Difference (TD) update rule to a state-value lookup table.
Implement incremental update rules for both stationary and nonstationary reward distributions.
Evaluate the effectiveness of Optimistic Initial Values and Upper-Confidence-Bound (UCB) action selection in promoting exploration.
Explain the mechanics of Gradient Bandit algorithms using numerical preferences and softmax distributions.
Define the agent–environment interface and model complex tasks using the MDP framework.
Calculate expected returns for both episodic and continuing tasks using discounting and unified notation.
Distinguish between state-value (v_π) and action-value (q_π) functions and derive them using Bellman equations.
Perform Policy Evaluation to compute state-value functions for arbitrary policies using iterative methods.