# Studying Reinforcement Learning Guide

### Talks to check out first:

  • Introduction to Reinforcement Learning by Joelle Pineau, McGill University:
    – Applications of RL
    – When to use RL
    – RL vs. supervised learning
    – What is an MDP (Markov decision process)?
    – Components of an RL agent: states, actions (with probabilistic effects), the reward function, and the initial-state distribution
    – Explanation of the Markov property
    – Maximizing utility in episodic tasks and in continuing tasks
    – The discount factor, gamma (γ) (a short worked example follows after this list)
    – What a policy is and what to do with it: a policy defines the action-selection strategy at every state
    – Value functions: the value-of-a-policy equations are (two forms of) Bellman's equation
    – Iterative policy evaluation (a dynamic programming algorithm); main idea: turn the Bellman equations into update rules (see the dynamic-programming sketch after this list)
    – Optimal policies and optimal value functions
    – Finding a good policy: policy iteration (see also the Deep Reinforcement Learning talk by Pieter Abbeel below)
    – Finding a good policy: value iteration (also covered in the sketch after this list)
    – Asynchronous value iteration: instead of updating all states on every iteration, focus on the important states
    – Key challenges in RL:
      – Designing the problem domain: state representation, action choice, cost/reward signal
      – Acquiring data for training: exploration vs. exploitation, high-cost actions, time-delayed cost/reward signals
      – Function approximation
      – Validation / confidence measures
    – The RL lingo
    – In large state spaces we need approximation:
      – Fitted Q-iteration: use supervised learning to estimate the Q-function from a batch of training data (input, output, and loss), e.g. on the Arcade Learning Environment (see the sketch after this list)
      – Deep Q-network (DQN) and tips
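
As a quick illustration of the discount factor, the return (utility) of an episodic task is the discounted sum of its rewards; the reward sequence below is made up for the example.

```python
import numpy as np

gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # made-up episode rewards

# Return of an episodic task: G = sum_t gamma^t * r_t.
G = sum(g * r for g, r in zip(gamma ** np.arange(len(rewards)), rewards))
print(G)  # 1 + 0 + 0.81*2 + 0.729*1 = 3.349
```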
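
To make the "turn Bellman equations into update rules" idea concrete, here is a minimal dynamic-programming sketch of iterative policy evaluation, policy iteration, and value iteration. The two-state MDP (its transition probabilities and rewards) is invented purely for illustration.

```python
import numpy as np

# A made-up MDP with 2 states and 2 actions, invented for illustration.
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor

def policy_evaluation(pi, tol=1e-8):
    """Iterative policy evaluation: turn the Bellman expectation equation
    V(s) = R(s, pi(s)) + gamma * sum_s' P(s' | s, pi(s)) V(s')
    into an update rule and iterate it to a fixed point."""
    V = np.zeros(len(P))
    while True:
        V_new = np.array([R[s, pi[s]] + gamma * P[s, pi[s]] @ V
                          for s in range(len(P))])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_iteration():
    """Policy iteration: alternate evaluation with greedy improvement
    until the policy stops changing."""
    pi = np.zeros(len(P), dtype=int)
    while True:
        V = policy_evaluation(pi)
        pi_new = (R + gamma * P @ V).argmax(axis=1)  # greedy w.r.t. Q[s, a]
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new

def value_iteration(tol=1e-8):
    """Value iteration: iterate the Bellman optimality update
    V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) V(s') ].
    (An asynchronous variant would update only selected states per sweep.)"""
    V = np.zeros(len(P))
    while True:
        Q = R + gamma * P @ V  # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new
        V = V_new

print(policy_iteration())  # both methods agree on the optimal policy
print(value_iteration())
```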
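
For the fitted Q-iteration item, here is a minimal sketch: repeatedly regress onto the Bellman targets r + γ·max Q(s', a') using an off-the-shelf supervised learner. The choice of scikit-learn's RandomForestRegressor is my assumption, and the batch of transitions is randomly generated stand-in data, not a real environment.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
gamma = 0.99
n_actions = 2

# A made-up batch of transitions (s, a, r, s'); in practice this
# comes from logged interaction with the environment.
S = rng.normal(size=(1000, 3))                # states (3 features)
A = rng.integers(0, n_actions, size=1000)     # actions
R = rng.normal(size=1000)                     # rewards
S2 = S + rng.normal(scale=0.1, size=S.shape)  # next states

def featurize(S, A):
    """Regressor input: state features plus a one-hot action."""
    return np.hstack([S, np.eye(n_actions)[A]])

# Fitted Q-iteration: each round is one supervised learning problem,
# with targets built from the previous round's fitted Q-function.
Q = None
for k in range(10):
    if Q is None:
        target = R                            # Q_0 targets: immediate reward
    else:
        q_next = np.column_stack(
            [Q.predict(featurize(S2, np.full(len(S2), a)))
             for a in range(n_actions)])
        target = R + gamma * q_next.max(axis=1)
    Q = RandomForestRegressor(n_estimators=50, random_state=0)
    Q.fit(featurize(S, A), target)            # supervised step: input -> target

# Greedy action at a new state, using the final fitted Q-function.
s = rng.normal(size=(1, 3))
q_vals = [Q.predict(featurize(s, np.array([a])))[0] for a in range(n_actions)]
print("greedy action:", int(np.argmax(q_vals)))
```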

  • Deep Reinforcement Learning by Pieter Abbeel:
    – Why Policy Optimization?
    – Cross-Entropy Method (CEM) / finite differences / fixing the random seed (see the sketch after this list)
    – Likelihood Ratio (LR) policy gradient (see the sketch after this list)
    – Natural gradient / trust regions (-> TRPO)
    – Actor-critic (-> GAE, A3C)
    – Path derivatives (PD) (-> DPG, DDPG, SVG)
    – Stochastic computation graphs (generalizes LR / PD)
    – Guided Policy Search (GPS)
    – Inverse reinforcement learning
    – Inverse RL vs. behavioral cloning
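
For the cross-entropy method item, here is a minimal sketch: CEM is derivative-free optimization that samples a population of parameter vectors, keeps the elite fraction, and refits the sampling distribution to those elites. The quadratic objective is a toy stand-in; in RL, f(theta) would roll out the policy parameterized by theta and return the negated episode reward.

```python
import numpy as np

def cem_minimize(f, dim, n_iters=50, pop=100, elite_frac=0.2, seed=0):
    """Cross-entropy method: sample candidates from a diagonal Gaussian,
    keep the best-scoring elites, and refit the Gaussian to them."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = int(pop * elite_frac)
    for _ in range(n_iters):
        thetas = mu + sigma * rng.normal(size=(pop, dim))
        scores = np.array([f(th) for th in thetas])
        elites = thetas[np.argsort(scores)[:n_elite]]   # lowest = best
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu

# Toy objective standing in for negative return; minimum is at theta = 3.
best = cem_minimize(lambda th: np.sum((th - 3.0) ** 2), dim=5)
print(best)  # converges near [3, 3, 3, 3, 3]
```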
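
For the likelihood-ratio policy gradient item, here is a minimal REINFORCE sketch with a softmax policy on a made-up three-armed bandit (a one-step simplification of the sequential case); the arm means, step size, and baseline schedule are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8, 0.5])  # made-up bandit arm reward means
theta = np.zeros(3)                      # policy logits
alpha = 0.1                              # step size
baseline = 0.0                           # running-average reward baseline

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Likelihood-ratio (REINFORCE) gradient: E[ grad log pi(a) * (r - b) ].
# For a softmax policy, grad_theta log pi(a) = onehot(a) - pi.
for step in range(2000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)
    r = rng.normal(true_means[a], 0.1)   # sampled reward
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += alpha * grad_log_pi * (r - baseline)
    baseline += 0.05 * (r - baseline)    # baseline reduces gradient variance

print("learned policy:", softmax(theta))  # concentrates on the best arm
```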

### Books:

### Courses:

  • Reinforcement Learning by David Silver:
    – Lecture 1: Introduction to Reinforcement Learning
    – Lecture 2: Markov Decision Processes
    – Lecture 3: Planning by Dynamic Programming
    – Lecture 4: Model-Free Prediction
    – Lecture 5: Model-Free Control
    – Lecture 6: Value Function Approximation
    – Lecture 7: Policy Gradient Methods
    – Lecture 8: Integrating Learning and Planning
    – Lecture 9: Exploration and Exploitation
    – Lecture 10: Case Study: RL in Classic Games

  • CS294 Deep Reinforcement Learning by John Schulman and Pieter Abbeel:
    – Lecture 1: Intro, derivative-free optimization
    – Lecture 2: Score function gradient estimation and policy gradients
    – Lecture 3: Actor-critic methods
    – Lecture 4: Trust region and natural gradient methods, open problems