### Talks to check out first:
 Introduction to Reinforcement Learning by Joelle Pineau, McGill University:
 Applications of RL.
 When to use RL?
 RL vs supervised learning
 What is MDP? Markov Decision Process
 Components of an RL agent:
 states
 actions (Probabilistic effects)
 Reward function
 Initial state distribution
 Explanation of the Markov Property:
 Why Maximizing utility in:
 Episodic tasks
 Continuing tasks
 The discount factor, gamma γ
 What is the policy & what to do with it?
 A policy defines the actionselection strategy at every state:
 Value functions:
 The value of a policy equations are (two forms of) Bellman’s equation.
 (This is a dynamic programming algorithm).
 Iterative Policy Evaluation:
 Main idea: turn Bellman equations into update rules.
 Optimal policies and optimal value functions.
 Finding a good policy: Policy Iteration (Check the talk Below By Peter Abeel)
 Finding a good policy: Value iteration
 Asynchronous value iteration:
 Instead of updating all states on every iteration, focus on important states.
 Key challenges in RL:
 Designing the problem domain
 State representation
– Action choice
– Cost/reward signal  Acquiring data for training
– Exploration / exploitation
– High cost actions
– Timedelayed cost/reward signal  Function approximation
 Validation / confidence measures
 The RL lingo.
 In large state spaces: Need approximation:
 Fitted Qiteration:
 Use supervised learning to estimate the Qfunction from a batch of training data:
 Input, Output and Loss.
 i.e: The Arcade Learning Environment

Deep Qnetwork (DQN) and tips.
 Why Policy Optimization?
 Cross Entropy Method (CEM) / Finite Differences / Fixing Random Seed
 Likelihood Ratio (LR) Policy Gradient
 Natural Gradient / Trust Regions (> TRPO)
 ActorCritic (> GAE, A3C)
 Path Derivatives (PD) (> DPG, DDPG, SVG)
 Stochastic Computation Graphs (generalizes LR / PD)
 Guided Policy Search (GPS)
 Inverse Reinforcement Learning
 Inverse RL vs. behavioral cloning
### Books:
 Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
 Algorithms for Reinforcement Learning.
 Reinforcement Learning and Dynamic Programming using Function Approximators.
### Courses:
 Reinforcement Learning by David Silver.
 Lecture 1: Introduction to Reinforcement Learning
 Lecture 2: Markov Decision Processes
 Lecture 3: Planning by Dynamic Programming
 Lecture 4: ModelFree Prediction
 Lecture 5: ModelFree Control
 Lecture 6: Value Function Approximation
 Lecture 7: Policy Gradient Methods
 Lecture 8: Integrating Learning and Planning
 Lecture 9: Exploration and Exploitation

Lecture 10: Case Study: RL in Classic Games

CS294 Deep Reinforcement Learning by John Schulman and Pieter Abbeel.
 Lecture 1: intro, derivative free optimization
 Lecture 2: score function gradient estimation and policy gradients
 Lecture 3: actor critic methods
 Lecture 4: trust region and natural gradient methods, open problems