4th n Town

We Research And Develop Software That Matters

A Guide to Studying Reinforcement Learning

### Talks to check out first:

  • Introduction to Reinforcement Learning by Joelle Pineau, McGill University:
    – Applications of RL
    – When to use RL?
    – RL vs. supervised learning
    – What is an MDP (Markov Decision Process)?
    – Components of an RL agent:
      – States
      – Actions (probabilistic effects)
      – Reward function
      – Initial state distribution
    – Explanation of the Markov property
    – Maximizing utility in:
      – Episodic tasks
      – Continuing tasks
    – The discount factor, gamma (γ)
    – What is a policy and what to do with it?
      – A policy defines the action-selection strategy at every state
    – Value functions:
      – The value-of-a-policy equations are (two forms of) Bellman's equation
      – (This is a dynamic programming algorithm)
    – Iterative policy evaluation:
      – Main idea: turn the Bellman equations into update rules
    – Optimal policies and optimal value functions
    – Finding a good policy: policy iteration (see the talk below by Pieter Abbeel)
    – Finding a good policy: value iteration
    – Asynchronous value iteration:
      – Instead of updating all states on every iteration, focus on important states
    – Key challenges in RL:
      – Designing the problem domain:
        – State representation
        – Action choice
        – Cost/reward signal
      – Acquiring data for training:
        – Exploration / exploitation
        – High-cost actions
        – Time-delayed cost/reward signal
      – Function approximation
      – Validation / confidence measures
    – The RL lingo
    – In large state spaces, approximation is needed:
      – Fitted Q-iteration:
        – Use supervised learning to estimate the Q-function from a batch of training data
        – Input, output, and loss
      – e.g., the Arcade Learning Environment
      – Deep Q-network (DQN) and tips
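Several of the topics above (turning the Bellman equations into update rules, value iteration, extracting a greedy policy) can be condensed into one short sketch. The toy MDP below is entirely made up for illustration: two states, two actions, and hypothetical transition probabilities and rewards.

```python
# Minimal value-iteration sketch on a made-up 2-state MDP.
# P[s][a] = list of (probability, next_state, reward) outcomes.
GAMMA = 0.9  # discount factor gamma

P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def value_iteration(P, gamma=GAMMA, tol=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality equation turned into an update rule
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # updating in place gives the asynchronous flavor
        if delta < tol:
            return V

def greedy_policy(P, V, gamma=GAMMA):
    # One-step lookahead: pick the action maximizing expected return
    return {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a]))
        for s in P
    }

V = value_iteration(P)
pi = greedy_policy(P, V)
```

With these numbers, state 1's "stay" loop (reward 2 forever, discounted) converges to V[1] = 2 / (1 − γ) = 20, and the greedy policy is "go" in state 0 and "stay" in state 1.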

  • Deep Reinforcement Learning (the talk by Pieter Abbeel referenced above):
    – Why policy optimization?
    – Cross-Entropy Method (CEM) / finite differences / fixing the random seed
    – Likelihood Ratio (LR) policy gradient
    – Natural gradient / trust regions (-> TRPO)
    – Actor-critic (-> GAE, A3C)
    – Path derivatives (PD) (-> DPG, DDPG, SVG)
    – Stochastic computation graphs (generalize LR / PD)
    – Guided Policy Search (GPS)
    – Inverse reinforcement learning
    – Inverse RL vs. behavioral cloning
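The likelihood-ratio (score-function) policy gradient in the list above can be illustrated in the simplest possible setting: a two-armed bandit. The arm rewards, learning rate, and step count below are arbitrary choices for this sketch, not values from any of the talks.

```python
import math
import random

random.seed(0)

# Score-function (REINFORCE-style) gradient on a 2-armed bandit.
# A single logit theta parameterizes a Bernoulli policy over the arms.
REWARDS = {0: 0.2, 1: 1.0}  # hypothetical expected reward per arm

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

theta = 0.0
lr = 0.1
for step in range(2000):
    p1 = sigmoid(theta)                    # probability of picking arm 1
    a = 1 if random.random() < p1 else 0   # sample an action
    r = REWARDS[a] + random.gauss(0, 0.1)  # noisy reward sample
    # d/dtheta log pi(a | theta) = (a - p1) for a Bernoulli policy
    grad_logp = a - p1
    theta += lr * r * grad_logp            # likelihood-ratio ascent step

final_p = sigmoid(theta)  # should drift toward the better arm (arm 1)
```

The estimator only needs samples of reward times the score function, which is exactly why it scales to settings where the reward is a black box; a baseline (not shown) is the usual variance-reduction trick.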

### Books:

### Courses:

  • Reinforcement Learning by David Silver:
    – Lecture 1: Introduction to Reinforcement Learning
    – Lecture 2: Markov Decision Processes
    – Lecture 3: Planning by Dynamic Programming
    – Lecture 4: Model-Free Prediction
    – Lecture 5: Model-Free Control
    – Lecture 6: Value Function Approximation
    – Lecture 7: Policy Gradient Methods
    – Lecture 8: Integrating Learning and Planning
    – Lecture 9: Exploration and Exploitation
    – Lecture 10: Case Study: RL in Classic Games

  • CS294 Deep Reinforcement Learning by John Schulman and Pieter Abbeel:
    – Lecture 1: Intro, derivative-free optimization
    – Lecture 2: Score function gradient estimation and policy gradients
    – Lecture 3: Actor-critic methods
    – Lecture 4: Trust region and natural gradient methods, open problems

“What is Google Map’s algorithm on recommending routes?”


I love this question, because it's a somewhat open-ended topic, and I have worked on implementing crowdsourced ETA algorithms for given routes in a few of my previous projects.

I have answers to two other versions of this question here:

As Yevgeni pointed out, this is not a public topic, and the open-source solutions out there don't really meet expectations most of the time. Many people working on geolocation systems and graph-algorithm solutions would tell you that the set of algorithms used is not hidden; the exact solutions, though, are not common knowledge either.

  • The difference is the customer usage and delivering a solution for that required outcome.
    Waze is more focused on getting you to point B within a given time, meaning that re-routing at each endpoint of an edge is less problematic, which is why some users love it.
  • Google Maps, by contrast, focuses on giving the user an ETA; following the route with that given ETA is the goal, rather than re-routing to point B as often as Waze does.


  • Regarding congestion in Waze, the weights on the edges balance themselves out as the route is re-calculated along the way.
  • Just as Google Maps can rely on Android phones and other map users, one of Waze's strengths is the network it has as well. Using those live streams of data can avoid causing graph diffusion.

As for finding a specific paper: I don't think there are that many, nor will there be one specific recipe for the best routing algorithm. In my personal work experience, and perhaps somewhat limited knowledge, most graph algorithms can have hybrid implementations that give better outcomes.

Such as:

  • Using one algorithm to find an upper bound while another calculates a lower bound, then producing a balanced outcome.
  • Other tweaks, such as caching the edge weights for faster updates when looking up a route.
  • Machine learning is a big factor for sure, but it's not the ultimate answer for producing a route and an exact ETA for the required destination. Useful factors include:
    – events happening, and the most-taken routes;
    – weather;
    – patterns in city-wide driving habits.
    All of the above give a very good chance of drawing conclusions for a route.
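As a toy illustration of how such factors might feed into an ETA, here is a sketch in which a predicted congestion multiplier adjusts static edge costs. The multiplier values, the rush-hour windows, and the rule structure are entirely made up; a real system would learn these from historical traffic data rather than hard-code them.

```python
# Hypothetical static travel times (minutes) for two road segments.
BASE_MINUTES = {("A", "B"): 5.0, ("B", "C"): 7.0}

def congestion_multiplier(hour, raining):
    """A learned congestion model collapsed into a made-up lookup rule."""
    m = 1.0
    if 7 <= hour < 10 or 16 <= hour < 19:
        m *= 1.5   # rush-hour pattern (city-wide driving habits)
    if raining:
        m *= 1.2   # weather factor
    return m

def eta_minutes(path_edges, hour, raining):
    # Scale the static path cost by the predicted congestion factor
    m = congestion_multiplier(hour, raining)
    return sum(BASE_MINUTES[e] for e in path_edges) * m

route = [("A", "B"), ("B", "C")]
free_flow = eta_minutes(route, hour=0, raining=False)        # midnight, dry
rush_rain = eta_minutes(route, hour=8, raining=True)         # rainy rush hour
print(free_flow)            # → 12.0
print(round(rush_rain, 1))  # → 21.6
```

In practice the multiplier would be applied per edge (congestion is local, not route-wide), which is also why re-computing routes as weights update, as Waze does, pays off.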

You can use some basic simulators to explore further implementations of Dijkstra vs. other algorithms for this topic:

PathFinding.js is an open-source library for grid-based game solutions, but maps can be treated as grids as well, as a simplification, for drawing conclusions about how one specific algorithm can do a better job in a given scenario:


As I mentioned, Waze does a lot more recalculation using the updated weights; here is a thread describing how highway exits and turns are decided as well:

[Page Update] How Waze determines turn/keep/exit maneuvers

Check out Waze's Wiki; they are more open about some of their approaches:

Waze Wiki Page

The usage of the map is the most relevant thing for making a better implementation of a graph algorithm. For example, the following post from Uber's engineering team called out Google Maps ETAs for repeatedly not being as accurate as they wanted:

source: When Google ETAs Fail…

In reality, it works just fine, just not toward the same outcome the Uber product wants.

The gem here is that a better outcome can be achieved with Dijkstra, or with a collection of graph algorithms along the way, depending on the customer's needs. The curse is that it can go wrong in so many ways, which is why starting out with a simple, straightforward shortest path can work really well, then diverging into further hybrid implementations, including predictive models for some parts of the algorithm.
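That simple, straightforward starting point can be sketched in a few lines. The graph below is a made-up toy road network with travel minutes as edge weights; everything else is textbook Dijkstra with a binary heap.

```python
import heapq

# Made-up toy road network: adjacency list of (neighbor, minutes).
GRAPH = {
    "A": [("B", 4), ("C", 2)],
    "B": [("D", 5)],
    "C": [("B", 1), ("D", 8)],
    "D": [],
}

def dijkstra(graph, source, target):
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break                              # target settled: done
        if d > dist.get(u, float("inf")):
            continue                           # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    # Walk predecessor links back from target to reconstruct the path
    path, node = [], target
    while node != source:
        path.append(node)
        node = prev[node]
    path.append(source)
    return list(reversed(path)), dist[target]

path, minutes = dijkstra(GRAPH, "A", "D")
print(path, minutes)   # → ['A', 'C', 'B', 'D'] 8
```

From here, the hybrid tweaks mentioned above slot in naturally: swap the static weights for cached or predicted ones, or add a heuristic lower bound to turn this into A*.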

If you’re interested in this topic, there are some good reads online:

On top of all the factors mentioned, Google still has an advantage over many other solutions due to the technologies it uses. Something like BigQuery, with Dremel underneath it ("which can scan 35 billion rows without an index in tens of seconds"), is absolutely a big push toward brute-forcing some outcomes if all else fails.


Disclaimer: First draft of this post was originally posted on Quora.


About Author: @yadfaeq, Research Software Engineer, 4th n Town Engineering