This demonstrates reinforcement learning. Specifically, it uses
Q-learning to move a player (@) around a fixed maze and avoid traps (^)
while getting treasure ($) as fast as possible.
Add the directory containing srl to PYTHONPATH. Then there are three
ways to run the grid.py program:
srl/grid.py --interactive [--random]: Use the arrow keys to walk
around the maze. The episode ends when you reach a trap or the
treasure. Press space to restart or Q to quit. No learning happens
in this mode. Use --random to generate a random maze instead of
the fixed maze.
srl/grid.py --q [--random]: An ε-greedy Q-learner
repeatedly runs the maze. The parameters are not tuned to learn
quickly. Over the course of several minutes the player first learns
to avoid traps, then to reach the treasure, and eventually to reach
the treasure in the minimum number of steps.
Learning is not saved between runs.
The Q network is not reset between episodes, so it does not
generalize to new random maps. This leads to very poor performance.
srl/all_tests.py: Run the unit tests.
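For orientation, the ε-greedy Q-learning idea behind the --q mode can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in srl: the class name TabularQLearner, the ACTIONS tuple, the string state labels, and the default parameter values are all hypothetical and chosen only for this example.

```python
import random
from collections import defaultdict

# Hypothetical action names; the real program may represent moves differently.
ACTIONS = ("up", "down", "left", "right")


class TabularQLearner:
    """Illustrative tabular Q-learner (an assumption, not srl's QTable)."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, rng=None):
        self.alpha = alpha      # learning rate α
        self.gamma = gamma      # future reward discount γ
        self.epsilon = epsilon  # probability of taking a random action
        self.rng = rng or random.Random()
        # (state, action) -> estimated value; unseen pairs default to 0.0.
        self.q = defaultdict(float)

    def best_value(self, state):
        return max(self.q[(state, a)] for a in ACTIONS)

    def choose(self, state):
        # ε-greedy: explore with probability ε, otherwise act greedily.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, terminal):
        # One-step Q-learning target: r + γ max_a' Q(s', a'),
        # or just r when the episode ended (trap or treasure).
        if terminal:
            target = reward
        else:
            target = reward + self.gamma * self.best_value(next_state)
        key = (state, action)
        self.q[key] += self.alpha * (target - self.q[key])


learner = TabularQLearner(alpha=0.5, gamma=0.9, epsilon=0.0)
learner.update("s1", "right", 10.0, "goal", terminal=True)
learner.update("s0", "right", -1.0, "s1", terminal=False)
print(learner.choose("s0"))
```

Because the update uses the maximum over next actions rather than the action actually taken, the value of reaching the treasure propagates backwards one step per visit, which is consistent with the player needing many episodes before it finds the shortest route.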
Here are some ideas for extending grid.py, in increasing order of
difficulty. Some early steps may be useful for later steps.
Watch different random mazes and see how features like long hallways
affect learning.
Change the learning rate α and future reward discount γ to try to
improve the effectiveness of the learner.
Display summaries of results. For example, graph the distribution of rewards
(Simulation.score) over repeated runs. It may be useful to separate the game
simulation loop and the display so that you can simulate faster.
Implement TD(λ) and eligibility traces. How do features like long
hallways affect learning now?
Q-learning is an “off-policy” learning algorithm, which means the policy
controlling the player and the policy being learned can be different. Adapt
the HumanPlayer to permit a learner to learn by observing a human play.
Save policies to disk and load them later. This will let you checkpoint and
restart learning.
Generate a new maze each episode. The state space will be too large
for a QTable to be useful, so implement a value function
approximator such as a neural network. The QTable memorizes the
fixed map; with multiple maps you will need to feed the maze as
input to the neural network so it can “see” the map instead of
memorizing it.
Connect the learner to an actual roguelike such as NetHack and have it
speedrun for dungeon depth.
Change the problem from one with discrete states (a grid) and actions (up,
down, left, right) to continuous states (the player is at fine-grained x-y
coordinates with a certain heading and velocity) and actions (acceleration and
turning). How will you relate states to each other in a continuous space?
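One possible answer to the last question is to describe each continuous state by its similarity to a fixed set of reference points, so that nearby states share most of their features and therefore most of what has been learned. The sketch below uses Gaussian radial basis features with a linear Q estimate. Everything in it is an assumption made for illustration: the names rbf_features, grid_centers, and LinearQ, the unit-square state range, the width parameter, and the simplifications of using only the x-y position (no heading or velocity) and a small discrete set of actions.

```python
import math


def rbf_features(x, y, centers, width=0.25):
    """Map a continuous (x, y) position to one Gaussian activation per center."""
    return [
        math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * width ** 2))
        for cx, cy in centers
    ]


def grid_centers(n):
    """Place n x n centers evenly over the unit square (assumed state range)."""
    return [(i / (n - 1), j / (n - 1)) for i in range(n) for j in range(n)]


class LinearQ:
    """Hypothetical linear Q approximator with one weight vector per action."""

    def __init__(self, actions, centers, alpha=0.1, gamma=0.9):
        self.actions = actions
        self.centers = centers
        self.alpha = alpha  # learning rate α
        self.gamma = gamma  # future reward discount γ
        self.weights = {a: [0.0] * len(centers) for a in actions}

    def value(self, state, action):
        phi = rbf_features(*state, self.centers)
        return sum(w * f for w, f in zip(self.weights[action], phi))

    def update(self, state, action, reward, next_state, terminal):
        # Semi-gradient Q-learning step on the weights of the action taken.
        if terminal:
            best_next = 0.0
        else:
            best_next = max(self.value(next_state, a) for a in self.actions)
        error = reward + self.gamma * best_next - self.value(state, action)
        phi = rbf_features(*state, self.centers)
        w = self.weights[action]
        for i, f in enumerate(phi):
            w[i] += self.alpha * error * f


q = LinearQ(("accelerate", "turn_left", "turn_right"), grid_centers(5))
q.update((0.5, 0.5), "accelerate", 1.0, (0.5, 0.5), terminal=True)
# A state near (0.5, 0.5) picks up more of the reward than a distant one.
print(q.value((0.55, 0.5), "accelerate"), q.value((0.0, 0.0), "accelerate"))
```

A single rewarded step changes the estimate for every state in proportion to how close it is to the visited one, which is the kind of generalization a table with one entry per state cannot provide. Handling heading, velocity, and truly continuous actions would need more features and a different action-selection scheme.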
Simple Reinforcement Learning