
Assignment 05: Q-learning for a straightforward navigation task

In this assignment, you will be implementing two versions of Q-learning for a straightforward
navigation task; one version will use the basic algorithm, whereas the other will adjust weights that
can then be used to compute Q-values for all state-action pairs.
The download for this assignment contains a single text-file, pipe_world.txt, which describes
a simple environment in which:
• S indicates the starting location for an agent.
• G indicates a goal location.
• _ indicates an empty location.
• M indicates the location of an explosive mine.
In this environment, the agent begins in the starting location, and can move in one of four directions
(up, down, left, and right). Any move that attempts to leave the bounds of the map fails (and
the agent stays where it is). For any other move, the move succeeds with probability 0.8, and the
agent enters the desired square; with probability 0.2, the agent slips to one side or the other as it
moves in the expected direction, with each direction of slippage being equally probable. Thus, for
instance, if the agent starts in the state marked S in the figure below and attempts to go to the
right, it will end up in one of 3 possible locations, each with the probability given.
[Figure: from S, an attempted move to the right reaches the square to the right with
probability 0.8, or one of the two squares to either side with probability 0.1 each.]
As the agent moves, it receives rewards as follows:
• If it enters a location containing a mine, it receives reward rM = −100.
• If it enters the goal location, it receives reward rG = 0.
• Any other result leads to reward rO = −1.
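For concreteness, the environment described above could be simulated with a small step function along the lines of the sketch below. This is only an illustrative sketch in Python: the grid is assumed to be pipe_world.txt read in as a list of strings, positions are (row, column) pairs, and the slip squares are taken to be the squares to either side of the agent's current location (one reading of the rule above); none of the names here (ACTIONS, SLIPS, step) are part of the assignment.

import random

# Assumed encoding: (row, col) positions; actions as unit offsets.
ACTIONS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
# Perpendicular directions the agent may slip toward for each intended action.
SLIPS = {"U": ("L", "R"), "D": ("L", "R"), "L": ("U", "D"), "R": ("U", "D")}

R_MINE, R_GOAL, R_OTHER = -100.0, 0.0, -1.0


def step(grid, state, action):
    """Sample (next_state, reward, done) for one move in the pipe world."""
    rows, cols = len(grid), len(grid[0])

    def target(pos, act):
        dr, dc = ACTIONS[act]
        return pos[0] + dr, pos[1] + dc

    def in_bounds(pos):
        return 0 <= pos[0] < rows and 0 <= pos[1] < cols

    if not in_bounds(target(state, action)):
        # A move whose intended target is off the map fails: the agent stays put.
        next_state = state
    else:
        # 0.8: intended square; 0.1 each: slip to one side of the start square
        # (a diagonal reading of the slip rule is also defensible).
        roll = random.random()
        if roll < 0.8:
            actual = action
        elif roll < 0.9:
            actual = SLIPS[action][0]
        else:
            actual = SLIPS[action][1]
        candidate = target(state, actual)
        # Assumed: a slip that would leave the map also leaves the agent in place.
        next_state = candidate if in_bounds(candidate) else state

    cell = grid[next_state[0]][next_state[1]]
    if cell == "M":
        return next_state, R_MINE, True
    if cell == "G":
        return next_state, R_GOAL, True
    return next_state, R_OTHER, False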
For full points, you will write code that does the following:
1. (20 pts.) Your code will first perform Q-learning on the problem. Your code should:
(a) Use a data structure to store explicit values Q(s, a) for each state (location in the gridworld) and action encountered during learning.
(b) Perform 10,000 learning episodes; each episode will continue until the agent:
i. Reaches the goal location.
ii. Hits a mine.
iii. Takes a number of steps equal to the entire size (width×height) of the environment.
At the beginning of each episode, the agent will begin from the starting location.
(c) You will use control parameters as follows:
i. Future discount: γ = 0.9.
ii. Policy randomness: you will start with ε = 0.9; at the end of every 200 learning
episodes, you will reduce the randomness, setting it to ε = 0.9/(⌊E/200⌋ + 1), where E is
the number of the episode just completed. This means that after E = 200, 400, 600, . . .,
we will have ε = 0.45, 0.3, 0.225, . . .. Note: on the final run of learning, the agent
should set ε = 0, so that it acts in a purely greedy fashion.
iii. Learning rate: you will start with α = 0.9; at the end of every 1,000 learning episodes,
you will reduce the rate, setting it to α = 0.9/(⌊E/1,000⌋ + 1), where E is the number
of the episode just completed. This means that after E = 1000, 2000, 3000, . . ., we
will have α = 0.45, 0.3, 0.225, . . .. (A sketch of a training loop implementing these
schedules appears after this list.)
(d) After every 100 learning episodes, your code will pause and do 50 test-runs of its current
greedy policy. That is, it will do 50 runs, from the starting location, choosing the action
a that maximizes Q(s, a) for each state s (location) it encounters; each run will continue
until one of the three stopping conditions indicated above is met. During the runs, the
program will keep track of the total reward received, and at the end will write out the
average reward to a file. (I recommend writing to a CSV file, as this will make graphing
your results later easier.) Note: on the last run of learning, this will be a test of the
average performance of the final learned policy.
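One possible shape for the tabular learner described in part 1 is sketched below. It is only a sketch: it reuses the step function from the earlier sketch, the ε and α schedules follow item (c), and names such as run_episode, train_tabular, and the CSV filename are illustrative assumptions, not requirements.

import csv
import random
from collections import defaultdict

ACTION_NAMES = ("U", "D", "L", "R")
GAMMA = 0.9


def greedy_action(Q, state):
    # Argmax over the four actions (ties broken by action order).
    return max(ACTION_NAMES, key=lambda a: Q[(state, a)])


def run_episode(grid, Q, start, epsilon, alpha, learn=True):
    """Run one episode; update Q in place when learning. Returns total reward."""
    max_steps = len(grid) * len(grid[0])      # stop after width x height steps
    state, total = start, 0.0
    for _ in range(max_steps):
        if learn and random.random() < epsilon:
            action = random.choice(ACTION_NAMES)
        else:
            action = greedy_action(Q, state)
        next_state, reward, done = step(grid, state, action)   # from earlier sketch
        total += reward
        if learn:
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTION_NAMES)
            Q[(state, action)] += alpha * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state
        if done:
            break
    return total


def train_tabular(grid, start, episodes=10_000, out_path="tabular_rewards.csv"):
    Q = defaultdict(float)                    # Q(s, a), default 0 for unseen pairs
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for e in range(1, episodes + 1):
            # Schedules from item (c): epsilon drops after every 200 completed
            # episodes, alpha after every 1,000; epsilon = 0 on the final episode.
            epsilon = 0.0 if e == episodes else 0.9 / ((e - 1) // 200 + 1)
            alpha = 0.9 / ((e - 1) // 1000 + 1)
            run_episode(grid, Q, start, epsilon, alpha, learn=True)
            if e % 100 == 0:
                # 50 greedy test-runs; record the average reward for graphing.
                avg = sum(run_episode(grid, Q, start, 0.0, 0.0, learn=False)
                          for _ in range(50)) / 50.0
                writer.writerow([e, avg])
    return Q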
2. (20 pts.) After the above is complete, your code will do a feature-based version of Q-learning.
In this version, the number of learning episodes and all control parameters will be exactly as
before, and again the code will print out the average reward gained over 50 test-runs of its
current greedy policy, every 100 learning episodes. However, in this version, there will be no
explicit data structure to store Q-values; instead the code will do the following:
(a) It will store a weight vector w = (w1, w2); initially, w = (0, 0).
(b) For any state-action pair (s, a), there will be a feature vector f(s,a) = (f1, f2) as follows:
i. f1 is a normalized Manhattan distance metric, calculated by assuming that the action
a occurs with no slippage, and then returning f1 = MD(s′)/MD+, where s′ is the location
that would result from that action, MD(s′) is the Manhattan distance∗ from s′ to the
goal location, and MD+ is the maximum possible such distance.
∗See: https://en.wikipedia.org/wiki/Taxicab_geometry
As an example, consider the image below. If we suppose the agent is in location s, marked
with an A, then for each of the four actions s′ would be the state indicated by the
corresponding arrow. In this smaller version of the grid, MD+ = 4 (from either of the
top corners to the goal marked G). Thus, each of the possible directions gives us the
f1 value indicated; for instance, for the down direction, we have f1 = 1/4 = 0.25.
[Figure: a small grid with the agent at A and the goal G below it; the arrows for up,
left, and right are labeled f1 = 0.75, and the arrow for down is labeled f1 = 0.25.]
ii. f2 is based upon the action a and two inverse distance metrics, calculated based
upon the distance to the mines to the left and right of an agent. As shown in the
image below, each inverse distance is 1/d, where d is the number of moves needed
to reach the mine in that direction. For an agent starting in the square marked A,
for instance, the nearest mine to the left is 2 squares away, for an inverse distance
of 1/2 = 0.5, and to the right the inverse distance is 1/4 = 0.25.
[Figure: a single row of the grid, with a mine M two squares to the left of A and a mine M
four squares to the right; the inverse distances are 0.5 and 0.25, respectively.]
We set f2 dependent upon action a as follows:
• If a is a move left or right, then f2 is set to the inverse distance in that direction.
• If a is a move up or down, then f2 is set to the minimum of the two distances.
Thus, for the example just given, we have:
– f2 = 0.5 if a = left
– f2 = 0.25 if a = right
– f2 = 0.25 if a = up
– f2 = 0.5 if a = down
(c) For any state-action pair the Q-value is calculated as follows:
Q(s, a) = w · f(s,a) = w1f1 + w2f2
(d) After each Q-learning iteration, we update weights as explained in lecture 24:
δ = r + γ max_a′ Q(s′, a′) − Q(s, a)
∀i, wi ← wi + α δ fi(s, a)
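Under one reading of parts (a)-(d), the feature computation and the weight update could look like the sketch below (again Python, reusing ACTIONS from the earlier sketch). The handling of off-map moves, of rows with no mine in one direction, and of the up/down case of f2 are assumptions where the text above leaves room for interpretation, and every name here is illustrative.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def features(grid, goal, md_max, state, action):
    """Return (f1, f2) for a state-action pair."""
    rows, cols = len(grid), len(grid[0])
    r, c = state

    # f1: normalized Manhattan distance from the square the action would reach
    # with no slippage (assumed: an off-map move leaves the agent in place).
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < rows and 0 <= nc < cols):
        nr, nc = r, c
    f1 = manhattan((nr, nc), goal) / md_max

    # f2: inverse distances 1/d to the nearest mine to the left and to the right
    # in the agent's row (assumed to be 0 when no mine lies in that direction).
    inv_left = inv_right = 0.0
    for cc in range(c - 1, -1, -1):
        if grid[r][cc] == "M":
            inv_left = 1.0 / (c - cc)
            break
    for cc in range(c + 1, cols):
        if grid[r][cc] == "M":
            inv_right = 1.0 / (cc - c)
            break
    if action == "L":
        f2 = inv_left
    elif action == "R":
        f2 = inv_right
    else:
        # One reading of "the minimum of the two distances" for up/down moves.
        f2 = min(inv_left, inv_right)
    return f1, f2


def q_value(w, f):
    return w[0] * f[0] + w[1] * f[1]


def weight_update(w, f, reward, next_state_features, done, alpha, gamma=0.9):
    """delta = r + gamma * max_a' Q(s', a') - Q(s, a);  w_i += alpha * delta * f_i."""
    best_next = 0.0 if done else max(q_value(w, g) for g in next_state_features)
    delta = reward + gamma * best_next - q_value(w, f)
    return (w[0] + alpha * delta * f[0], w[1] + alpha * delta * f[1])

Here md_max would be computed once from the grid (the largest Manhattan distance from any square to the goal), and next_state_features is the list of feature vectors for the four actions available from s′; the surrounding episode loop is the same as in the tabular sketch, with the Q-table replaced by the weight vector w.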
3. (10 pts.) Along with your code (written in a clear, well-organized, and well-documented fashion),
you will hand in the following:
• A text-file containing the final policy learned by each of the two algorithms. The policies
should each be labeled to indicate which is which (one way to write such a file is sketched
after this list). For each square of the grid, print a single character (row by row, column
by column):
– For any square containing a mine or the goal, simply print M or G, respectively, as
in the original input.
– For any other square, print out the action chosen, using U, D, L, and R for up, down,
left, and right, respectively.
• A graph containing the average-reward data generated by the two algorithms after every
100 episodes. The graph axes should be clearly labeled, and it should be clear which
data-series is which.
• A text-file containing some analysis of the results shown in the graph and final policies.
You need a couple of paragraphs to explain what you are seeing in the data and why
(you believe) you get the results that you do.
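To produce the policy text-file described in the first item of this list, a short helper along the following lines would do. This is a sketch that assumes the grid and action conventions of the earlier sketches; choose_action stands for any function that returns the greedy action ('U', 'D', 'L', or 'R') for a location, such as an argmax over the Q-table or over w · f(s, a).

def write_policy(grid, choose_action, path, label):
    """Append one labeled policy grid, one character per square, to the file at path."""
    with open(path, "a") as f:
        f.write(label + "\n")
        for r, row in enumerate(grid):
            line = []
            for c, cell in enumerate(row):
                if cell in ("M", "G"):
                    line.append(cell)                    # mines and the goal printed as-is
                else:
                    line.append(choose_action((r, c)))   # greedy action for this square
            f.write("".join(line) + "\n")
        f.write("\n")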
You will hand in a single compressed archive containing:
• Complete source code.
• Any instructions necessary to compile and/or execute the code.
• The text-files and graph described above.