CMPUT 366: Assignment #2
Real-time Heuristic Search

Abstract
For this assignment use the following consultation model:
• you can discuss assignment questions and exchange ideas with other CMPUT 366 students;
• you must list all members of the discussion in your solution;
• you may not share/exchange/discuss written material and/or code;
• you must write up your solutions individually;
• you must fully understand and be able to explain your solution in any amount of detail as requested by
the instructor and/or the TAs.

Submission
The assignment you downloaded from eClass is a single PDF file. Your answers are to be submitted electronically via eClass.
Your submission must be a single PDF file with all of your answers. To generate the PDF file you can do any of the following:
1. use PDF-annotation software to embed your answers into the PDF file;
2. print out the provided PDF file and legibly write your answers in the blank spaces under each question. Make sure you
write as legibly as possible, for we cannot give you any points if we cannot read your handwriting. Then scan the pages
and include the scan in your submission to be uploaded on eClass. Taking cellphone pictures of the pages does not
normally produce legible scans; avoid doing so;
3. use your favourite text processor (e.g., LaTeX) and type up your answers there. Make sure you number your answers in
the same way as the questions are numbered in this assignment.
1 Real-time Heuristic Search
A real-time heuristic search agent is traversing a graph G = (S, E) where S is the set of
states/vertices and E is the set of edges connecting the vertices. For each state s ∈ S, N(s)
is the set of its immediate neighbors: N(s) = {s' ∈ S | (s, s') ∈ E}. Both S and E are
finite. All edges in E are undirected (i.e., if an edge (s1, s2) ∈ E then the agent can move
from s1 to s2 and also from s2 to s1). The edge cost is a function c : E → R. All edge costs
are above zero and symmetric: c(s1, s2) = c(s2, s1). The start state is s0 ∈ S. The goal state
is sg ∈ S. The problem is solvable: there is a path from s0 to sg in the graph. The agent's
task is to reach sg from s0 while minimizing the total cost of all edges traversed in the process.
The agent has access to a heuristic h : S → R which is initially 0 for all states. The agent
can modify h at its current position in any way it wants.
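For concreteness, one possible way to encode such a search problem in Python is sketched below. This is only an illustration: the adjacency-dictionary layout and the names graph, neighbours, cost, and h are assumptions, not anything prescribed by the assignment.

```python
# A toy undirected graph encoded as an adjacency dictionary: each key is a
# state and its value maps every neighbour to the (positive, symmetric) cost
# of the connecting edge.  The state names and costs below are illustrative.
graph = {
    "s0": {"s1": 1.0},
    "s1": {"s0": 1.0, "s2": 1.0},
    "s2": {"s1": 1.0},
}

def neighbours(s):
    """N(s): the immediate neighbours of state s."""
    return set(graph[s])

def cost(s1, s2):
    """c(s1, s2): edge cost; symmetric because both directions are stored."""
    return graph[s1][s2]

# The heuristic h : S -> R, initially 0 for every state; the agent may update
# its value at the agent's current state as the search proceeds.
h = {s: 0.0 for s in graph}
```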
1.1
[20 points] Consider the following learning real-time heuristic search algorithm:

Algorithm 1: Basic LRTA*
input : search problem (G, c, s0, sg, h)
output: solution path (s0, s1, ..., sg)
1  t ← 0
2  h_t ← h
3  while s_t ≠ s_g do
4      learn: h_{t+1}(s_t) ← max( h_t(s_t), min_{s ∈ N(s_t)} (c(s_t, s) + h_t(s)) )
5      move: s_{t+1} ← argmin_{s ∈ N(s_t)} (c(s_t, s) + h_t(s))
6      t ← t + 1
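For readers who find code easier to follow, here is a minimal Python sketch of Algorithm 1 using the adjacency-dictionary encoding sketched in the section introduction. The function name lrta_star and the tie_break parameter are illustrative assumptions; the sketch is an aid only, not part of any required answer.

```python
def lrta_star(graph, s0, sg, tie_break=None):
    """Basic LRTA* (Algorithm 1) on an undirected graph encoded as an
    adjacency dictionary: graph[s][n] is the edge cost c(s, n).

    `tie_break` is an optional function mapping a neighbour to a sort key;
    among neighbours with equal c + h it picks the one with the smallest
    key, modelling the fixed tie-breaking preference used in line 5.
    """
    h = {s: 0.0 for s in graph}            # initial heuristic: 0 everywhere
    s, path = s0, [s0]
    while s != sg:                          # line 3
        # f(n) = c(s, n) + h(n) for every neighbour n of the current state
        f = {n: c + h[n] for n, c in graph[s].items()}
        h[s] = max(h[s], min(f.values()))   # line 4: learn
        if tie_break:                       # line 5: move (argmin, ties
            s = min(f, key=lambda n: (f[n], tie_break(n)))  # broken by preference)
        else:
            s = min(f, key=lambda n: f[n])
        path.append(s)                      # line 6: t <- t + 1
    return path, h

# Hypothetical usage on the three-state chain sketched earlier:
# path, h = lrta_star(graph, "s0", "s2")
```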
Suppose the argmin operator in line 5 of Algorithm 1 breaks ties in a fixed order of
preference (e.g., traversing the edge to the left is always preferred to the edge to the right
when the c+h values of both neighbors are identical). Give an example of a small search graph
on which the agent finds two different solutions under two different tie-breaking preference
orders. The initial heuristic should be 0 for each state.
List each solution (i.e., the sequence of states visited by the agent) and explain why
different solutions will be produced.
1.2
[2 points] Suppose the set of edges E is modified so that the goal state sg is no longer reachable
from the start state s0 (i.e., the search problem becomes unsolvable). Will Algorithm 1 stop
and declare that no solution exists or will it run forever? Justify your answer.
1.3
[8 points] Consider a unit in a real-time strategy game (e.g., StarCraft) which has to go from
its current location to the goal on the map in the figure below. The unit cannot go through walls.
Assume the unit occupies a single grid cell and can move only in the four cardinal directions.
Assume the initial heuristic is the Euclidean distance to the goal state.
[Figure: a grid map containing a goal cell and several wall cells.]
List two pros and two cons of using Algorithm 1 for pathfinding in this problem.
1.4
[8 points] Consider the following search problem: S = {s0, s1, s2, s3}; the start state is s1,
the goal state is s3. The undirected edges are E = {(s0, s1), (s1, s2), (s2, s3)} and each edge
costs 1. The initial heuristic is 0 for all states.
Suppose the argmin operator in line 5 of Algorithm 1 breaks ties towards the state with
a higher index (e.g., if the agent is in s1 and c(s1, s0) + h(s0) = c(s1, s2) + h(s2) then the
agent moves to s2). List heuristic values at each time step as the agent traverses the graph
until the goal state is reached. Indicate the agent’s position at each time step.
1.5
[8 points] With the search problem from Question 1.4, suppose now that the argmin operator
in line 5 of Algorithm 1 breaks ties towards the state with a lower index (e.g., if the agent
is in s1 and c(s1, s0) + h(s0) = c(s1, s2) + h(s2) then the agent moves to s0). List heuristic
values at each time step as the agent traverses the graph until the goal state is reached.
Indicate the agent’s position at each time step.
2 Markov Decision Processes
Two coins are placed in a tray; each coin is either heads up or tails up with equal probability,
independent of the other coin. A robot has a single arm which can be positioned above either
of the coins. At each time step, the robot can perform one of the following operations:
• Randomize the coin below it: the coin under the arm will be randomly set to either
heads up or tails up with equal probability. This operation costs the robot 1 point.
• Move to be above the other coin; this operation costs the robot 2 points.
• Call the Fairy, who will reward the robot with 20 points if both coins are heads up,
reward the robot with 10 points if both are tails up, or fine the robot 5 points if the
coins mismatch. After rewarding or fining the robot, the Fairy will then randomize
both coins before leaving.
2.1
[10 points] Represent this scenario formally as a Markov Decision Process (MDP), treating
points given to or taken away from the robot as the reward signal. Provide the tuple
(S, A, R, p) as per the lecture slides.
2.2
[2 points] Should this be viewed as a continuing or episodic MDP? Justify your answer.
2.3
[3 points] Suppose you have an MDP where each state has two actions available to it:
A = {a*, a−}. Action a* is optimal. Action a− is suboptimal. In each state, the policy π
selects an action randomly ε percent of the time and selects the optimal action a* the rest of
the time. Express the policy π as a distribution over actions given a state. In other words,
come up with a mathematical expression for π(a|s): the probability of selecting action a ∈ A
in state s. Justify your answer.
3 Value Functions
The value function V^π(s) introduced in the lectures is the return an agent can expect from
state s on, under the policy π. An alternative is a value function defined on state-action
pairs: Q^π(s, a) is the return an agent can expect by taking action a in state s and then
following the policy π thereafter.
3.1
[2 points] Give an equation expressing V^π via Q^π. Your equation should have V^π only on
the left side and Q^π only on the right side.
3.2
[2 points] Now go the other way: give an equation expressing Q^π via V^π. Your equation
should have Q^π only on the left side and V^π only on the right side.
3.3
[4 points] The lectures gave the Bellman equation for V^π, recursively expressing V^π through
itself and other variables:

    V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r|s, a) [ r + γ V^π(s') ].

Give an analogous Bellman equation for Q^π. The equation should express Q^π(s, a) through
the environment dynamics p(s', r|s, a), Q^π for other states and actions, the discount factor
γ, and all the variables involved in those expressions.
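As a side note, the displayed Bellman equation can be applied repeatedly as an update rule, which yields iterative policy evaluation. The Python sketch below only illustrates that use of the V^π equation (it does not give the Q^π equation asked for here); the names p, pi, and gamma are assumptions about how the dynamics and policy might be stored.

```python
def policy_evaluation(states, actions, p, pi, gamma, tol=1e-10):
    """Iterative policy evaluation: repeatedly apply the Bellman equation
    for V^pi as an update until the value estimates stop changing.

    p[(s, a)] is a list of (next_state, reward, probability) triples and
    pi[(s, a)] is the probability the policy picks action a in state s.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Right-hand side of the Bellman equation for V^pi at state s;
            # state-action pairs absent from p contribute nothing.
            v_new = sum(
                pi[(s, a)] * sum(prob * (r + gamma * V[s2])
                                 for s2, r, prob in p.get((s, a), []))
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```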
3.4
[2 points] V^*(s) is the expectation of the highest possible return collectible from state s
(by acting optimally). Similarly, we can define Q^*(s, a) as the expectation of the highest
possible return collectible by taking action a in state s and acting optimally thereafter. Give
an equation expressing V^* via Q^*. Your equation should have V^* only on the left side and
Q^* only on the right side.
3.5
[2 points] Now go the other way: give an equation expressing Q^* via V^*. Your equation
should have Q^* only on the left side and V^* only on the right side.
3.6
[2 points] Give the Bellman optimality equation for Q^*, expressing Q^* via itself.
4 Policy Improvement
Consider the following MDP with actions A = {a, b}, states S = {W, X, Y, Z}, and the
following dynamics:
p(X, 4|W, a) = 0.5     p(Y, 2|W, a) = 0.5     p(Z, 3|W, b) = 1
p(Z, 0|X, a) = 0.8     p(Z, 25|X, a) = 0.2    p(Z, 0|X, b) = 1
p(Z, 1|Y, a) = 1       p(Z, 5|Y, b) = 1
p(Z, 0|Z, a) = 1       p(Z, 0|Z, b) = 1
All unspecified transitions have probability 0.
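For concreteness, the dynamics above can be transcribed into the dictionary format assumed by the policy-evaluation sketch in Section 3. This only encodes the given table; it is an illustrative representation, not part of any required answer.

```python
# p[(s, a)] -> list of (next_state, reward, probability) triples,
# transcribed directly from the dynamics above; unspecified transitions
# (probability 0) are simply absent.
p = {
    ("W", "a"): [("X", 4, 0.5), ("Y", 2, 0.5)],
    ("W", "b"): [("Z", 3, 1.0)],
    ("X", "a"): [("Z", 0, 0.8), ("Z", 25, 0.2)],
    ("X", "b"): [("Z", 0, 1.0)],
    ("Y", "a"): [("Z", 1, 1.0)],
    ("Y", "b"): [("Z", 5, 1.0)],
    ("Z", "a"): [("Z", 0, 1.0)],
    ("Z", "b"): [("Z", 0, 1.0)],
}
```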
4.1
[8 points] Consider the random policy π(a|s) = π(b|s) = 0.5 for any state s ∈ S. Under the
discount rate of γ = 0.8, what is the value V^π(s) for each state s ∈ S?
4.2
[4 points] Construct a deterministic policy π′ that strictly improves upon π from the previous
question. To do so you must show that V^π′(s) ≥ V^π(s) for all states and V^π′(s) > V^π(s) for
at least one state.
Tip: to show the inequalities above you can first show that Q^π(s, π′(s)) ≥ V^π(s) for
all states and Q^π(s, π′(s)) > V^π(s) for at least one state. Then you can use the Policy
Improvement Theorem.
5 Monte Carlo Prediction
5.1
[8 points] Consider an episodic MDP with actions A = {a, b, c}, states S = {W, X, Y, Z}
(with terminal state Z) and unknown dynamics. Suppose that you have used a policy π to
generate 4 episodes with the following trajectories:
[S0 = W, A0 = a, R1 = 0, S1 = X, A1 = a, R2 = 10, S2 = Y, A2 = b, R3 = 0, S3 = Z],
[S0 = W, A0 = a, R1 = −10, S1 = X, A1 = b, R2 = 0, S2 = Z],
[S0 = W, A0 = b, R1 = 2, S1 = Y, A1 = c, R2 = 6, S2 = Z],
[S0 = W, A0 = b, R1 = 0, S1 = Y, A1 = c, R2 = 12, S2 = Z].
Use first-visit Monte Carlo prediction to estimate V^π(s) for every s ∈ S. Assume non-discounted
rewards (i.e., γ = 1). Show your work.
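As a reminder of the method itself (a generic sketch, not the worked answer to this question), first-visit Monte Carlo prediction averages, for each state, the returns observed after the first visit to that state in each episode. The episode format below, lists of (state, action, reward) triples, is an assumption for illustration.

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """First-visit Monte Carlo prediction of V^pi.

    Each episode is a list of (S_t, A_t, R_{t+1}) triples; the terminal
    state is not listed.  For every state, the return following its first
    visit in each episode is recorded, and V(s) is the average of those
    returns.
    """
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        first_visit_return = {}
        # Walk the episode backwards, accumulating the return G_t.
        for s, _, r in reversed(episode):
            G = gamma * G + r
            first_visit_return[s] = G   # overwritten until the first visit wins
        for s, G_s in first_visit_return.items():
            returns[s].append(G_s)
    return {s: sum(g) / len(g) for s, g in returns.items()}
```

With γ = 1, the return following a first visit is simply the sum of the remaining rewards in that episode.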
