COMP-424: Artificial Intelligence
Homework 4
General instructions.
Submit a single PDF document containing all pages of your written solution via your McGill
myCourses account. You may scan in hand-written pages. If necessary, learn how to combine multiple
PDF files into one.
Question 1: Hidden Markov Models
You need to design an HMM that can be used to make decisions about buying and selling
stocks. Your HMM should have two states {High, Low} and three observations {Buy, Sell,
Keep}.
The starting probability of the stock market being in the Low state is 0.4. At each time step, if
the market is in the Low state, there is a 0.4 probability that it will remain in the Low state
during the next time step. If the market is in the High state, there is a 0.3 probability that it
will remain in the High state. Given that the market is in the High state, the probability of
observing Sell is 0.6 and the probability of observing Keep is 0.3. If the market is in the Low
state, the probability of observing Buy is 0.5 and the probability of observing Keep is 0.3.
a) Draw the HMM that describes this scenario, clearly indicating the relevant parameters
using conditional probability tables.
b) What is the probability of the observation sequence {Buy, Keep, Sell}?
c) What is the probability of the stock state being High after observing the sequence in b)?
d) What is the most likely sequence of three states that explains the observation sequence in b)?
For each of these tasks, write down your calculations, or provide the code that you wrote to
compute the answer.
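If you take the code route, the following is a minimal sketch (in Python, using NumPy) of how the forward and Viterbi computations could be set up for this model. It assumes the parameters stated above, with the unstated emission probabilities filled in so that each row sums to 1 (Buy given High = 0.1, Sell given Low = 0.2) and a starting P(High) = 0.6; all function and variable names are illustrative, not prescribed by the assignment.

import numpy as np

# States: 0 = High, 1 = Low. Observations: 0 = Buy, 1 = Sell, 2 = Keep.
pi = np.array([0.6, 0.4])            # initial distribution: P(High), P(Low)
T = np.array([[0.3, 0.7],            # transition probabilities from High
              [0.6, 0.4]])           # transition probabilities from Low
E = np.array([[0.1, 0.6, 0.3],       # emission probs given High: Buy, Sell, Keep
              [0.5, 0.2, 0.3]])      # emission probs given Low:  Buy, Sell, Keep
obs = [0, 2, 1]                      # the sequence Buy, Keep, Sell

def forward(pi, T, E, obs):
    # alpha[t, s] = P(o_1..o_t, S_t = s)
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * E[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ T) * E[:, obs[t]]
    return alpha

def viterbi(pi, T, E, obs):
    # delta[s] = probability of the best state path ending in state s at time t
    delta = pi * E[:, obs[0]]
    backpointers = []
    for t in range(1, len(obs)):
        scores = delta[:, None] * T * E[:, obs[t]]
        backpointers.append(scores.argmax(axis=0))
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]
    for ptr in reversed(backpointers):
        path.append(int(ptr[path[-1]]))
    return list(reversed(path))

alpha = forward(pi, T, E, obs)
print("P(Buy, Keep, Sell) =", alpha[-1].sum())                         # part b)
print("P(High | Buy, Keep, Sell) =", alpha[-1, 0] / alpha[-1].sum())   # part c)
print("Most likely state sequence (0=High, 1=Low):", viterbi(pi, T, E, obs))  # part d)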
Question 2: Utility
Consider the Bayes net shown below, with all Bernoulli
variables, which models driving accidents. Having an accident
has a utility of -200 if the car was insured, but a utility of
-400 if the car was not insured (or the insurance claim is
denied). Having insurance when there is no accident has a
utility of -10, and having neither insurance nor an accident has a
utility of 0.
Use the principle of Maximum Expected Utility and Value of Information to
answer the following questions. For parts a)-c), assume the insurance
company pays out claims in 100% of cases.
a) Given no information on whether the driver will be driving in a sick
or drunk state, how much should they pay to insure the car?
b) How much should they pay for insurance if you know for certain that
they will drive both sick (S=True) and not drunk (D=False)?
c) How much should they pay for insurance if you know that they will drive drunk
(D=True)?
d) A company is offering cheaper insurance but has a reputation of rejecting 20% of
insurance claims. How much should they charge for this insurance, to make it
competitive with the insurance offered by the more reliable company? (Hint: Set the
cost of the new insurance to have the same MEU as the other insurance.)
P(S):
S = F: 0.8, S = T: 0.2

P(D):
D = F: 0.9, D = T: 0.1

P(A = T | S, D):
S = F, D = F: 0.1
S = F, D = T: 0.2
S = T, D = F: 0.3
S = T, D = T: 0.5
[Figure: Bayes net with nodes Sick (S), Drunk (D), and Accident (A); S and D are parents of A.]
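As a check on hand calculations for Question 2, the following is a minimal sketch (in Python) of the expected-utility computation, assuming the CPTs and utilities above; the function names, the evidence dictionary, and the pay_prob parameter are illustrative choices, not part of the assignment.

import itertools

P_S = {False: 0.8, True: 0.2}                     # P(Sick)
P_D = {False: 0.9, True: 0.1}                     # P(Drunk)
P_A = {(False, False): 0.1, (False, True): 0.2,   # P(Accident = T | S, D)
       (True, False): 0.3, (True, True): 0.5}

def utility(insured, accident, claim_paid=True):
    # Utilities as stated in the question.
    if accident:
        return -200 if (insured and claim_paid) else -400
    return -10 if insured else 0

def expected_utility(insured, evidence=None, pay_prob=1.0):
    # Expected utility of the insurance decision, marginalizing over S and D
    # unless they are fixed by the evidence.
    evidence = evidence or {}
    total = 0.0
    for s, d in itertools.product([False, True], repeat=2):
        if ('S' in evidence and s != evidence['S']) or ('D' in evidence and d != evidence['D']):
            continue
        p_sd = (1.0 if 'S' in evidence else P_S[s]) * (1.0 if 'D' in evidence else P_D[d])
        p_acc = P_A[(s, d)]
        eu_accident = pay_prob * utility(insured, True, True) + (1 - pay_prob) * utility(insured, True, False)
        total += p_sd * (p_acc * eu_accident + (1 - p_acc) * utility(insured, False))
    return total

# Part a): the break-even price is the gap between the two decisions' expected utilities.
print(expected_utility(True) - expected_utility(False))
# Parts b) and c): pass evidence={'S': True, 'D': False} or evidence={'D': True}.
# Part d): use pay_prob=0.8 and equate the resulting MEUs, as suggested in the hint.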
Question 3: Markov Decision Processes
Consider the MDP shown in the figure at the end of this document. It has 6 states and 4 actions. As shown in the figure, the
transitions for all actions have a Pr=0.8 of succeeding (and leading to the state shown by the
arrow) and Pr=0.2 of failing (in which case the agent stays in place). For other transitions that
are not shown, assume that they cause the state to stay the same (e.g. T(S1,Left,S1)=1). The
rewards depend on state only and are shown in each node (state); rewards are the same for all
actions (e.g. R(S4, a) = +10, ∀a). Assume a discount factor of γ = 0.9.
a) Describe the space of all possible policies for this MDP. How many are there?
b) Assuming an initial policy π0(s) = Right, ∀s, perform policy evaluation to get the initial
value function for each state, Vπ0(s), ∀s.
c) Given the initial estimate, Vπ0, if you run an iteration of policy improvement, what will be
the new policy at each state? (If necessary, break ties alphabetically, e.g. “Down” before
“Left”, etc.)
d) What is the optimal value function at each state for this domain?
e) Is the optimal value function unique? Explain.
f) What is the optimal policy at each state for this domain?
g) Is the optimal policy unique? Explain.
h) Suggest a change to the reward function that changes the value function but does not
change the optimal policy.
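For parts b) and d) of Question 3, it is often easier to solve the policy-evaluation equations as a linear system than to iterate by hand. The following is a minimal sketch (in Python with NumPy) assuming the rewards shown in the figure; the transition matrix below is only a placeholder (an identity matrix) and must be filled in row by row from the figure for the policy being evaluated.

import numpy as np

gamma = 0.9
R = np.array([0, 0, 20, 10, 0, -10], dtype=float)   # R(S1), ..., R(S6) from the figure

def evaluate_policy(T_pi, R, gamma):
    # Solve V = R + gamma * T_pi @ V exactly, i.e. (I - gamma * T_pi) V = R.
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)

# T_pi[i, j] = Pr(next state = S_{j+1} | current state = S_{i+1}, action = pi(S_{i+1})).
# PLACEHOLDER: replace each row using the figure; a drawn transition succeeds with
# probability 0.8 and leaves the agent in place with probability 0.2, e.g. a
# hypothetical edge S1 -> S2 under the policy would be encoded as
# T_pi[0] = [0.2, 0.8, 0, 0, 0, 0].
T_pi = np.eye(6)

print(evaluate_policy(T_pi, R, gamma))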
Question 4: Bandits
Consider the following 5-armed bandit problem. The initial value estimates of the arms are given
by Q = {1, 1, 3, 1, 1}, and the actions are represented by A={1, 2, 3, 4, 5}. We observe a
trajectory consisting of plays and rewards: T={A1=3, R1=2, A2=5, R2=0, A3=3, R3=1, A4=1,
R4=0, A5=3, R5=0}.
a) Show the estimated Q values at each time step of the trajectory using the average of the
observed rewards, where available. Do not consider the initial estimates as samples.
b) It turns out the player was following an epsilon-greedy strategy. For each time step,
report whether it can be concluded with certainty that a random action was selected.
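If you prefer to tabulate part a) programmatically, the following is a minimal sketch (in Python) of the sample-average updates along the given trajectory; displaying the initial estimate for arms that have not yet been played, and the "certainly random" flag for part b), are interpretive choices, not requirements of the assignment.

initial_Q = [1.0, 1.0, 3.0, 1.0, 1.0]
trajectory = [(3, 2), (5, 0), (3, 1), (1, 0), (3, 0)]   # (action, reward) pairs

rewards = {a: [] for a in range(1, 6)}
Q = list(initial_Q)
for t, (action, reward) in enumerate(trajectory, start=1):
    # Part b): under epsilon-greedy, an action whose current estimate is strictly
    # below the maximum can only have come from the random (exploration) branch.
    certainly_random = Q[action - 1] < max(Q)
    rewards[action].append(reward)
    # Part a): sample average of observed rewards; initial estimates are shown
    # only for arms with no samples and are never mixed into the average.
    Q = [sum(rewards[a]) / len(rewards[a]) if rewards[a] else initial_Q[a - 1]
         for a in range(1, 6)]
    print(f"t={t}: played arm {action}, certainly random: {certainly_random}, Q = {Q}")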
[Figure for Question 3: a 6-state MDP with states S1 (R=0), S2 (R=0), S3 (R=+20), S4 (R=+10), S5 (R=0), S6 (R=-10) and actions Up, Down, Left, Right; the transition arrows are as drawn in the original figure.]