Assignment 3 Estimating the value function of a policy

Your shopping cart is empty.

Programming
Assignment 3
In this assignment, you will implement an
algorithm for estimating the value function of a
policy for a given MDP from a trajectory of the
form state, action, reward, state, action, reward, ….
Data Format
The input to your "evaluator" will be a text file that
provides information in the following format.
Number of states
Number of actions
Discount factor
state1 action1 reward1
state2 action2 reward2
state3 action3 reward3
.
.
.
stateN actionN rewardN
stateN+1
The number of states S and the number of actions A
will be integers greater than 0. Assume that the
states are numbered 0, 1, ..., S - 1, and the actions
numbered 0, 1, ..., A - 1. The discount factor will lie
between 0 (included) and 1 (excluded). The
trajectory over time will be long enough, and the
dynamics of the underlying MDP such, that there is
at least one outgoing transition from each state in
the MDP. Note that the MDP is not episodic: that is,
stateN+1 is not a terminal state (and can occur
within the trajectory multiple times). The trajectory
is merely a finite sequence generated according to
the underlying transition and reward functions, and
terminated at some arbitrary time step.
You can assume that S and A will not exceed 50, and
N, the total number of transitions in the trajectory,
will not exceed 500,000. In this data directory, you
will find two sample data files (d1.txt and d2.txt).
CS 747: Programming Assignment 3 https://www.cse.iitb.ac.in/~shivaram/teaching/c...
1 of 5 03/10/18, 14:14
Output
Given a data file, your evaluator must estimate the
value function V under the policy being followed.
The output, written to standard output, must be in
the following format (Est-V is your estimate of V).
Est-V(0)
Est-V(1)
.
.
.
Est-V(S - 1)
In the data directory enclosed, you will find output
files corresponding to the two data files, which have
solutions in the format above. The values
mentioned in these output files are indeed the true
values (under the same policy) from the MDP being
sampled. Naturally, as you will have to estimate
values based on samples alone, your estimates
cannot be expected to match the true values
perfectly.
Notice that since this is a prediction problem,
wherein a fixed policy is being followed, the actual
names of the actions taken do not matter. Nor does
it matter if the policy being followed is
deterministic or stochastic. Your logic only needs to
consider the state, reward, and next state associated
with each transition.
You are free to implement the evaluator in any
programming language of your choice. Since your
output will be checked automatically, make sure
you have nothing printed to stdout other than the S
lines as above in sequence. If the testing code is
unable to parse your output, you will not receive
any marks.
Submission
You must submit a directory titled [rollno] (such
as 1234567), which contains all your source and
executable files. The directory must contain a script
CS 747: Programming Assignment 3 https://www.cse.iitb.ac.in/~shivaram/teaching/c...
2 of 5 03/10/18, 14:14
titled evaluator.sh, which must take in exactly one
command line argument corresponding to a data
file. For testing your code, the following command
will be used from your [rollno] directory.
./evaluator.sh dataFileName
dataFileName will include the full path. Before you
submit, make sure you can successfully run
evaluator.sh on the departmental (sl2) machines.
Include a file called notes.txt in your [rollno]
directory, that (1) describes the algorithm your
evaluator implements, and (2) provides references
to any libraries and code snippets you have utilised.
It is okay to use libraries for data structures and for
operations such as sorting. However, the logic used
for value prediction must entirely be code that you
have written.
In summary: you must submit your [rollno]
directory, compressed as [rollno].tar.gz (say
1234567.tar.gz) through Moodle. The directory
must contain evaluator.sh, along with all the
sources and executables, as well as a notes.txt file.
Evaluation
Your evaluator will be tested on trajectories
generated from different MDPs and policies. Your
task is to ensure that it prints out a good estimate of
the true value function in each case. Performance
will be quantified based on the (unweighted)
squared distance between your estimate Est-V and
the true value function V: that is,
Error = ∑s⋲S (V(s) - Est-V(s))2.
Recall that as a part of Programming Assignment 2,
you had written code for MDP planning. It will be a
good idea for you to build a testing framework
using that code to (1) generate and record
trajectories of some fixed policy π for some MDP
M; (2) estimate the value function of π as required
CS 747: Programming Assignment 3 https://www.cse.iitb.ac.in/~shivaram/teaching/c...
3 of 5 03/10/18, 14:14
in this assignment; and (3) compare your estimate
with the true value function, which you can
compute using your own code from Programming
Assignment 2. This is exactly the scheme that we
will use for evaluating your answers.
8 marks are reserved for the performance of your
evaluator on unseen trajectories, and 2 marks for
your explanations in notes.txt. Be sure to describe
your approach and explain why you chose it over
alternative approaches.
The TAs and instructor may look at your source
code and notes to corroborate the results obtained
by your program, and may also call you to a faceto-face session to explain your code.
Deadline and Rules
Your submission is due by 11.55 p.m., Sunday,
October 7. You are advised to finish working on
your submission well in advance, keeping enough
time to test it on the sl2 machines and upload to
Moodle. Your submission will not be evaluated
(and will be given a score of zero) if it is not
received by the deadline.
Before submission, make sure that your code runs
for a variety of experimental conditions. Test your
code on the sl2 machines even while you are
developing it: do not postpone this step to the last
minute. If your code requires any special libraries
to run, it is your responsibility to get those libraries
working on the sl2 machines (go through the CSE
bug tracking system to make a request to the system
administrators). Make sure that you upload the
intended version of your code to Moodle (after
uploading, download your submission and test it on
the sl2 machines to make sure it is the correct
version). You will not be allowed to alter your code
in any way after the submission deadline. In short:
your grade will be completely determined by your
CS 747: Programming Assignment 3 https://www.cse.iitb.ac.in/~shivaram/teaching/c...
4 of 5 03/10/18, 14:14
submission on Moodle at the time of the deadline.
Play safe by having it uploaded and tested at least a
few hours in advance.
You must work alone on this assignment. Do not
share any code (whether yours or code you have
found on the Internet) with your classmates. Do not
discuss the design of your solution with anybody
else. Do not see anybody else's code or report.
CS 747: Programming Assignment 3 https://www.cse.iitb.ac.in/~shivaram/teaching/c...
5 of 5 03/10/18, 14:14

Shopping cart

US$0

Assignment 3 Estimating the value function of a policy

More products