$30
CS 747: Programming Assignment 1
In this assignment, you will implement and compare different
algorithms for sampling the arms of a stochastic multi-armed
bandit. Each arm provides i.i.d. rewards from a Bernoulli
distribution with mean in (0, 1). The objective is to minimise the
regret. The algorithms you will implement are epsilon-greedy
exploration, UCB, KL-UCB, Thompson Sampling, and some
variation of Thompson Sampling. This assignment will
essentially put into practice the sampling algorithms discussed in
class.
Data and Scripts
This directory has (1) a subdirectory called instances containing
three bandit instances, (2) a submission subdirectory in which you
will place all your submission-related material. (3) a script
check.sh inside submission for verifying the input-output
behaviour of your program, (4) a script verifyOutput.py inside
submission for verifying that your submission data is printed in
correct format and reproducible, and (5) a file outputFormat.txt
to illustrate the format of output your code is expected to
generate.
Bandit instances are encoded as text files with the mean rewards
of the arms provided one to a line. Hence the number of lines
gives the number of arms. The instances provided have 2, 5, and
25 arms.
The program you create will have to take in several commandline arguments. By running the checking script provided, you can
make sure that your program consumes the input parameters
correctly and produces the desired output.
You will have to perform multiple random executions of different
algorithms on the given instances and place all your data in a text
file. The example provided shows you the expected format.
All your submission materials should go into the submission
folder, which you will compress and submit.
Tasks
21/11/2020 CS 747: Programming Assignment 1
https://www.cse.iitb.ac.in/~shivaram/teaching/cs747-a2020/pa-1/programming-assignment-1.html 2/6
In this assignment, you should do the following tasks.
T1) Implement the sampling algorithms: (1) epsilon-greedy, (2)
UCB, (3) KL-UCB, and (4) Thompson Sampling. (This task is
straightforward, as discussed in class.)
T2) This task requires a little thought. In this task, we give you
some knowledge of the true means of the arms. Can you do better
than the regular sampling algorithms? More precisely, we give
you the true means of the arms without revealing their arm
numbers (say we provide a random permutation of the true
means). If you think that you can do better given this knowledge,
implement an algorithm based on the principles of Thompson
sampling. The idea is that using the extra knowledge, your
algorithm must perform better than regular Thompson sampling.
Let us call this new algorithm as (5) thompson-sampling-withhint. It is a good idea to to avoid having tunable parameters in
your algorithm. If you use such parameters, you will have to run
with the same static values for all bandit instances (cannot tune
based on the instance).
T3) This is fairly a simple experiment. You need to play with
different values of epsilon for epsilon-greedy. Specifically, is it
possible to give three epsilon values ε1 < ε2 < ε3 for each
instance such the regret of ε2 is less compared to the other two? If
possible, give those three ε values for each instance.
T4) Generate data by running the code written in the above tasks,
and compile graphs and a report based on the data.
These tasks are elaborated in the following subsections.
T1 & T2: Coding
This time we are not giving you the liberty of writing code in any
programming language of your choice. You must code in Python
3.8.2 and not use any special libraries except Numpy v1.19 (to
work with vectors and matrices) and Matplotlib (for generating
plots).
You will have to prepare a source file called bandit.py that
performs the following functions. You can decide for yourself
how to modularise the code and name the internal functions.
What we shall insist upon is the input-output behaviour of
bandit.py, which we describe below.
bandit.py must accept the following command line parameters.
--instance in, where in is a path to the instance file.
--algorithm al, where al is one of epsilon-greedy, ucb,
kl-ucb, thompson-sampling, and thompson-sampling-withhint.
--randomSeed rs, where rs is a non-negative integer.
--epsilon ep, where ep is a number in [0, 1].
21/11/2020 CS 747: Programming Assignment 1
https://www.cse.iitb.ac.in/~shivaram/teaching/cs747-a2020/pa-1/programming-assignment-1.html 3/6
--horizon hz, where hz is a non-negative integer.
Your first job is to simulate a multi-armed bandit. You must read
in the bandit instance and have a function to generate a random 0-
1 reward with the corresponding probability when a particular
arm is pulled. A single random seed will be passed to your
program; you must seed the random number generator in your
simulation with this seed. If any of your algorithms are
randomised, they must also use the same seed for initialisation.
Given a seed and keeping other input parameters fixed, your
entire experiment must be deterministic: it should execute the
same way and produce the same result. Of course, the execution
will be different for different random seeds; the point being made
is that of repeatability for a given seed. You should be able to
implement this property by initialising all the random number
generators in your program based on the seed provided as input:
you should not leave them unseeded or use strategies such as
seeding based on system time. Make sure you understand this
requirement; if the behaviour of your code does not get fixed by
the input random seed (keeping other input parameters fixed), you
will not receive any marks for the assignment.
Having set up the code to pull arms and generate rewards, you
must implement the following sampling algorithms: (1) epsilongreedy, (2) UCB, (3) KL-UCB, (4) Thompson Sampling, and (5)
Thompson Sampling with Hint. You are free to make assumptions
on unspecified aspects such as how the first few pulls get made,
how ties get broken, how any algorithm-specific parameters are
set, and so on. But you must list all such assumptions in your
report. The only external parameter to the given set of algorithms
is epsilon for epsilon-greedy sampling, which will be passed from
the command line. Recall that on every round, an algorithm can
only use the sequence of pulls and rewards up to that round (or
statistics derived from the sequence) to decide which arm to pull.
Specifically, it is illegal for an algorithm to have access to the
bandit instance itself (although bandit.py has such access.
To give the permutation of the true means for thompson-samplingwith-hint, sort the true means and pass it as an argument to
thompson-sampling-with-hint. That is, use a line of code similar
to the following to generate hint. hint_ls =
numpy.sort(true_mean_ls)
Passed an instance, a random seed, an algorithm, epsilon, and a
horizon, your code must run the algorithm on the instance for
"horizon" number of pulls and note down the cumulative reward
REW. Subtracting REW from the maximum expected cumulative
reward possible (the product of the maximum mean and the
horizon) will give you REG, the cumulative regret for the
particular run. Note that this number can be negative (and might
especially turn up so on small horizons—why?). When the
algorithm terminates, bandit.py should output a single line with
six entries, separated by commas and terminated with a newline
('\n') character. The line must be in this format; outputFormat.txt
21/11/2020 CS 747: Programming Assignment 1
https://www.cse.iitb.ac.in/~shivaram/teaching/cs747-a2020/pa-1/programming-assignment-1.html 4/6
contains a few such lines (in which REG is set to arbitrary values
just for illustration).
instance, algorithm, random seed, epsilon, horizon, REG
We will run your code on a subset of input parameters and
validate the output with an automatic script. You will not receive
any marks for the assignment if your code does not produce
output in the format specified above.
Once you have finished coding bandit.py, run check.sh to make
sure that you correctly read in all the command line parameters,
and print the output as we described above. While testing your
code, we will use a different version of check.sh—with different
parameters—and call it inside your submission directory.
Note that epsilon only needs to be used by bandit.py if the
algorithm passed is epsilon-greedy; for other algorithms, it is a
dummy parameter. Your output must still contain epsilon (either
the value passed to it or any other value) to retain the six-column
format.
T3: Experiments on epsilon-greedy
For the experiments in T3, take the average REG of 50 random
seeds (0 through 49) for the horizon of 102400.
T4: Output Data, Plots, Report
Having written bandit.py, now it's time to generate some data
and compare their performances. Generate data for the code
written in T1 and T2 separately.
To generate data for T1, run bandit.py for every combination of
instance from "../instances/i-1.txt"; "../instances/i2.txt"; "../instances/i-3.txt",
algorithm from epsilon-greedy with epsilon set to 0.02;
ucb, kl-ucb, thompson-sampling,
horizon from 100; 400; 1600; 6400; 25600; 102400, and
random seed from 0; 1; ...; 49.
To generate data for T2, run bandit.py for every combination of
instance from "../instances/i-1.txt"; "../instances/i2.txt"; "../instances/i-3.txt",
algorithm from thompson-sampling, thompson-samplingwith-hint
horizon from 100; 400; 1600; 6400; 25600; 102400, and
random seed from 0; 1; ...; 49.
It is best that you write your own wrapper script for generating
the output for all these input configurations. Place all the output
lines of T1 in a file named outputDataT1.txt and similarly place
21/11/2020 CS 747: Programming Assignment 1
https://www.cse.iitb.ac.in/~shivaram/teaching/cs747-a2020/pa-1/programming-assignment-1.html 5/6
all the output lines of T2 in a file named outputDataT2.txt.
Notice that the files outputDataT1.txt and outputDataT2.txt
must have exactly 3 × 4 × 6 × 50 = 3600 and 3 × 2 × 6 × 50 =
1800 lines respectively. It will take you at least 5 to 10 hours
(depending on your system configuration) to generate data
(especially for longer horizons), and so do not leave this task to
the last minute. Since data for shorter horizons will anyway be
generated as a part of the longer-horizon experiments, you might
be able to save some time by recording intermediate regret
values. However, your submitted bandit.py file must still only
print a single line corresponding to the horizon passed to it.
You will generate six plots: two for each instance (T1 & T2
separately). The plot will have horizon on the x axis (use a log
scale) and regret on y axis. The plot for T1 will contain 4 lines:
one for each algorithm and the plot for T2 will contain two lines:
thompson-sampling and thompson-sampling-with-hint. Each point
will give the average regret from the 50 random runs at the
particular horizon for the algorithm. Make sure you provide a
clear key so the plot is easy to follow.
Include all six graphs in a file called report.pdf, which should
also state any assumptions in your implementation and provide
your interpretations of the results. Also briefly explain your
algorithm for thompson-sampling-with-hint. Feel free to put
down any observations that struck you while working on this
assignment. Do not leave your graphs as separate files: they must
be embedded in report.pdf.
Submission
Place these items in the submission directory.
bandit.py and all the code that it needs to run.
outputDataT1.txt
outputDataT2.txt
report.pdf
references.txt (see the section on Academic Honesty on
the course web page)
Compress the directory into submission.tar.gz and upload it on
Moodle under Programming Assignment 1.
Evaluation
We will evaluate you based on your report, and also run your
code to validate the results reported. If your code does not run on
cs747 docker container or your report is absent/incomplete, you
will not receive any marks for this assignment.
21/11/2020 CS 747: Programming Assignment 1
https://www.cse.iitb.ac.in/~shivaram/teaching/cs747-a2020/pa-1/programming-assignment-1.html 6/6
5 marks reserved for correctness of your implementation in T1
and, the report containing the plots and interpretation of the
results of the algorithms in T1. (1 mark each for epsilon-greedy
and UCB, and 1 1/2 each for KL-UCB and Thompson Sampling.
Please note: Absence of the plots & implementation details of the
algorithm will lead to zero marks for that particular algorithm.)
4 marks for T2. We will evaluate based on its performance
against regular Thompson Sampling, and the report which
contains the plots and brief description of your algorithm and
your observations.
1 mark for T3. Give your response in report.pdf . No need to
submit any graphs.
The TAs and instructor may look at your source code and notes to
corroborate the results obtained by your agent, and may also call
you to a face-to-face session to explain your code.
Deadline and Rules
Your submission is due by 11.55 p.m., Friday, September 25.
Finish working on your submission well in advance, keeping
enough time to generate your data, compile the results, and
upload to Moodle.
Your submission will not be evaluated (and will be given a score
of zero) if it is not uploaded to Moodle by the deadline. Do not
send your code to the instructor or TAs through any other
channel. Requests to evaluate late submissions will not be
entertained.
Your submission will receive a score of zero if your code does not
execute on cs747 docker container. To make sure you have
uploaded the right version, download it and check after
submitting (but before the deadline, so you can handle any
contingencies before the deadline lapses).
You are expected to comply with the rules laid out in the
"Academic Honesty" section on the course web page, failing
which you are liable to be reported for academic malpractice.