Homework 3 GridWorld


Problem description
GridWorld is a 2D rectangular grid of size N_rows × N_columns with an agent starting off at one grid cell, moving from cell to cell through the grid, and eventually exiting after collecting a reward. This grid environment is described as follows:
● State space: GridWorld has N_rows × N_columns distinct states. We use s to denote the state. The agent starts in the bottom-left cell (row 1, column 1, marked as a green cell). There exist one or more terminal states (blue cells) that can be located anywhere in the grid (except the bottom-left cell). There may also be walls (red cells) that the agent cannot move into. An instance of this grid is shown in the figure below.
● Actions: At every non-terminal state, the agent can either walk or run in any of the four directions (Up, Down, Left, and Right), which results in 8 possible actions: “Walk Up”, “Walk Down”, “Walk Left”, “Walk Right”, “Run Up”, “Run Down”, “Run Left”, “Run Right”. At the terminal state, the only possible action is “Exit”. We use A(s) to denote the set of all possible actions at state s.
● Transition model: GridWorld is stochastic because the actions can be unreliable. In this environment, action “Walk X” (X can be Up, Down, Left, or Right) moves the agent one cell in the X direction with probability p_walk, but with probabilities 0.5(1 − p_walk) and 0.5(1 − p_walk) moves the agent one cell at angles of 90° and −90° to the direction X, respectively.
Furthermore, action “Run X” moves the agent two cells in the X direction with probability p_run, but with probabilities 0.5(1 − p_run) and 0.5(1 − p_run) moves the agent two cells at angles of 90° and −90° to the direction X, respectively.
If moving in a particular direction causes the agent to bump into a wall, the movement fails, and the agent stays in the same cell “i,j”. We write P(s′|s, a) to denote the probability of reaching state s′ if action a is done in state s. The following examples illustrate the environment dynamics:
○ Assume that the agent chooses action “Walk Up” at “4,4” as shown in the figure below. This action moves the agent to “5,4” with probability p_walk, but with probability 0.5(1 − p_walk), it moves the agent right to “4,5”, and with probability 0.5(1 − p_walk), it moves the agent left to “4,3”.
○ Assume that the agent chooses action “Run Up” at “4,4” as shown in the figure below. This action moves the agent two cells up, but because it causes the agent to bump into a wall, the agent stays at “4,4” with probability p_run. With probability 0.5(1 − p_run), the agent moves two cells right to “4,6”. Finally, this action moves the agent two cells left with probability 0.5(1 − p_run), but because of the wall at “4,2”, the agent stays at “4,4”. Hence, as a result of this action, the agent moves to “4,6” with probability 0.5(1 − p_run) and stays at “4,4” with probability p_run + 0.5(1 − p_run).

○ Assume that the agent chooses action “Walk Right” at “4,1” as shown in the figure below. Then, the agent moves to “5,1” and “3,1”, each with probability 0.5(1 − p_walk), and stays at “4,1” with probability p_walk.
○ Assume that the agent chooses action “Run Right” at “4,1” as shown in the figure below. Then, the agent moves to “2,1” with probability 0.5(1 − p_run) and stays at “4,1” with probability p_run + 0.5(1 − p_run).
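The deviation and wall-bumping rules illustrated above can be captured in a small helper. This is a minimal sketch, not the required implementation; it assumes walls are stored as a set of 1-based (row, column) tuples, and it also treats a run as blocked when the intermediate cell is a wall, which the examples above do not unambiguously settle:

```python
# Direction vectors, and each direction's +90 / -90 degree deviations.
DELTAS = {"Up": (1, 0), "Down": (-1, 0), "Left": (0, -1), "Right": (0, 1)}
PERP = {"Up": ("Right", "Left"), "Down": ("Left", "Right"),
        "Left": ("Up", "Down"), "Right": ("Down", "Up")}

def transitions(state, steps, direction, p, n_rows, n_cols, walls):
    """Return {next_state: probability} for 'Walk X' (steps=1) or 'Run X' (steps=2).

    A move that would land on a wall or outside the grid fails and the
    agent stays put; probabilities of coinciding outcomes accumulate.
    """
    outcomes = [(direction, p),
                (PERP[direction][0], 0.5 * (1 - p)),
                (PERP[direction][1], 0.5 * (1 - p))]
    result = {}
    for d, prob in outcomes:
        dr, dc = DELTAS[d]
        r, c = state[0] + steps * dr, state[1] + steps * dc
        blocked = not (1 <= r <= n_rows and 1 <= c <= n_cols) or (r, c) in walls
        if steps == 2:
            # Assumption: a run may not jump over an intermediate wall.
            mid = (state[0] + dr, state[1] + dc)
            blocked = blocked or mid in walls
        dest = state if blocked else (r, c)
        result[dest] = result.get(dest, 0.0) + prob
    return result
```

With walls at “6,4” and “4,2” this reproduces the two “Up” examples above: “Walk Up” at “4,4” reaches “5,4”, “4,5”, and “4,3”, while “Run Up” at “4,4” stays put with probability p_run + 0.5(1 − p_run) and reaches “4,6” with probability 0.5(1 − p_run).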
● Rewards: When the agent takes action a in state s, it receives a reward, R(s, a). For all non-terminal states s:
○ R(s, Walk X) = R_walk (a constant).
○ R(s, Run X) = R_run (a constant).
Furthermore, if there are K terminal states, the reward for exiting from the k-th terminal state s is R(s, Exit) = R_terminal(k) (a constant).
The agent prefers receiving its reward as soon as possible. Therefore, it applies a discount factor, γ, to future rewards. For example, if it runs once and then exits from the k-th terminal state, it will receive a total reward of R_run + γ·R_terminal(k). If it instead walks twice and then exits from the k-th terminal state, it will receive a total reward of R_walk + γ·R_walk + γ²·R_terminal(k).
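The two totals above can be checked numerically. A quick sketch with made-up example values (R_walk = −0.1, R_run = −0.3, R_terminal(k) = 10, γ = 0.9 are illustrative only, not from the spec):

```python
# Hypothetical example values, not part of the assignment specification.
R_walk, R_run, R_terminal, gamma = -0.1, -0.3, 10.0, 0.9

# Run once, then exit: R_run + gamma * R_terminal(k)
total_run_exit = R_run + gamma * R_terminal            # approx. 8.7

# Walk twice, then exit: R_walk + gamma * R_walk + gamma^2 * R_terminal(k)
total_walk_walk_exit = R_walk + gamma * R_walk + gamma ** 2 * R_terminal  # approx. 7.91

print(total_run_exit)
print(total_walk_walk_exit)
```

With these numbers, running once is preferable: the larger per-step cost of running is outweighed by reaching the exit reward one step sooner.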
Your Task
In this assignment, you need to find the action that the agent should take at each state to maximize its expected reward. You can use any algorithm for this homework.
Implementation Notes
1. Ties between the possible actions are broken based on the following preference order: “Walk Up” > “Walk Down” > “Walk Left” > “Walk Right” > “Run Up” > “Run Down” > “Run Left” > “Run Right”.
2. The grid will contain at most 1,000,000 cells.
3. As the wall states are never reached, it does not matter what the optimal actions are at these states. However, for consistency, you should print “None” as the optimal action for a wall state.
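Since any algorithm is allowed, one standard choice is value iteration. The following is a minimal sketch under assumed, hypothetical interfaces: P[s][a] is a dict mapping next states to probabilities, R(s, a) gives the immediate reward, and actions(s) lists the legal actions for s already sorted in the tie-breaking preference order from note 1:

```python
def value_iteration(states, actions, P, R, gamma, eps=1e-6):
    """Value iteration over an explicit transition model.

    P[s][a] is a dict {next_state: probability}; R(s, a) is the
    immediate reward; actions(s) yields legal actions in the
    tie-breaking preference order (all hypothetical interfaces).
    """
    V = dict((s, 0.0) for s in states)
    while True:
        delta = 0.0
        for s in states:
            best = max(R(s, a) + gamma * sum(p * V[t] for t, p in P[s][a].items())
                       for a in actions(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place (Gauss-Seidel style) update
        if delta < eps:
            break
    policy = {}
    for s in states:
        qs = [(a, R(s, a) + gamma * sum(p * V[t] for t, p in P[s][a].items()))
              for a in actions(s)]
        best_q = max(q for _, q in qs)
        # The first action (in preference order) reaching best_q wins ties;
        # a real solution would compare Q-values with a small tolerance.
        policy[s] = next(a for a, q in qs if q == best_q)
    return V, policy
```

This only illustrates the core loop; with γ close to 1 (which the input format permits) and the 1,000,000-cell limit, a real solution needs a careful convergence criterion and an efficient state representation.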
File Formats
Input format:
<GRID SIZE> includes two positive integers separated by a comma “,” where the first one is the number of rows (N_rows) of the grid, and the second one is the number of columns (N_columns) of the grid.
<WALL CELLS NUMBER> is a number (greater than or equal to zero) that specifies the number of wall cells in the grid.
<WALL CELLS POSITION> contains <WALL CELLS NUMBER> lines where each line includes 2 values separated by a comma “,”. The two values in each line indicate the row and column numbers of the grid where a wall is located.
<TERMINAL STATES NUMBER> is a number (greater than or equal to one) that specifies the number of terminal states in the grid.
<TERMINAL STATES POSITION AND REWARDS> contains <TERMINAL STATES NUMBER> lines where each line includes 3 values separated by a comma “,”. The first two values in the k-th line indicate the row and column numbers of the grid where the k-th terminal state is located, and the third value (a float) denotes R_terminal(k).
<TRANSITION MODEL> contains two probability values separated by a comma “,”, where the first value is p_walk and the second value is p_run.
<REWARDS> contains two floating-point numbers separated by a comma “,” where the first value is R_walk and the second value is R_run.
<DISCOUNT FACTOR> is a number in the interval [0,1].
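The fields above appear in input.txt in the order listed, so they can be parsed line by line. A minimal parsing sketch; the dictionary keys are my own illustrative names, not part of the spec:

```python
def parse_input(text):
    """Parse the input.txt layout described above into a dict."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n_rows, n_cols = map(int, lines[0].split(","))          # <GRID SIZE>
    n_walls = int(lines[1])                                 # <WALL CELLS NUMBER>
    walls = set(tuple(map(int, lines[2 + i].split(",")))    # <WALL CELLS POSITION>
                for i in range(n_walls))
    k = 2 + n_walls
    n_terminals = int(lines[k])                             # <TERMINAL STATES NUMBER>
    terminals = {}                                          # position -> R_terminal(k)
    for i in range(n_terminals):
        r, c, reward = lines[k + 1 + i].split(",")
        terminals[(int(r), int(c))] = float(reward)
    k += 1 + n_terminals
    p_walk, p_run = map(float, lines[k].split(","))         # <TRANSITION MODEL>
    r_walk, r_run = map(float, lines[k + 1].split(","))     # <REWARDS>
    gamma = float(lines[k + 2])                             # <DISCOUNT FACTOR>
    return {"size": (n_rows, n_cols), "walls": walls, "terminals": terminals,
            "p_walk": p_walk, "p_run": p_run,
            "r_walk": r_walk, "r_run": r_run, "gamma": gamma}
```

For example, a 3×4 grid with one wall at “2,2” and one terminal state at “3,4” with reward 10.0 would be the eight lines "3,4", "1", "2,2", "1", "3,4,10.0", "0.8,0.7", "-0.1,-0.3", "0.9".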
Output format:
<OPTIMAL ACTION GRID> contains N_rows lines where each line includes N_columns strings separated by a comma “,”. The j-th string at the i-th line should indicate the optimal action (one of “Walk Up”, “Walk Down”, “Walk Left”, “Walk Right”, “Run Up”, “Run Down”, “Run Left”, or “Run Right” for non-terminal states; “Exit” for terminal states; or “None” for wall states) at state “N_rows − i + 1, j”.
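Note the row flip implied above: the i-th output line shows grid row N_rows − i + 1, so the top row of the grid is printed first and the starting row (row 1) last. A minimal formatting sketch, assuming a hypothetical policy dict mapping (row, column) to action strings:

```python
def format_output(policy, n_rows, n_cols):
    """Render the policy grid with grid row N_rows on the first line."""
    out_lines = []
    for i in range(1, n_rows + 1):
        row = n_rows - i + 1  # output line i corresponds to grid row N_rows - i + 1
        out_lines.append(",".join(policy[(row, col)]
                                  for col in range(1, n_cols + 1)))
    return "\n".join(out_lines)
```

Writing the result with a plain `open("output.txt", "w").write(...)` yields Unix line endings on Vocareum; on Windows, opening the file in binary mode (or newline-aware mode) avoids "\r\n" sneaking in.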
Grading
Your homework submission will be tested as follows: Your program should take no command-line arguments. It should read a text file called “input.txt” in the current directory that contains a problem definition. It should write a file “output.txt” with your solution. The formats for files “input.txt” and “output.txt” are specified above. The end-of-line convention is Unix (because Vocareum is a Unix system).
During grading, 50 test cases will be used. If your homework submission passes all 50 test cases, you receive 100 points. For each test case that fails, you will lose 2 points.
Note that if your code does not compile, or somehow fails to load and parse input.txt, or writes an incorrectly formatted output.txt, or no output.txt at all, or OuTpUt.TxT, you will get zero points.
Sample Test Cases You are provided with 3 sample test cases and outputs. Please find them in the Homework 3 folder.
Homework Rules
1. You must use Python 2.7 to implement your homework assignment. You are allowed to use standard libraries only; you have to implement any other functions or methods by yourself.
2. You need to create a file named “hw3cs561s2018.py”. When you submit the homework on labs.vocareum.com, the following command will be executed: python hw3cs561s2018.py
3. Your code must create a file named “output.txt” and print its output there. For each test case, the grading script will put an “input.txt” file in your work folder, run your program (which reads “input.txt”), and check the “output.txt” file generated by your code. The grading script will replace the files automatically, so you do NOT need to do anything for that part.
4. Homework should be submitted through Vocareum. Homework submitted through email, in person during office hours, or through any channel other than Vocareum will not be accepted. Please only upload your code to the “/work” directory. Don’t create any subfolders or upload any other files. Please refer to http://help.vocareum.com/article/30-getting-started-students to get started with Vocareum.
5. Your program should handle all test cases within a reasonable time. A maximum runtime of 3 minutes per test case will be enforced on Vocareum.
6. You are strongly recommended to submit at least two hours ahead of the deadline, as the submission system may be slow around the deadline and possibly cause a late submission.