Project 5: Web Benchmark and Spark
Overview
An important part of any Web application development is to provision our Web site for the expected
load from the users. To do this, we need to be able to measure how many concurrent user requests a
single Web server can handle, so that we can estimate how many server instances are needed to
handle the expected load. It is also important to be able to run complex data processing tasks on
a cluster of machines available to us, so that we can provide near-real-time feedback from our user
data to improve our application.
In Part A, we will focus on measuring the number of concurrent requests that a system can
handle. More precisely, we will measure how many users can get a response within a
reasonable time period.
In Part B, we will learn how to use the popular Apache Spark engine to run complex data
processing tasks on a cluster of machines in parallel.
Development Environment
The main development of Project 5 will be done using a new docker container named “spark”
created from “junghoo/cs144-spark”. Use the following command to create and start this new
container:
$ docker run -it -p4040:4040 -p8089:8089 -v
{host_shared_dir}:/home/cs144/shared --name spark junghoo/cs144-spark
Make sure to replace {host_shared_dir} with the name of the shared directory on your host. The
above command creates a docker container named spark with appropriate port forwarding and
directory sharing.
Our new container has Locust (v0.9.0), JDK (v1.8.0), Scala (v2.12.2), and Spark (v2.2.0) preinstalled. You can check their versions by the following commands.
$ locust --version
$ java -version
$ scala -version
$ spark-shell --version
As before, the default username inside the container is “cs144” with password “password”.
When you perform the tasks for Parts A and B of Project 5, you also need to run the servers you
developed in Projects 2 and 3. Make sure that your servers in the “tomcat” and “mean” containers
are still available and run well by executing the following sequence of commands inside the host:
$ docker start tomcat
$ docker start mean
$ docker exec -d mean npm start --prefix {your_blog_server_dir}
Replace {your_blog_server_dir} with the path to your Project 3 directory inside the mean
container, e.g., /home/cs144/shared/project3/blog-server. Make sure that your Tomcat
server is available at http://localhost:8888 and your node server is available at
http://localhost:3000 on your host machine.
By default, each container runs in its own isolated environment and can reach other containers only
if it knows their IP addresses. In this project, we want to make the containers reachable by their
container names, not just by IPs, so that we can send HTTP requests between containers more
easily. This can be done by creating a custom bridge network and connecting our three containers to
the bridge network. Run the following command inside the host to create a custom bridge network,
named cs144-net:
$ docker network create cs144-net
Once a bridge network is created, any running container can be connected to it through the docker
network connect command:
$ docker network connect cs144-net spark
$ docker network connect cs144-net tomcat
$ docker network connect cs144-net mean
The above sequence of commands will connect our three containers, “spark”, “tomcat”, and “mean”,
to the bridge network “cs144-net”. (Make sure that the three containers are already running before
you execute the above commands.)
Now the three containers can reach each other through their names. To verify, execute a shell in the
“spark” container and run curl commands to issue HTTP requests to other containers like the
following:
$ docker exec --user cs144 -it spark /bin/bash
cs144@e6a142ac3bf9:~$ curl http://mean:3000
cs144@e6a142ac3bf9:~$ curl http://tomcat:8080
The first docker exec command runs a shell in the spark container, so that the next commands are
executed from the container, not from the host. The next two curl commands, therefore, issue
HTTP requests from the spark container to the mean and tomcat containers. As long as the mean
and tomcat containers have started and are running, you should see correct responses from the
servers of the other two containers. Make sure that this is the case by checking the responses. From
now on, you should use http://mean:3000 and http://tomcat:8080 from within the spark container
to access the other containers.
Once you finish checking the containers and their network connectivity, you can stop them with the
following commands:
$ docker stop spark
$ docker stop mean
$ docker stop tomcat
From now on, all three containers can talk to each other through their names!
Part A: Testing Performance of Server(s)
In this part, you are required to (1) learn how to use Locust to test the performance of your server, (2)
write some customized files that represent user behavior, (3) record the results of test cases, and (4)
find the maximum number of users that the server could handle under some performance
requirements.
Learn to Use Locust
You must go over the following Locust tutorial, step-by-step, before moving on:
Quick Locust Tutorial
Before moving forward, make sure you KILL the simple server so that the requests you send later
are not answered by the fake server.
Mock Data Preparation
Before we start load testing our servers using Locust, we need to load some mock data into our
databases. We prepared two scripts, mock_data_tomcat.sh and mock_data_node.sh for this
purpose. Download the tomcat script inside the “tomcat” container and execute it like the
following:
$ wget http://oak.cs.ucla.edu/classes/cs144/project5/mock_data_tomcat.sh
$ bash ./mock_data_tomcat.sh
Make sure that MySQL and Tomcat servers are up and running inside the container before you run
the above script.
Do the same for the node script inside the “mean” container:
$ wget http://oak.cs.ucla.edu/classes/cs144/project5/mock_data_node.sh
$ bash ./mock_data_node.sh
Again, make sure that the MongoDB server is running inside the container before you run the above
script.
Now our databases have been populated with 500 fake blog posts by the user cs144.
Write Locust Files
Once we have all the test data inserted, it’s time for you to write locust files to load test the servers.
In particular, you are required to write 6 files to perform the following tests:
Locust File for Tomcat Server
1. read_tomcat.py
In this file, we are simulating the scenario where all requests from users are read intensive.
The user whose name is cs144 would randomly open one of his posts via /editor/post?
action=open&username=cs144&postid={num}, where {num} is a random postid.
Note: In this test file, use /editor/post?action=open as the name for the requests.
Also, make sure that the postid the user opens is between 1 and 500; since our user
“cs144” has only 500 posts, he will get nothing otherwise! (A minimal sketch of this file
appears after this list.)
2. write_tomcat.py
In this test file, we are simulating the scenario where the requests from users are write
intensive. The user cs144 would modify one of his posts randomly by changing the title to
“Loading Test” and the body to “***Hello World!***” via /editor/post?
action=save&username=cs144&postid=
{num}&title=Loading%20Test&body=***Hello%20World!***. Replace {num} with a
random number between 1 and 500.
Note: Use /editor/post?action=save as the name for this request. Use only the
POST method here, since the GET method is not supported.
3. mixed_tomcat.py
In this test file, we are simulating a more realistic scenario where some users are reading
posts while others are writing. In this test, 10% of users are write intensive while the
remaining 90% are read intensive.
Note: Just “cut and paste” the tasks/functions you wrote in previous files and combine
them with different weights. 10% of user tasks must come from the “write task” defined in
write_tomcat.py and the remaining 90% of user tasks must come from the “read task”
defined in read_tomcat.py.
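As promised in the notes above, here is a minimal sketch of read_tomcat.py, assuming the
Locust 0.9.0 API; the wait times are arbitrary placeholders:

import random
from locust import HttpLocust, TaskSet, task

class ReadBehavior(TaskSet):
    @task
    def open_post(self):
        # Only postids 1 through 500 exist for user cs144.
        postid = random.randint(1, 500)
        self.client.get(
            "/editor/post?action=open&username=cs144&postid=%d" % postid,
            name="/editor/post?action=open")

class WebsiteUser(HttpLocust):
    task_set = ReadBehavior
    min_wait = 1000  # placeholder think times, in milliseconds
    max_wait = 2000

For mixed_tomcat.py, note that the @task decorator accepts a weight, so decorating the write
task with @task(1) and the read task with @task(9) yields the required 10%/90% split.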
Locust File for Node Server
4. read_node.py
In this test file, we simulate behavior similar to that of read_tomcat.py, except that now
we are testing the performance of our Node.JS server. The user cs144 would randomly
open one of her posts via the /blog/:username/:postid API. Again, remember to limit the
postid to between 1 and 500 when you pick one randomly.
Note: Use /blog/cs144 as the name for this request.
5. write_node.py
In this file, we will test the server performance under write-intensive requests, as we
did in write_tomcat.py. The user cs144 randomly updates one of her posts by changing
its title to “Loading Test” and body to “***Hello World!***” through
/api/:username/:postid. Remember that you need to obtain the JWT authentication
token through the login page before calling this API. Limit the postid to between 1 and 500.
(A sketch of this file appears after this list.)
Note: Use /api/cs144 as the name for the request.
6. mixed_node.py
In this file, we combine the read-intensive tasks and the write-intensive ones in a single file,
as you would expect. The percentages remain the same: 10% write and 90% read.
Note: Just reuse the tasks/functions you wrote in previous files and combine them with
different weights.
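Here is the sketch of write_node.py promised above, again assuming the Locust 0.9.0 API. The
login payload, the HTTP method of the save request, and whether your server returns the JWT as
a cookie are assumptions about your Project 3 implementation; adjust them to match your server:

import random
from locust import HttpLocust, TaskSet, task

class WriteBehavior(TaskSet):
    def on_start(self):
        # Each simulated user logs in once; if the server sets the JWT as a
        # cookie, the underlying requests session re-sends it automatically.
        self.client.post("/login", {"username": "cs144", "password": "password"})

    @task
    def save_post(self):
        postid = random.randint(1, 500)
        self.client.post(
            "/api/cs144/%d" % postid,  # hypothetical method/payload; check your API
            json={"title": "Loading Test", "body": "***Hello World!***"},
            name="/api/cs144")

class WebsiteUser(HttpLocust):
    task_set = WriteBehavior
    min_wait = 1000
    max_wait = 2000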
Run Locust Files and Save Results
Part 1
Once you finish writing the six locust files, run Locust without the web UI using the following four
files: read_tomcat.py, write_tomcat.py, read_node.py, and write_node.py. Once they finish
running, fill in the performance.json file with your results, replacing the field values that are
currently 0. More precisely, fill in the field “Number of Users” according to the configuration
parameters you set in the command, and fill in the fields “Total Number Of Req”, “Total RPS”, and
“response time for 98% of the requests” with the numbers that you get from the final summary of
your tests.
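For example, assuming Locust 0.9.0's command-line flags, a Part 1 run might look like the
following, where the user count, hatch rate, and run time are placeholders you should choose
yourself (use http://mean:3000 as the host for the node tests):
$ locust -f read_tomcat.py --host=http://tomcat:8080 --no-web -c 100 -r 10 --run-time 1m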
Note: The filled values should be numbers, not strings. For example, if your total RPS is 160
req/s, just put the number 160, like “Total RPS”: 160, NOT like “Total RPS”: “160 req/s”. Our
grading script won’t correctly recognize any value other than a number. Do NOT change any
other part of the performance.json file.
Part 2
After filling in performance.json, you should now be familiar with the testing procedure. Let’s try
to find the maximum number of users that each server can handle under a 10%-write and 90%-read
load using the mixed_tomcat.py and mixed_node.py files. In particular, we require that the
servers must return at least 98% of responses within 400ms for all requests except login
requests. That is, each URL name group except /login?... should return 98% of its responses in
less than 400ms.
Your task is to find the maximum number of users (in units of hundreds, like 300, if the number is
higher than 100; in units of tens, like 60, if the number is higher than 10; and in units of 1 if it is
below 10) that the servers can handle under this requirement. Use summary_tomcat.txt and
summary_node.txt to save the results from the tests in which each server met this requirement
with the largest number of users. Please save only the summary in your result files by using
--only-summary.
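For instance, assuming the same Locust 0.9.0 flags as above, a qualifying Tomcat run might be
captured like the following; the user count, hatch rate, and run time are again placeholders, and
the redirect simply captures the console output that contains the summary:
$ locust -f mixed_tomcat.py --host=http://tomcat:8080 --no-web -c 300 -r 30 --run-time 2m --only-summary > summary_tomcat.txt 2>&1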
After Part A is done, you should have 6 test files (read_tomcat.py, write_tomcat.py,
mixed_tomcat.py, read_node.py, write_node.py, mixed_node.py), 1 JSON file (performance.json)
and 2 summary files (summary_tomcat.txt, summary_node.txt).
Note 1: Sometimes you might see failures caused by an error like
ConnectionError(ProtocolError('Connection aborted.', error(104, 'Connection
reset by peer')),). This is very likely an indication that the number of users is more than
your server can handle, not necessarily that something is wrong in your implementation.
Note 2: There are no right or wrong results for the performance stats in Part A as long as your test
files are correctly implemented. You just need to submit the results that you get from your
performance benchmark.
Note 3: Now that you are done with Part A, think about the performance of the two servers in each
scenario. Which one exhibited better performance, Tomcat or Node.JS? Note that our comparison
may be a bit unfair to the Node.JS server because (1) Tomcat did not “render” markdown to HTML
when a blog post is “opened” and (2) Node.JS had to perform the extra authentication verification
step for the “write” tasks. You may want to keep this difference in mind when you interpret your
results. You do not need to provide an answer to this question in your submission. This question is
to encourage you to think more deeply about what you observed from load testing.
Part B: Apache Spark
In this part, you will learn how to use the popular Apache Spark engine to perform a (potentially
heavy) computational task on a large number of machines using the Map-Reduce framework. In
particular, you will identify the most “popular” users on the Twitter network, measured by the
number of their followers. Our new “spark” container has the Spark engine preinstalled. We also
provide a snapshot of the follower-following graph of Twitter. Your job is to write a (simple) program
on Spark that returns the IDs of the users with high follower counts.
Twitter Graph File and Our Task
Download the twitter.edges file that contains a snapshot of the follower-following graph structure of
Twitter:
twitter.edges: Twitter follower-following graph data
Each line of the file represents the “following” edges from a particular Twitter user in the format
below:
user1: user2,user3,...,userk
The above line indicates that user1 is “following” user2 through userk. Note that each user in the file
is represented as a unique random integer. For example, the first line of the file:
138134695: 62902439,15224867
indicates that the user “138134695” is following two other users, 62902439 and 15224867.
Given this file, it is relatively straightforward to find the user who follows the largest number of
users: we simply need to identify the line with the largest number of user IDs after the colon.
Unfortunately, our task is more complex. We need to identify the users who are followed by
a large number of other users (more precisely, more than 1,000 other users). While our
dataset is reasonably small – it is only 21 MB in size – you can imagine that such a dataset can
potentially be huge, so we want to implement this task on the Apache Spark engine, which lets us
perform it in parallel on as many machines as are available.
Apache Spark
Writing a program that runs on a large number of machines in parallel can be a daunting task. To
make this task easier, a number of distributed software infrastructures have been developed. In
particular, as we learned in class, the Map-Reduce framework asks the programmer to provide just
the core computational logic of the given task as a set of Map and Reduce functions. Given these
core functions, the Map-Reduce framework takes care of the rest, including data distribution, parallel
execution, process monitoring, and result shuffling and collection. Apache Spark is a popular open
source software that provides a Map-Reduce style programming environment on a cluster with a
large number of distributed machines.
Once you are inside our spark container, you can run the Spark interactive shell by executing the
spark-shell command:
$ spark-shell
Spark context Web UI available at http://172.17.0.2:4040
Spark context available as 'sc' (master = local[*], app id = local-1518992867666).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
Inside the Spark shell, you can execute any Scala command using the Spark API. You can exit from
the Spark interactive shell by pressing the “Control+D” key.
Now that we know how to start the Apache Spark shell, it is time to learn how to use it by going over
one of the widely available Apache Spark tutorials on the Internet. For example, the official Quick
Start Tutorial provides a ten-minute introduction to the essential basics. The file wordCount.scala
also contains the example code that we went over in class:
val lines = sc.textFile("input.txt")
val words = lines.flatMap(line => line.split(" "))
val word1s = words.map(word => (word, 1))
val wordCounts = word1s.reduceByKey((a, b) => a + b)
wordCounts.saveAsTextFile("output")
System.exit(0)

The above code computes the frequency of each word in the text file input.txt and generates
(word, frequency) pairs as the output. When saveAsTextFile("output") is called, the program
creates a new subdirectory named output, where part-xxxxx file(s) are generated that contain
the (word, frequency) output pairs in wordCounts. Note the last line in the wordCount.scala
script: System.exit(0). Calling the system exit function ensures that once the script finishes, the
interactive shell terminates as well. You can execute this file in the Spark shell with the following
command:
$ spark-shell -i wordCount.scala
...
$ ls -l output/
total 8
-rwxr-xr-x 1 root root 0 Mar 16 03:18 _SUCCESS
-rwxr-xr-x 1 root root 2131 Mar 16 03:18 part-00000
-rwxr-xr-x 1 root root 2056 Mar 16 03:18 part-00001
$ head output/part-00000
(country,2)
(House,2)
(its,1)
(previously,1)
(countries,,1)
(have,2)
(policy,1)
(travel,,1)
(order,7)
(Trump’s,2)
Note that the provided wordCount.scala code is written to be executed in an interactive shell only.
If we want to properly run our program on multiple machines in a Spark cluster, we need to “wrap”
this code within an object with a “main” function. We also need to create proper “Spark
Configuration” and “Spark Context” in which our program will run. For example,
wordCountFull.scala shows an extended version of our wordCount program that can be properly
compiled into an executable “package”, say word-count-project.jar using a tool chain such as
sbt. Once packaged, it can be submitted to a Spark cluster for parallel execution using a command
like the following:
$ spark-submit --class edu.ucla.cs144.WordCount --master
spark://23.195.26.187:7077 --deploy-mode cluster word-count-project.jar
Since the goal of this project is to introduce you to the main programming paradigm/API of Spark,
not the nitty gritty details of the Spark packaging tool chain and job submission interface, in this
project we will assume that your code will be executed through a Spark interactive shell using
the command spark-shell -i.
While Spark supports multiple languages, Scala is the most popular (and syntactically cleanest)
language for programming on Spark, so we use Scala for this project. The basic Scala needed to
complete this project is easy to learn, and your code will be much cleaner than it would be in other
languages.
There exist many quick online tutorials on Scala, such as this one. Fortunately, this project can be
completed without using any advanced Scala constructs like class inheritance. In particular, note
that anonymous functions in Scala are similar to “arrow functions” in JavaScript and can be used
to pass the “Map-Reduce” functions.
Writing Your Code
Now that you have the basics, it is time to write code. Your code must read the text file twitter.edges
located in the current directory, parse it to obtain the Twitter follower-following graph, perform
necessary computation, and return the list of all (userid, follower_count) pairs for the users with
more than 1000 followers. The output from your code should contain many lines of (userid,
follower_count) pairs like:
(40981798,8569)
(43003845,7621)
...
The first two lines of the above output, for example, indicate that the users 40981798 and 43003845
have 8569 and 7621 followers, respectively. The output (userid, follower_count) pairs should be
saved as a (set of) text file(s) in the “output” directory using the Spark saveAsTextFile()
function. The output (userid, follower_count) pairs may appear in any order and need
not be sorted.
In writing your code, you may find the list of Spark transformation functions helpful. Also, if you
need a “hint” on parsing the provided twitter.edges file, you may find this question and answer at
StackOverflow helpful (local mirror is available here).
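If it helps to see the whole transformation chain before writing your Scala script, here is a rough
sketch of one possible approach in PySpark. This is an illustration only, not the required solution:
your submission must be a Scala script run through spark-shell -i, and the guard against empty
followee lists is an assumption about the data.

from pyspark import SparkContext

sc = SparkContext(appName="TopUsersSketch")

lines = sc.textFile("twitter.edges")
# "user1: user2,user3" -> ["user2", "user3"]: one entry per followed user
followees = (lines.flatMap(lambda line: line.split(":")[1].split(","))
                  .map(lambda u: u.strip())
                  .filter(lambda u: u != ""))
# Count how often each user appears as a followee, then keep the users
# with more than 1000 followers.
counts = followees.map(lambda u: (int(u), 1)).reduceByKey(lambda a, b: a + b)
top = counts.filter(lambda pair: pair[1] > 1000)
top.saveAsTextFile("output")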
Before we finish, we reiterate the essential requirements of your code.
Code Requirements
1. Your code should read the twitter graph from the file “twitter.edges” located in the current
directory.
2. The output from your code should be the list of (userid, follower_counts) pairs for all the users
whose follower count is larger than 1000. The output does not have to be sorted.
3. Your code should save the output in the “output” subdirectory within the current working
directory using saveAsTextFile() of Spark RDD.
4. Your code should compute the final results using Map-Reduce-style programming by applying a
series of Spark transformation functions to the input dataset.
Testing Your Code
Before submitting your code, thoroughly test your code so that it computes the correct results. To
help you ensure that your code produces the correct output, here are a few sample results from our
dataset:
User ID Follower Count
40981798 8569
3359851 3905
88323281 2315
18742444 1585
9451052 1184
302847930 1182
12925072 1002
In total, there are 177 users with more than 1000 followers.
What to Submit
For this project, you need to submit a single zip file named project5.zip that has the following
packaging structure.
project5.zip
+- read_tomcat.py, write_tomcat.py, mixed_tomcat.py
+- read_node.py, write_node.py, mixed_node.py
+- performance.json
+- summary_tomcat.txt, summary_node.txt
+- topUsers.scala
+- TEAM.txt
+- README.txt (optional)
Each file is as follows:
1. read_tomcat.py, write_tomcat.py, mixed_tomcat.py, read_node.py, write_node.py,
mixed_node.py: These are the six locust files to load test your tomcat and node servers.
2. performance.json: This is the JSON file that contains the load test results.
3. summary_tomcat.txt, summary_node.txt: These are the summary tables from the two mixed
benchmark tests of Part 2 of Part A.
4. topUsers.scala: this is the main Scala code that you wrote to compute the top Twitter users.
This code should be executable simply by typing “spark-shell -i topUsers.scala”. Please
DO NOT submit any input or output files for your code. Just submit your main Scala script.
5. TEAM.txt: This file must include the 9-digit university ID (UID) of every team member, one
UID per line. No spaces or dashes. Just one 9-digit UID per line. If you are working on your own,
include just your UID.
6. README.txt includes any comments you find worth noting, regarding your code structure, etc.
All files should be contained directly under the project5.zip (without any enclosing folders).
Testing Your Zip File
To ensure the correct packaging of your submission, we have made a grading script p5_test for
Project 5, which can be executed like:
$ ./p5_test project5.zip
(Add the appropriate path to the project5.zip if needed. You may need to use “chmod +x p5_test” if
there is a permission error.)
You MUST test your submission using the script to minimize the chance of an unexpected error
during grading. When everything runs properly, you will see an output similar to the following from
the script:
Executing your Spark code.....
Spark context Web UI available at http://10.0.2.15:4040
Spark context available as 'sc' (master = local[*], app id = local-1488946788877).
Spark session available as 'spark'.
Loading topUsers.scala...
...
(20,1010)
(99,1010)
(10,1010)
SUCCESS! We finished testing your zip file integrity.
Once your work is properly packaged as a zip file, submit your zip file via our submission page at
CCLE.
You may submit as many times as you like; however, only the latest submission will be saved, and
that is what we will use for grading your work and determining late penalties.
Grading Criteria
Overall grading breakdown is as below:
Part A: all 6 test files run without being terminated by any error (20%)
Part A: performance.json contains your real test results (10%)
Part A: the data and settings in 2 summary files meet the requirements (20%)
Part B: Submitted code runs without any error and produces output (15%)
Part B: Submitted code computes correct answer on the provided dataset (15%)
Part B: Submitted code computes correct answer on different datasets with the same format
(20%)