$29
%load_ext sql
%sql sqlite:///flights.db
Problem Set 1
Instructions / Notes:
Read these carefully
You will need to install the prettytable module to run the scripts. (eg. pip install --user prettytable)
Download flights.db from http://web.stanford.edu/class/cs145/files/flights.db and make sure it is in the same directory as this Jupyter notebook
Run the top cell above to load the database flights.db.
You may create new Jupyter notebook cells to use for e.g. testing, debugging, exploring, etc.- this is encouraged in fact!
When you see In [*]: to the left of the cell you are executing, this means that the code / query is running.
If the cell is hanging- i.e. running for too long: to restart the SQL connection, you must restart the entire python kernel
To restart the kernel using the menu bar: "Kernel >> Restart & Clear Output"), then re-execute the sql connection cell at the top
You will also need to restart the connection if you want to load a different version of the database file
Remember:
%sql [SQL] is for single line SQL queries
%%sql
[SQL] is for multi line SQL queries
Running submit.py will run the queries and print them.
We have provided correct output from our solution in correct_output.txt.
You can diff your output with the correct output by running python sanity_check.py or python submit.py > my_output; diff my_output correct_output.txt
Your submit.py should match this output exactly. This means:
the columns should have the same names as correct_output.txt
the columns should be in the same order as correct_output.txt
Submission Instructions:
Do NOT submit your iPython notebook.
Instead, copy each query into the provided submit.py script in the corresponding variable.
Do not copy the %sql or %%sql commands, only the raw SQL.
We will autograde your queries on a database with identical schema but arbitrarily-chosen values. Therefore, do not hard-code any constants in your queries -- all of your queries should use the tables in the database.
Follow the instructions in submission_instructions.txt to upload your answers to our autograders.
Have fun!
Introduction: Travel Delays
There's nothing I dislike more than travel delays -- how about you?
In fact, I'm always scheming new ways to avoid travel delays, and I just found an amazing dataset that will help me understand some of the causes and trade-offs when traveling.
I wonder if you can use SQL to help me!
Not surprisingly... you can!
In this homework, we'll use SQL to explore airline travel delays that occurred in July 2017.
To start, let's look at the primary relation in the database we've prepared for you:
%%sql
SELECT *
FROM flight_delays
LIMIT 1;
Done.
year quarter month day_of_month day_of_week fl_date unique_carrier airline_id carrier tail_num fl_num origin_airport_id origin_airport_seq_id origin_city_market_id origin origin_city_name origin_state_abr origin_state_fips origin_state_nm origin_wac dest_airport_id dest_airport_seq_id dest_city_market_id dest dest_city_name dest_state_abr dest_state_fips dest_state_nm dest_wac crs_dep_time dep_time dep_delay dep_delay_new dep_del15 dep_delay_group dep_time_blk taxi_out wheels_off wheels_on taxi_in crs_arr_time arr_time arr_delay arr_delay_new arr_del15 arr_delay_group arr_time_blk cancelled cancellation_code diverted carrier_delay weather_delay nas_delay security_delay late_aircraft_delay unnamed: 55
2017 3 7 1 6 2017-07-01 AS 19930 AS N559AS 1 11278 1127803 30852 DCA Washington, DC VA 51 Virginia 38 14747 1474703 30559 SEA Seattle, WA WA 53 Washington 93 800 750.0 -10.0 0.0 0.0 -1.0 0800-0859 17.0 807.0 1010.0 11.0 1053 1021.0 -32.0 0.0 0.0 -2.0 1000-1059 0.0 None 0.0 None None None None None None
Cool, there are so many columns! How many rows are there?
%%sql
SELECT COUNT(*) AS num_rows
FROM flight_delays
Done.
num_rows
509070
Wow, that's a lot of data! Good thing you don't have to answer all of my questions by hand...
You don't need to import more data into the database. However, you can find a description of each field online at https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236.
We've pre-loaded a number of additional tables that will help you decode important fields like airline_id, airport_id, and day_of_week into human-readable form for the queries below.
Please use the following cell to explore these the airlines and weekdays tables:
%%sql
Alright -- let's get started!
Part I: SQL Queries
Query 1: How long are flights delayed on average? (10 points)
Just to get a sense of the data, let's start with a simple query.
In the cell below, write a SQL query that returns the average arrival delay for the entire month of July 2017 (i.e., the whole dataset).
%%sql
Query 2: What was the worst flight delay? (10 points)
Hmm, the average doesn't look too bad! What about the worst delay?
In the cell below, write a SQL query that returns the maximum arrival delay for the entire month of July 2017 (i.e., the whole dataset).
%%sql
Query 3: What flight am I happiest I didn't take? (10 points)
Yikes! What flight was so late?
In the cell below, write a SQL query that returns the carrier (i.e., carrier), flight number, origin city name, arrival city name, and flight date for the flight with the maximum arrival delay for the entire month of July 2017. Do not hard-code the arrival delay you found above. Hint: use a subquery.
%%sql
Query 4: Which are the worst days to travel? (10 points)
Since CS145 just started, I don't have time to head to Kona anytime soon. However, I'm headed out of town for a trip next week! What day is worst for booking my flight?
In the cell below, write a SQL query that returns the average arrival delay time for each day of the week, in descending order. The schema of your relation should be of the form (weekday_name, average_delay).
Note: do not report the weekday ID. (Hint: look at the weekdays table and perform a join to obtain the weekday name.)
%%sql
Query 5: Which airlines that fly out of SFO are delayed least?
Now that I know which days to avoid, I'm curious which airline I should fly out of SFO. Since I haven't been told where I'm flying, please just compute the average for the airlines that fly from SFO.
In the cell below, write a SQL query that returns the average arrival delay time (across all flights) for each carrier that flew out of SFO at least once in July 2017 (i.e., in the current dataset), in descending order.
Note: do not report the airlines ID. (Hint: a subquery is helpful here; also, look at the airlines table and perform a join.)
%%sql
Query 6: What proportion of airlines are regularly late?
Yeesh, there are a lot of late flights! How many airlines are regularly late?
In the cell below, write a SQL query that returns the proportion of airlines (appearing in flight_delays) whose flights are on average at least 10 minutes late to arrive. Do not hard-code the total number of airlines, and make sure to use at least one HAVING clause in your SQL query.
Note: sqlite COUNT(*) returns integer types. Therefore, your query should likely contain at least one SELECT CAST (COUNT(*) AS float) or a clause like COUNT(*)*1.0.
%%sql
Query 7: How do late departures affect late arrivals?
It sure looks like my plane is likely to be delayed. I'd like to know: if my plane is delayed in taking off, how will it affect my arrival time?
The sample covariance provides a measure of the joint variability of two variables. The higher the covariance, the more the two variables behave similarly, and negative covariance indicates the variables indicate the variables tend to be inversely related. We can compute the sample covariance as:
Cov(X,Y)=1n−1∑i=1n(xi−x^)(yi−y^)
where xi denotes the i th sample of X , yi the i th sample of Y , and the mean of X and Y are denoted by x¯ and y¯ .
In the cell below, write a single SQL query that computes the covariance between the departure delay time and the arrival delay time.
Note: we could also compute a statistic like the Pearson correlation coefficient here, which provides a normalized measure (i.e., on a scale from -1 to 1) of how strongly two variables are related. However, sqlite doesn't natively support square roots (unlike commonly-used relational databases like PostgreSQL and MySQL!), so we're asking you to compute covariance instead.
%%sql
Query 8: It was a bad week...
Which airlines had the largest absolute increase in average arrival delay in the last week of July (i.e., flights on or after July 24th) compared to the previous days (i.e. flights before July 24th)?
In the cell below, write a single SQL query that returns the airline name (not ID) with the maximum absolute increase in average arrival delay between the first 23 days of the month and days 24-31. Report both the airline name and the absolute increase.
Note: due to sqlite's handling of dates, it may be easier to query using day_of_month.
Note 2: This is probably the hardest query of the assignment; break it down into subqueries that you can run one-by-one and build up your answer subquery by subquery.
Hint: You can compute two subqueries, one to compute the average arrival delay for flights on or after July 24th, and one to compute the average arrival delay for flights before July 24th, and then join the two to calculate the increase in delay.
%%sql
Query 9: Of Hipsters and Anarchists
I'm keen to visit both Portland (PDX) and Eugene (EUG), but I can't fit both into the same trip. To maximize my frequent flier mileage, I'd like to use the same flight for each. Which airlines fly both SFO -> PDX and SFO -> EUG?
In the cell below, write a single SQL query that returns the distinct airline names (not ID, and with no duplicates) that flew both SFO -> PDX and SFO -> EUG in July 2017.
%%sql
Query 10: Decision Fatigue and Equidistance
I'm flying back to Stanford from Chicago later this month, and I can fly out of either Midway (MDW) or O'Hare (ORD) and can fly into either San Francisco (SFO), San Jose (SJC), or Oakland (OAK). If this month is like July, which leg will have the shortest arrival delay for flights leaving Chicago after 2PM local time?
In the cell below, write a single SQL query that returns the average arrival delay of flights departing either MDW or ORD after 2PM local time (crs_dep_time) and arriving at one of SFO, SJC, or OAK. Group by departure and arrival airport and return results descending by arrival delay.
Note: the crs_dep_time field is an integer formatted as hhmm (e.g. 4:15pm is 1615)
%%sql
You're done! Now submit!
Refer to the top of this notebook for submission instructions.