Starting from:

$29

CECS 535 Problem Set 1

%load_ext sql
%sql sqlite:///flights.db
Problem Set 1
Instructions / Notes:
Read these carefully

You will need to install the prettytable module to run the scripts. (eg. pip install --user prettytable)
Download flights.db from http://web.stanford.edu/class/cs145/files/flights.db and make sure it is in the same directory as this Jupyter notebook
Run the top cell above to load the database flights.db.
You may create new Jupyter notebook cells to use for e.g. testing, debugging, exploring, etc.- this is encouraged in fact!
When you see In [*]: to the left of the cell you are executing, this means that the code / query is running.
If the cell is hanging- i.e. running for too long: to restart the SQL connection, you must restart the entire python kernel
To restart the kernel using the menu bar: "Kernel >> Restart & Clear Output"), then re-execute the sql connection cell at the top
You will also need to restart the connection if you want to load a different version of the database file
Remember:
%sql [SQL] is for single line SQL queries
%%sql 
[SQL] is for multi line SQL queries
Running submit.py will run the queries and print them.
We have provided correct output from our solution in correct_output.txt.
You can diff your output with the correct output by running python sanity_check.py or python submit.py > my_output; diff my_output correct_output.txt
Your submit.py should match this output exactly. This means:
the columns should have the same names as correct_output.txt
the columns should be in the same order as correct_output.txt
Submission Instructions:
Do NOT submit your iPython notebook.
Instead, copy each query into the provided submit.py script in the corresponding variable.
Do not copy the %sql or %%sql commands, only the raw SQL.
We will autograde your queries on a database with identical schema but arbitrarily-chosen values. Therefore, do not hard-code any constants in your queries -- all of your queries should use the tables in the database.
Follow the instructions in submission_instructions.txt to upload your answers to our autograders.
Have fun!

Introduction: Travel Delays
There's nothing I dislike more than travel delays -- how about you?

In fact, I'm always scheming new ways to avoid travel delays, and I just found an amazing dataset that will help me understand some of the causes and trade-offs when traveling.

I wonder if you can use SQL to help me!

Not surprisingly... you can!

In this homework, we'll use SQL to explore airline travel delays that occurred in July 2017.

To start, let's look at the primary relation in the database we've prepared for you:

%%sql
SELECT * 
FROM flight_delays 
LIMIT 1;
Done.
year    quarter    month    day_of_month    day_of_week    fl_date    unique_carrier    airline_id    carrier    tail_num    fl_num    origin_airport_id    origin_airport_seq_id    origin_city_market_id    origin    origin_city_name    origin_state_abr    origin_state_fips    origin_state_nm    origin_wac    dest_airport_id    dest_airport_seq_id    dest_city_market_id    dest    dest_city_name    dest_state_abr    dest_state_fips    dest_state_nm    dest_wac    crs_dep_time    dep_time    dep_delay    dep_delay_new    dep_del15    dep_delay_group    dep_time_blk    taxi_out    wheels_off    wheels_on    taxi_in    crs_arr_time    arr_time    arr_delay    arr_delay_new    arr_del15    arr_delay_group    arr_time_blk    cancelled    cancellation_code    diverted    carrier_delay    weather_delay    nas_delay    security_delay    late_aircraft_delay    unnamed: 55
2017    3    7    1    6    2017-07-01    AS    19930    AS    N559AS    1    11278    1127803    30852    DCA    Washington, DC    VA    51    Virginia    38    14747    1474703    30559    SEA    Seattle, WA    WA    53    Washington    93    800    750.0    -10.0    0.0    0.0    -1.0    0800-0859    17.0    807.0    1010.0    11.0    1053    1021.0    -32.0    0.0    0.0    -2.0    1000-1059    0.0    None    0.0    None    None    None    None    None    None
Cool, there are so many columns! How many rows are there?

%%sql
SELECT COUNT(*) AS num_rows
FROM flight_delays
Done.
num_rows
509070
Wow, that's a lot of data! Good thing you don't have to answer all of my questions by hand...

You don't need to import more data into the database. However, you can find a description of each field online at https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236.

We've pre-loaded a number of additional tables that will help you decode important fields like airline_id, airport_id, and day_of_week into human-readable form for the queries below.

Please use the following cell to explore these the airlines and weekdays tables:

%%sql
Alright -- let's get started!

Part I: SQL Queries
Query 1: How long are flights delayed on average? (10 points)
Just to get a sense of the data, let's start with a simple query.

In the cell below, write a SQL query that returns the average arrival delay for the entire month of July 2017 (i.e., the whole dataset).

%%sql
Query 2: What was the worst flight delay? (10 points)
Hmm, the average doesn't look too bad! What about the worst delay?

In the cell below, write a SQL query that returns the maximum arrival delay for the entire month of July 2017 (i.e., the whole dataset).

%%sql
Query 3: What flight am I happiest I didn't take? (10 points)
Yikes! What flight was so late?

In the cell below, write a SQL query that returns the carrier (i.e., carrier), flight number, origin city name, arrival city name, and flight date for the flight with the maximum arrival delay for the entire month of July 2017. Do not hard-code the arrival delay you found above. Hint: use a subquery.

%%sql
Query 4: Which are the worst days to travel? (10 points)
Since CS145 just started, I don't have time to head to Kona anytime soon. However, I'm headed out of town for a trip next week! What day is worst for booking my flight?

In the cell below, write a SQL query that returns the average arrival delay time for each day of the week, in descending order. The schema of your relation should be of the form (weekday_name, average_delay).

Note: do not report the weekday ID. (Hint: look at the weekdays table and perform a join to obtain the weekday name.)

%%sql
Query 5: Which airlines that fly out of SFO are delayed least?
Now that I know which days to avoid, I'm curious which airline I should fly out of SFO. Since I haven't been told where I'm flying, please just compute the average for the airlines that fly from SFO.

In the cell below, write a SQL query that returns the average arrival delay time (across all flights) for each carrier that flew out of SFO at least once in July 2017 (i.e., in the current dataset), in descending order.

Note: do not report the airlines ID. (Hint: a subquery is helpful here; also, look at the airlines table and perform a join.)

%%sql
Query 6: What proportion of airlines are regularly late?
Yeesh, there are a lot of late flights! How many airlines are regularly late?

In the cell below, write a SQL query that returns the proportion of airlines (appearing in flight_delays) whose flights are on average at least 10 minutes late to arrive. Do not hard-code the total number of airlines, and make sure to use at least one HAVING clause in your SQL query.

Note: sqlite COUNT(*) returns integer types. Therefore, your query should likely contain at least one SELECT CAST (COUNT(*) AS float) or a clause like COUNT(*)*1.0.

%%sql
Query 7: How do late departures affect late arrivals?
It sure looks like my plane is likely to be delayed. I'd like to know: if my plane is delayed in taking off, how will it affect my arrival time?

The sample covariance provides a measure of the joint variability of two variables. The higher the covariance, the more the two variables behave similarly, and negative covariance indicates the variables indicate the variables tend to be inversely related. We can compute the sample covariance as:
Cov(X,Y)=1n−1∑i=1n(xi−x^)(yi−y^)
 
where  xi  denotes the  i th sample of  X ,  yi  the  i th sample of  Y , and the mean of  X  and  Y  are denoted by  x¯  and  y¯ .

In the cell below, write a single SQL query that computes the covariance between the departure delay time and the arrival delay time.

Note: we could also compute a statistic like the Pearson correlation coefficient here, which provides a normalized measure (i.e., on a scale from -1 to 1) of how strongly two variables are related. However, sqlite doesn't natively support square roots (unlike commonly-used relational databases like PostgreSQL and MySQL!), so we're asking you to compute covariance instead.

%%sql
Query 8: It was a bad week...
Which airlines had the largest absolute increase in average arrival delay in the last week of July (i.e., flights on or after July 24th) compared to the previous days (i.e. flights before July 24th)?

In the cell below, write a single SQL query that returns the airline name (not ID) with the maximum absolute increase in average arrival delay between the first 23 days of the month and days 24-31. Report both the airline name and the absolute increase.

Note: due to sqlite's handling of dates, it may be easier to query using day_of_month.

Note 2: This is probably the hardest query of the assignment; break it down into subqueries that you can run one-by-one and build up your answer subquery by subquery.

Hint: You can compute two subqueries, one to compute the average arrival delay for flights on or after July 24th, and one to compute the average arrival delay for flights before July 24th, and then join the two to calculate the increase in delay.

%%sql
Query 9: Of Hipsters and Anarchists
I'm keen to visit both Portland (PDX) and Eugene (EUG), but I can't fit both into the same trip. To maximize my frequent flier mileage, I'd like to use the same flight for each. Which airlines fly both SFO -> PDX and SFO -> EUG?

In the cell below, write a single SQL query that returns the distinct airline names (not ID, and with no duplicates) that flew both SFO -> PDX and SFO -> EUG in July 2017.

%%sql
Query 10: Decision Fatigue and Equidistance
I'm flying back to Stanford from Chicago later this month, and I can fly out of either Midway (MDW) or O'Hare (ORD) and can fly into either San Francisco (SFO), San Jose (SJC), or Oakland (OAK). If this month is like July, which leg will have the shortest arrival delay for flights leaving Chicago after 2PM local time?

In the cell below, write a single SQL query that returns the average arrival delay of flights departing either MDW or ORD after 2PM local time (crs_dep_time) and arriving at one of SFO, SJC, or OAK. Group by departure and arrival airport and return results descending by arrival delay.

Note: the crs_dep_time field is an integer formatted as hhmm (e.g. 4:15pm is 1615)

%%sql
You're done! Now submit!
Refer to the top of this notebook for submission instructions.

More products