Homework 2: Food Safety (50 Pts)
Cleaning and Exploring Data with Pandas
This Assignment
In this homework, we will investigate restaurant food safety scores for restaurants in San Francisco. The scores and violation information have been made available by the San Francisco Department of Public Health. The main goal for this assignment is to walk through the process of Data Cleaning and EDA.

As we clean and explore these data, you will gain practice with:

Reading simple csv files and using Pandas
Working with data at different levels of granularity
Identifying the type of data collected, missing values, anomalies, etc.
Exploring characteristics and distributions of individual variables
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.style.use('fivethirtyeight')

import zipfile
from pathlib import Path
import os # Used to interact with the file system
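The zipfile and os imports above hint that the raw data may ship as an archive. A minimal sketch for extracting it, assuming a hypothetical data.zip in the working directory (the archive name is not given in the assignment):

# Hypothetical: extract data.zip only if the data/ folder is not already present
if not os.path.exists('data'):
    with zipfile.ZipFile('data.zip') as zf:
        zf.extractall('.')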
Importing and Verifying Data
There are several tables in the data folder. Let's attempt to load bus.csv, ins2vio.csv, ins.csv, and vio.csv into pandas dataframes with the following names: bus, ins2vio, ins, and vio respectively.

Note: Because of character encoding issues, one of the files (bus) will require the additional argument encoding='ISO-8859-1' when calling pd.read_csv.

# path to directory containing data
dsDir = Path('data')

bus = pd.read_csv(dsDir/'bus.csv', encoding='ISO-8859-1')
ins2vio = pd.read_csv(dsDir/'ins2vio.csv')
ins = pd.read_csv(dsDir/'ins.csv')
vio = pd.read_csv(dsDir/'vio.csv')
Now that you've read in the files, let's try some pd.DataFrame methods (docs). Use the DataFrame.head method to show the top few lines of the bus, ins, and vio dataframes. To show multiple return outputs in one single cell, you can use display(). Currently, running the cell below will display the first few lines of the bus dataframe.

bus.head()
business id column    name    address    city    state    postal_code    latitude    longitude    phone_number
0    1000    HEUNG YUEN RESTAURANT    3279 22nd St    San Francisco    CA    94110    37.755282    -122.420493    -9999
1    100010    ILLY CAFFE SF_PIER 39    PIER 39 K-106-B    San Francisco    CA    94133    -9999.000000    -9999.000000    14154827284
2    100017    AMICI'S EAST COAST PIZZERIA    475 06th St    San Francisco    CA    94103    -9999.000000    -9999.000000    14155279839
3    100026    LOCAL CATERING    1566 CARROLL AVE    San Francisco    CA    94124    -9999.000000    -9999.000000    14155860315
4    100030    OUI OUI! MACARON    2200 JERROLD AVE STE C    San Francisco    CA    94124    -9999.000000    -9999.000000    14159702675
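As the prompt suggests, display() lets us show several dataframes from a single cell. A minimal sketch (output omitted here):

from IPython.display import display

# Show the first few rows of each dataframe in one cell
display(bus.head())
display(ins.head())
display(vio.head())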
The DataFrame.describe method can also be handy for computing summaries of numeric columns of our dataframes. Try it out with each of our 4 dataframes. Below, we have used the method to give a summary of the bus dataframe.

bus.describe()
business id column    latitude    longitude    phone_number
count    6253.000000    6253.000000    6253.000000    6.253000e+03
mean    60448.948984    -5575.337966    -5645.817699    4.701819e+09
std    36480.132445    4983.390142    4903.993683    6.667508e+09
min    19.000000    -9999.000000    -9999.000000    -9.999000e+03
25%    18399.000000    -9999.000000    -9999.000000    -9.999000e+03
50%    75685.000000    -9999.000000    -9999.000000    -9.999000e+03
75%    90886.000000    37.776494    -122.421553    1.415533e+10
max    102705.000000    37.824494    0.000000    1.415988e+10
Now, we perform some sanity checks for you to verify that the data was loaded with the correct structure. Run the following cells to load some basic utilities (you do not need to change these at all):

First, we check the basic structure of the data frames you created:

assert all(bus.columns == ['business id column', 'name', 'address', 'city', 'state', 'postal_code',
                           'latitude', 'longitude', 'phone_number'])
assert 6250 <= len(bus) <= 6260

assert all(ins.columns == ['iid', 'date', 'score', 'type'])
assert 26660 <= len(ins) <= 26670

assert all(vio.columns == ['description', 'risk_category', 'vid'])
assert 60 <= len(vio) <= 65

assert all(ins2vio.columns == ['iid', 'vid'])
assert 40210 <= len(ins2vio) <= 40220
Next we'll check that the statistics match what we expect. The following are hard-coded statistical summaries of the correct data.

bus_summary = pd.DataFrame(**{'columns': ['business id column', 'latitude', 'longitude'],
 'data': {'business id column': {'50%': 75685.0, 'max': 102705.0, 'min': 19.0},
  'latitude': {'50%': -9999.0, 'max': 37.824494, 'min': -9999.0},
  'longitude': {'50%': -9999.0,
   'max': 0.0,
   'min': -9999.0}},
 'index': ['min', '50%', 'max']})

ins_summary = pd.DataFrame(**{'columns': ['score'],
 'data': {'score': {'50%': 76.0, 'max': 100.0, 'min': -1.0}},
 'index': ['min', '50%', 'max']})

vio_summary = pd.DataFrame(**{'columns': ['vid'],
 'data': {'vid': {'50%': 103135.0, 'max': 103177.0, 'min': 103102.0}},
 'index': ['min', '50%', 'max']})

from IPython.display import display

print('What we expect from your Businesses dataframe:')
display(bus_summary)
print('What we expect from your Inspections dataframe:')
display(ins_summary)
print('What we expect from your Violations dataframe:')
display(vio_summary)
What we expect from your Businesses dataframe:
business id column    latitude    longitude
min    19.0    -9999.000000    -9999.0
50%    75685.0    -9999.000000    -9999.0
max    102705.0    37.824494    0.0
What we expect from your Inspections dataframe:
score
min    -1.0
50%    76.0
max    100.0
What we expect from your Violations dataframe:
vid
min    103102.0
50%    103135.0
max    103177.0
The code below defines a testing function that we'll use to verify that your data has the same statistics as what we expect. Run these cells to define the function. The df_allclose function has this name because we are verifying that all of the statistics for your dataframe are close to the expected values. Why not df_allequal? It's a bad idea in almost all cases to compare two floating point values like 37.780435, as rounding error can cause spurious failures.

"""Run this cell to load this utility comparison function that we will use in various
tests below 

Do not modify the function in any way.
"""


def df_allclose(actual, desired, columns=None, rtol=5e-2):
    """Compare selected columns of two dataframes on a few summary statistics.
    
    Compute the min, median and max of the two dataframes on the given columns, and compare
    that they match numerically to the given relative tolerance.
    
    If they don't match, an AssertionError is raised (by `numpy.testing`).
    """    
    # summary statistics to compare on
    stats = ['min', '50%', 'max']
    
    # For the desired values, we can provide a full DF with the same structure as
    # the actual data, or pre-computed summary statistics.
    # We assume a pre-computed summary was provided if columns is None. In that case, 
    # `desired` *must* have the same structure as the actual's summary
    if columns is None:
        des = desired
        columns = desired.columns
    else:
        des = desired[columns].describe().loc[stats]

    # Extract summary stats from actual DF
    act = actual[columns].describe().loc[stats]

    return np.allclose(act, des, rtol)
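A minimal usage sketch of df_allclose with the hard-coded summaries above; these checks should pass if the data loaded correctly (assuming, as seems likely, that the hidden tests invoke it the same way):

# Compare each dataframe's min/median/max against the expected summaries
assert df_allclose(bus, bus_summary)
assert df_allclose(ins, ins_summary)
assert df_allclose(vio, vio_summary)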








Question 1a: Identifying Issues with the Data
Use the head command on your three files again. This time, describe at least one potential problem with the data you see. Consider issues with missing values and bad data.

We found that many records have missing latitude and longitude values, which are encoded with the sentinel -9999 rather than left blank. If these sentinel values are treated as real coordinates (for example, when computing summary statistics such as the mean), the results will be wrong.
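A quick sketch to quantify how widespread the sentinel value is (the column choice is based on the head() output above):

# Count how many rows use the -9999 sentinel in each coordinate column
(bus[['latitude', 'longitude']] == -9999).sum()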

We will explore each file in turn, including determining its granularity and primary keys and exploring many of the variables individually. Let's begin with the businesses file, which has been read into the bus dataframe.









Question 1b: Examining the Business Data File
From its name alone, we expect the bus.csv file to contain information about the restaurants. Let's investigate the granularity of this dataset.

bus.head()
business id column    name    address    city    state    postal_code    latitude    longitude    phone_number
0    1000    HEUNG YUEN RESTAURANT    3279 22nd St    San Francisco    CA    94110    37.755282    -122.420493    -9999
1    100010    ILLY CAFFE SF_PIER 39    PIER 39 K-106-B    San Francisco    CA    94133    -9999.000000    -9999.000000    14154827284
2    100017    AMICI'S EAST COAST PIZZERIA    475 06th St    San Francisco    CA    94103    -9999.000000    -9999.000000    14155279839
3    100026    LOCAL CATERING    1566 CARROLL AVE    San Francisco    CA    94124    -9999.000000    -9999.000000    14155860315
4    100030    OUI OUI! MACARON    2200 JERROLD AVE STE C    San Francisco    CA    94124    -9999.000000    -9999.000000    14159702675
The bus dataframe contains a column called business id column, which probably corresponds to a unique business id. For simplicity, we will first rename that column to bid.

bus = bus.rename(columns={"business id column": "bid"})
Examining the entries in bus, is the bid unique for each record (i.e. each row of data)? Your code should compute the answer, i.e. don't just hard code True or False.

Hint: use value_counts() or unique() to determine if the bid series has any duplicates.

is_bid_unique = len(bus["bid"].value_counts())==len(bus)
is_bid_unique
True




Question 1c
We will now work with some important fields in bus. In the two cells below create the following two numpy arrays:

Assign top_names to the top 5 most frequently used business names, from most frequent to least frequent.
Assign top_addresses to the top 5 addresses where businesses are located, from most popular to least popular.
Hint: you may find value_counts() helpful.

Step 1

top_names = bus["name"].value_counts().head(5).index.to_numpy()
top_addresses = bus["address"].value_counts().head(5).index.to_numpy()
top_names, top_addresses
(array(["Peet's Coffee & Tea", 'Starbucks Coffee', "McDonald's",
        'Jamba Juice', 'STARBUCKS'], dtype=object),
 array(['Off The Grid', '428 11th St', '2948 Folsom St', '3251 20th Ave',
        'Pier 41'], dtype=object))




Question 1d
Based on the above exploration, answer each of the following questions about bus by assigning your answers to the corresponding variables

What does each record represent?
What is the minimal primary key?
# What does each record represent?  Valid answers are:
#    "One location of a restaurant."
#    "A chain of restaurants."
#    "A city block."
q1d_part1 = "One location of a restaurant."

# What is the minimal primary key? Valid answers are:
#    "bid"
#    "bid, name"
#    "bid, name, address"
q1d_part2 = "bid"
2: Cleaning the Business Data Postal Codes
The business data contains postal code information that we can use to aggregate the ratings over regions of the city. Let's examine and clean the postal code field. The postal code (sometimes also called a ZIP code) partitions the city into regions:

ZIP Code Map





Question 2a
How many restaurants are in each ZIP code?

In the cell below, create a series where the index is the postal code and the value is the number of records with that postal code in descending order of count. You may need to use groupby(), size(), or value_counts(). Do you notice any odd/invalid zip codes?

zip_counts = bus.value_counts('postal_code').sort_values(ascending=False)
print(zip_counts.to_string())
# We can see missing values encoded as -9999, ZIP+4 codes such as 94105-1420, and invalid entries such as "CA" and "Ca"
postal_code
94103         562
94110         555
94102         456
94107         408
94133         398
94109         382
94111         259
94122         255
94105         249
94118         231
94115         230
94108         229
94124         218
94114         200
-9999         194
94112         192
94117         189
94123         177
94121         157
94104         142
94132         132
94116          97
94158          90
94134          82
94127          67
94131          49
94130           8
94143           5
94188           2
94013           2
94301           2
CA              2
94101           2
94129           1
94124-1917      1
92672           1
94123-3106      1
94122-1909      1
94014           1
94105-1420      1
00000           1
94080           1
94117-3504      1
941             1
94102-5917      1
941033148       1
941102019       1
94120           1
64110           1
94105-2907      1
94901           1
95117           1
95112           1
95109           1
95105           1
94621           1
95122           1
94544           1
94602           1
95132           1
95133           1
94518           1
Ca              1
Question 2b
Answer the question about the postal_code column in the bus dataframe.

What Python data type is used to represent a ZIP code?
Note: ZIP codes and postal codes are the same thing.

Please write your answers in the variables below:

# What Python data type is used to represent a ZIP code? 
#    "str"
#    "int"
#    "bool"
#    "float"
q2b = "str"




Question 2c
In question 2a we noticed a large number of potentially invalid ZIP codes (e.g., "Ca"). These are likely due to data entry errors. To get a better understanding of the potential errors in the zip codes we will:

Import a list of valid San Francisco ZIP codes by using pd.read_json to load the file data/sf_zipcodes.json and extract a series of type str containing the valid ZIP codes. Hint: set dtype when invoking read_json.
Construct a DataFrame containing only the businesses which DO NOT have valid ZIP codes. You will probably want to use the Series.isin function.
Step 1

valid_zips = pd.read_json("data/sf_zipcodes.json",dtype=str)["zip_codes"]
valid_zips.head()
0    94102
1    94103
2    94104
3    94105
4    94107
Name: zip_codes, dtype: object
Step 2

invalid_zip_bus = bus.query("postal_code not in @valid_zips")
invalid_zip_bus.head(20)
bid    name    address    city    state    postal_code    latitude    longitude    phone_number
22    100126    Lamas Peruvian Food Truck    Private Location    San Francisco    CA    -9999    -9999.000000    -9999.000000    -9999
68    100417    COMPASS ONE, LLC    1 MARKET ST. FL    San Francisco    CA    94105-1420    -9999.000000    -9999.000000    14154324000
96    100660    TEAPENTER    1518 IRVING ST    San Francisco    CA    94122-1909    -9999.000000    -9999.000000    14155868318
109    100781    LE CAFE DU SOLEIL    200 FILLMORE ST    San Francisco    CA    94117-3504    -9999.000000    -9999.000000    14155614215
144    101084    Deli North 200    1 Warriors Way Level 300 North East    San Francisco    CA    94518    -9999.000000    -9999.000000    -9999
156    101129    Vendor Room 200    1 Warriors Way Level 300 South West    San Francisco    CA    -9999    -9999.000000    -9999.000000    -9999
177    101192    Cochinita #2    2 Marina Blvd Fort Mason    San Francisco    CA    -9999    -9999.000000    -9999.000000    14150429222
276    102014    DROPBOX (Section 3, Floor 7)    1800 Owens St    San Francisco    CA    -9999    -9999.000000    -9999.000000    -9999
295    102245    Vessell CA Operations (#4)    2351 Mission St    San Francisco    CA    -9999    -9999.000000    -9999.000000    -9999
298    10227    The Napper Tandy    3200 24th St    San Francisco    CA    -9999    37.752581    -122.416482    -9999
320    10372    BERNAL HEIGHTS NEIGBORHOOD CENTER    515 CORTLAND AVE    San Francisco    CA    -9999    37.739110    -122.416404    14155202142
321    10373    El Tonayense #1    1717 Harrison St    San Francisco    CA    -9999    37.769426    -122.413446    14155556127
322    10376    Good Frikin Chicken    10 29th St    San Francisco    CA    -9999    37.744369    -122.420967    -9999
324    10406    Sunset Youth Services    3918 Judah St    San Francisco    CA    -9999    37.760560    -122.504027    -9999
357    11416    El Beach Burrito    3914 Judah St    San Francisco    CA    -9999    37.760851    -122.503998    -9999
381    12199    El Gallo Giro    3055 23rd St    San Francisco    CA    -9999    37.754218    -122.413285    14155553048
384    12344    The Village Market & Pizza    750 Font Blvd    San Francisco    CA    -9999    37.723462    -122.483012    14155374525
406    13062    Everett Middle School    450 Church St    San Francisco    CA    -9999    37.763794    -122.428617    -9999
434    13753    Taboun    203 Parnassus Ave    San Francisco    CA    -9999    37.764574    -122.452950    -9999
548    17423    Project Open Hand    100 Diamond St    San Francisco    CA    -9999    37.760689    -122.437252    -9999
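An equivalent construction using Series.isin, the function the hint mentions; it should select the same rows as the query() call above:

# Businesses whose postal_code is not in the list of valid SF ZIP codes
bus[~bus['postal_code'].isin(valid_zips)].head(20)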




Question 2d
In the previous question, many of the businesses had a common invalid postal code that was likely used to encode a MISSING postal code. Do they all share a potentially "interesting address"?

In the following cell, construct a series that counts the number of businesses at each address that have this single likely MISSING postal code value. Order the series in descending order by count.

After examining the output, please answer the following question (2e) by filling in the appropriate variable. If we were to drop businesses with MISSING postal code values would a particular class of business be affected? If you are unsure try to search the web for the most common addresses.

missing_zip_address_count = invalid_zip_bus[invalid_zip_bus['postal_code'] == '-9999'].loc[:,'address'].value_counts()
missing_zip_address_count.head()
Off The Grid                  39
Off the Grid                  10
OTG                            4
Approved Locations             3
Approved Private Locations     3
Name: address, dtype: int64




Question 2e
Examine the invalid_zip_bus dataframe we computed above and look at the businesses that DO NOT have the special MISSING ZIP code value. Some of the invalid postal codes are just the full 9-digit code rather than the first 5 digits. Create a new column named postal5 in the original bus dataframe which contains only the first 5 digits of the postal_code column. Finally, for any of the postal5 ZIP code entries that were not a valid San Francisco ZIP code (according to valid_zips), set the entry to None.

# Keep only the first five characters of each postal code
bus['postal5'] = bus['postal_code'].str[:5]
# Any postal5 value that is not a valid San Francisco ZIP code is set to None
bus.loc[~bus['postal5'].isin(valid_zips), 'postal5'] = None

# Checking the corrected postal5 column
bus.loc[invalid_zip_bus.index, ['bid', 'name', 'postal_code', 'postal5']]
bid    name    postal_code    postal5
22    100126    Lamas Peruvian Food Truck    -9999    None
68    100417    COMPASS ONE, LLC    94105-1420    94105
96    100660    TEAPENTER    94122-1909    94122
109    100781    LE CAFE DU SOLEIL    94117-3504    94117
144    101084    Deli North 200    94518    None
...    ...    ...    ...    ...
6173    99369    HOTEL BIRON    94102-5917    94102
6174    99376    Mashallah Halal Food truck Ind    -9999    None
6199    99536    FAITH SANDWICH #2    94105-2907    94105
6204    99681    Twister    95112    None
6241    99819    CHESTNUT DINER    94123-3106    94123
230 rows × 4 columns









3: Investigate the Inspection Data
Let's now turn to the inspection DataFrame. Earlier, we found that ins has 4 columns named iid, score, date and type. In this section, we determine the granularity of ins and investigate the kinds of information provided for the inspections.

Let's start by looking again at the first 5 rows of ins to see what we're working with.

ins.head(5)
iid    date    score    type
0    100010_20190329    03/29/2019 12:00:00 AM    -1    New Construction
1    100010_20190403    04/03/2019 12:00:00 AM    100    Routine - Unscheduled
2    100017_20190417    04/17/2019 12:00:00 AM    -1    New Ownership
3    100017_20190816    08/16/2019 12:00:00 AM    91    Routine - Unscheduled
4    100017_20190826    08/26/2019 12:00:00 AM    -1    Reinspection/Followup




Question 3a
The column iid probably corresponds to an inspection id. Is it a primary key? Write an expression (line of code) that evaluates to True or False based on whether all the values are unique.

is_ins_iid_a_primary_key = len(ins["iid"].value_counts()) == len(ins["iid"])
is_ins_iid_a_primary_key
True




Question 3b
The column iid appears to be the composition of two numbers and the first number looks like a business id.

Part 1.: Create a new column called bid in the ins dataframe containing just the business id. You will want to use ins['iid'].str operations to do this. Also be sure to convert the type of this column to int

Part 2.: Then compute how many values in this new column are invalid business ids (i.e. do not appear in the bus['bid'] column). Consider using the pd.Series.isin function.

No python for loops or list comprehensions required!

Part 1

ins["bid"] = None
ins.loc[:,"bid"] = ins.loc[:,"iid"].str.split('_').apply(lambda x: x[0]).astype("int")
ins
iid    date    score    type    bid
0    100010_20190329    03/29/2019 12:00:00 AM    -1    New Construction    100010
1    100010_20190403    04/03/2019 12:00:00 AM    100    Routine - Unscheduled    100010
2    100017_20190417    04/17/2019 12:00:00 AM    -1    New Ownership    100017
3    100017_20190816    08/16/2019 12:00:00 AM    91    Routine - Unscheduled    100017
4    100017_20190826    08/26/2019 12:00:00 AM    -1    Reinspection/Followup    100017
...    ...    ...    ...    ...    ...
26658    999_20180924    09/24/2018 12:00:00 AM    -1    Routine - Scheduled    999
26659    999_20181102    11/02/2018 12:00:00 AM    -1    Reinspection/Followup    999
26660    999_20190909    09/09/2019 12:00:00 AM    80    Routine - Unscheduled    999
26661    99_20171207    12/07/2017 12:00:00 AM    82    Routine - Unscheduled    99
26662    99_20180808    08/08/2018 12:00:00 AM    84    Routine - Unscheduled    99
26663 rows × 5 columns

Part 2

invalid_bid_count = (~ins['bid'].isin(bus['bid'])).sum()
invalid_bid_count
0




Question 3c
What if we are interested in a time component of the inspection data? We need to examine the date column of each inspection.

Part 1: What is the type of the individual ins['date'] entries? You may want to grab the very first entry and use the type function in python.

Part 2: Use pd.to_datetime to create a new ins['timestamp'] column containing of pd.Timestamp objects. These will allow us to do more date manipulation.

Part 3: What are the earliest and latest dates in our inspection data? Hint: you can use min and max on dates of the correct type.

Part 4: We probably want to examine the inspections by year. Create an additional ins['year'] column containing just the year of the inspection. Consider using pd.Series.dt.year to do this.

No python for loops or list comprehensions required!

Part 1

ins_date_type = type(ins["date"][0])
ins_date_type
str
Part 2

ins["timestamp"] = pd.to_datetime(ins["date"])
ins
iid    date    score    type    bid    timestamp
0    100010_20190329    03/29/2019 12:00:00 AM    -1    New Construction    100010    2019-03-29
1    100010_20190403    04/03/2019 12:00:00 AM    100    Routine - Unscheduled    100010    2019-04-03
2    100017_20190417    04/17/2019 12:00:00 AM    -1    New Ownership    100017    2019-04-17
3    100017_20190816    08/16/2019 12:00:00 AM    91    Routine - Unscheduled    100017    2019-08-16
4    100017_20190826    08/26/2019 12:00:00 AM    -1    Reinspection/Followup    100017    2019-08-26
...    ...    ...    ...    ...    ...    ...
26658    999_20180924    09/24/2018 12:00:00 AM    -1    Routine - Scheduled    999    2018-09-24
26659    999_20181102    11/02/2018 12:00:00 AM    -1    Reinspection/Followup    999    2018-11-02
26660    999_20190909    09/09/2019 12:00:00 AM    80    Routine - Unscheduled    999    2019-09-09
26661    99_20171207    12/07/2017 12:00:00 AM    82    Routine - Unscheduled    99    2017-12-07
26662    99_20180808    08/08/2018 12:00:00 AM    84    Routine - Unscheduled    99    2018-08-08
26663 rows × 6 columns

Part 3

earliest_date = min(ins['timestamp'])
latest_date = max(ins['timestamp'])

print("Earliest Date:", earliest_date)
print("Latest Date:", latest_date)
Earliest Date: 2016-10-04 00:00:00
Latest Date: 2019-11-28 00:00:00
Part 4

ins["year"] = ins["timestamp"].dt.year
ins.head()
iid    date    score    type    bid    timestamp    year
0    100010_20190329    03/29/2019 12:00:00 AM    -1    New Construction    100010    2019-03-29    2019
1    100010_20190403    04/03/2019 12:00:00 AM    100    Routine - Unscheduled    100010    2019-04-03    2019
2    100017_20190417    04/17/2019 12:00:00 AM    -1    New Ownership    100017    2019-04-17    2019
3    100017_20190816    08/16/2019 12:00:00 AM    91    Routine - Unscheduled    100017    2019-08-16    2019
4    100017_20190826    08/26/2019 12:00:00 AM    -1    Reinspection/Followup    100017    2019-08-26    2019




Question 3d
How does the number of each type of inspection change over the 2016 to 2019 timeframe?

Part 1

Construct the following table by

Using pivot_table to compute the number (size) of inspections for each type and year.
Adding an extra Total column to the result using sum.
Sorting the results in descending order by Total.
year    2016    2017    2018    2019    Total
type                    
Routine - Unscheduled    966    4057    4373    4681    14077
Reinspection/Followup    445    1767    1935    2292    6439
New Ownership    99    506    528    459    1592
Complaint    91    418    512    437    1458
New Construction    102    485    218    189    994
Non-inspection site visit    51    276    253    231    811
New Ownership - Followup    0    45    219    235    499
Structural Inspection    1    153    50    190    394
Complaint Reinspection/Followup    19    68    70    70    227
Foodborne Illness Investigation    1    29    50    35    115
Routine - Scheduled    0    9    8    29    46
Administrative or Document Review    2    1    1    0    4
Multi-agency Investigation    0    0    1    2    3
Special Event    0    3    0    0    3
Community Health Assessment    1    0    0    0    1
No python for loops or list comprehensions required!

ins_pivot = pd.pivot_table(ins, values='iid', index=['type'], columns=['year'], aggfunc=len)
ins_pivot['Total'] = ins_pivot.sum(axis=1)
ins_pivot_sorted = (ins_pivot
                    .sort_values('Total', ascending=False)
                    .fillna(0)
                    .astype(int))
ins_pivot_sorted
year    2016    2017    2018    2019    Total
type                    
Routine - Unscheduled    966    4057    4373    4681    14077
Reinspection/Followup    445    1767    1935    2292    6439
New Ownership    99    506    528    459    1592
Complaint    91    418    512    437    1458
New Construction    102    485    218    189    994
Non-inspection site visit    51    276    253    231    811
New Ownership - Followup    0    45    219    235    499
Structural Inspection    1    153    50    190    394
Complaint Reinspection/Followup    19    68    70    70    227
Foodborne Illness Investigation    1    29    50    35    115
Routine - Scheduled    0    9    8    29    46
Administrative or Document Review    2    1    1    0    4
Multi-agency Investigation    0    0    1    2    3
Special Event    0    3    0    0    3
Community Health Assessment    1    0    0    0    1
Part 2

Based on the above analysis, which year appears to have had a lot of businesses in newly constructed buildings?
year_of_new_construction = 2017
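A quick sketch to double-check this programmatically using the pivot table from Part 1:

# Year with the most 'New Construction' inspections
ins_pivot_sorted.loc['New Construction'].drop('Total').idxmax()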




Question 3e
Let's examine the inspection scores ins['score']

ins['score'].value_counts().head()
-1      12632
 100     1993
 96      1681
 92      1260
 94      1250
Name: score, dtype: int64
There are a large number of inspections with a score of -1. These are probably missing values. Let's see which types of inspections have scores and which do not. Create the following dataframe using steps similar to the previous question, and assign it to the variable ins_missing_score_pivot.

You should observe that inspection scores appear only to be assigned to Routine - Unscheduled inspections.

Missing Score    False    True    Total
type            
Routine - Unscheduled    14031    46    14077
Reinspection/Followup    0    6439    6439
New Ownership    0    1592    1592
Complaint    0    1458    1458
New Construction    0    994    994
Non-inspection site visit    0    811    811
New Ownership - Followup    0    499    499
Structural Inspection    0    394    394
Complaint Reinspection/Followup    0    227    227
Foodborne Illness Investigation    0    115    115
Routine - Scheduled    0    46    46
Administrative or Document Review    0    4    4
Multi-agency Investigation    0    3    3
Special Event    0    3    3
Community Health Assessment    0    1    1
ins['Missing Score'] = (ins['score'] == -1)

ins_missing_score_pivot = pd.pivot_table(ins, values='iid', index=['type'],
                                         columns=['Missing Score'], aggfunc=len)
ins_missing_score_pivot['Total'] = ins_missing_score_pivot.sum(axis=1)
ins_missing_score_pivot = (ins_missing_score_pivot
                           .sort_values('Total', ascending=False)
                           .fillna(0)
                           .astype(int))
ins_missing_score_pivot
Missing Score    False    True    Total
type            
Routine - Unscheduled    14031    46    14077
Reinspection/Followup    0    6439    6439
New Ownership    0    1592    1592
Complaint    0    1458    1458
New Construction    0    994    994
Non-inspection site visit    0    811    811
New Ownership - Followup    0    499    499
Structural Inspection    0    394    394
Complaint Reinspection/Followup    0    227    227
Foodborne Illness Investigation    0    115    115
Routine - Scheduled    0    46    46
Administrative or Document Review    0    4    4
Multi-agency Investigation    0    3    3
Special Event    0    3    3
Community Health Assessment    0    1    1
Notice that inspection scores appear to be assigned only to Routine - Unscheduled inspections. It is reasonable for inspection types such as New Ownership and Complaint to have no associated score, but we might be curious why there are no scores for the Reinspection/Followup inspection type.
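If we wanted to dig into this, a quick sketch to peek at a few of those records:

# Inspect a handful of Reinspection/Followup rows
ins[ins['type'] == 'Reinspection/Followup'].head()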









4: Joining Data Across Tables
In this question we will start to connect data across multiple tables using the merge function.
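As a quick refresher on left-join semantics, a toy sketch with made-up frames (not part of the assignment data):

left = pd.DataFrame({'bid': [1, 2, 3], 'score': [90, 85, 70]})
right = pd.DataFrame({'bid': [1, 2], 'name': ['Cafe A', 'Cafe B']})

# A left join keeps every row of `left`; keys with no match get NaN for `name`
left.merge(right, how='left', on='bid')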





Question 4a
Let's figure out which restaurants had the lowest scores. Before we proceed, let's filter out missing scores from ins so that negative scores don't influence our results.

ins = ins[ins["score"] > 0]
We'll start by creating a new dataframe called ins_named. It should be exactly the same as ins, except that it should have the name and address of every business, as determined by the bus dataframe. If a business_id in ins does not exist in bus, the name and address should be given as NaN.

Hint: Use the merge method to join the ins dataframe with the appropriate portion of the bus dataframe. See the official documentation on how to use merge.

Note: For quick reference, a pandas 'left' join keeps the keys from the left frame, so if ins is the left frame, all of its rows are kept; for any keys without a match in the other frame, the columns coming from the other frame contain NaNs.

ins_named = ins.merge(bus[['bid', 'name', 'address']], how='left', on='bid')
ins_named.head()
iid    date    score    type    bid    timestamp    year    Missing Score    name    address
0    100010_20190403    04/03/2019 12:00:00 AM    100    Routine - Unscheduled    100010    2019-04-03    2019    False    ILLY CAFFE SF_PIER 39    PIER 39 K-106-B
1    100017_20190816    08/16/2019 12:00:00 AM    91    Routine - Unscheduled    100017    2019-08-16    2019    False    AMICI'S EAST COAST PIZZERIA    475 06th St
2    100041_20190520    05/20/2019 12:00:00 AM    83    Routine - Unscheduled    100041    2019-05-20    2019    False    UNCLE LEE CAFE    3608 BALBOA ST
3    100055_20190425    04/25/2019 12:00:00 AM    98    Routine - Unscheduled    100055    2019-04-25    2019    False    Twirl and Dip    335 Martin Luther King Jr. Dr
4    100055_20190912    09/12/2019 12:00:00 AM    82    Routine - Unscheduled    100055    2019-09-12    2019    False    Twirl and Dip    335 Martin Luther King Jr. Dr




Question 4b
Let's look at the 20 businesses with the lowest median score. Order your results by the median score followed by the business id to break ties. The resulting table should look like:

Hint: You may find the as_index argument of the groupby method important; see the pandas documentation for details.

bid    name    median score
3876    84590    Chaat Corner    54.0
4564    90622    Taqueria Lolita    57.0
4990    94351    VBowls LLC    58.0
2719    69282    New Jumbo Seafood Restaurant    60.5
222    1154    SUNFLOWER RESTAURANT    63.5
1991    39776    Duc Loi Supermarket    64.0
2734    69397    Minna SF Group LLC    64.0
3291    78328    Golden Wok    64.0
4870    93150    Chez Beesen    64.0
4911    93502    Smoky Man    64.0
5510    98995    Vallarta's Taco Bar    64.0
1457    10877    CHINA FIRST INC.    64.5
2890    71310    Golden King Vietnamese Restaurant    64.5
4352    89070    Lafayette Coffee Shop    64.5
505    2542    PETER D'S RESTAURANT    65.0
2874    71008    House of Pancakes    65.0
818    3862    IMPERIAL GARDEN SEAFOOD RESTAURANT    66.0
2141    61427    Nick's Foods    66.0
2954    72176    Wolfes Lunch    66.0
4367    89141    Cha Cha Cha on Mission    66.5
twenty_lowest_scoring = (
    ins_named.groupby(['bid', 'name'], as_index=False)
             .agg({'score': 'median'})
             .rename(columns={'score': 'median score'})
             .sort_values(['median score', 'bid'])
)
twenty_lowest_scoring.head(20)
bid    name    median score
3876    84590    Chaat Corner    54.0
4564    90622    Taqueria Lolita    57.0
4990    94351    VBowls LLC    58.0
2719    69282    New Jumbo Seafood Restaurant    60.5
222    1154    SUNFLOWER RESTAURANT    63.5
1991    39776    Duc Loi Supermarket    64.0
2734    69397    Minna SF Group LLC    64.0
3291    78328    Golden Wok    64.0
4870    93150    Chez Beesen    64.0
4911    93502    Smoky Man    64.0
5510    98995    Vallarta's Taco Bar    64.0
1457    10877    CHINA FIRST INC.    64.5
2890    71310    Golden King Vietnamese Restaurant    64.5
4352    89070    Lafayette Coffee Shop    64.5
505    2542    PETER D'S RESTAURANT    65.0
2874    71008    House of Pancakes    65.0
818    3862    IMPERIAL GARDEN SEAFOOD RESTAURANT    66.0
2141    61427    Nick's Foods    66.0
2954    72176    Wolfes Lunch    66.0
4367    89141    Cha Cha Cha on Mission    66.5




Question 4c
Let's now examine the descriptions of violations for inspections with score > 0 and score < 65. Construct a Series indexed by the description of the violation from the vio table, with the value being the number of times that violation occurred for inspections in the above score range. Sort the results in descending order of the count.

The first few entries should look like:

Unclean or unsanitary food contact surfaces                                  43
High risk food holding temperature                                           42
Unclean or degraded floors walls or ceilings                                 40
Unapproved or unmaintained equipment or utensils                             39
You will need to use merge twice.

low_score_violations = ins_named.merge(ins2vio, how='left')
low_score_violations = low_score_violations.merge(vio, how='left')
low_score_violations = low_score_violations.query('score > 0 and score < 65')
low_score_violations = low_score_violations.loc[:,'description'].value_counts()
low_score_violations.head()
Unclean or unsanitary food contact surfaces         43
High risk food holding temperature                  42
Unclean or degraded floors walls or ceilings        40
Unapproved or unmaintained equipment or utensils    39
High risk vermin infestation                        37
Name: description, dtype: int64
Question 4d
Let's figure out which restaurant had the worst scores ever (single lowest score).

In the cell below, write the name of the restaurant with the lowest inspection scores ever. You can also head to yelp.com and look up the reviews page for this restaurant. Feel free to add anything interesting you want to share.

worst_restaurant = ins_named.sort_values(by=['score']).loc[:,'name']
worst_restaurant.head(1)
10898    Lollipot
Name: name, dtype: object
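As an optional follow-up, a quick sketch to pull up every inspection record for that lowest-scoring restaurant:

# All inspections for the restaurant with the single lowest score
ins_named[ins_named['name'] == 'Lollipot'][['timestamp', 'score', 'type']]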
Congratulations! You have finished Homework 2!
