COSC 311 - Lab 4
1 Objectives
1. Practice efficiently manipulating data with Python
2. Use the matplotlib and pandas libraries
3. Gain familiarity with statistical tools
2 Tasks
1. You may submit this lab in groups of one or two.
2. Download the “Adult” data set from the UCI Machine Learning data repository:
https://archive.ics.uci.edu/ml/datasets/Adult. This data set is a record of adults, along with various
occupational and lifestyle attributes. Each adult is “labeled” according to whether they make more than
$50k per year. Using this as the driving label, one would typically want to design a process to determine
what combinations of factors enable a person to make more than $50k per year.
(a) Read the data into a pandas DataFrame object.
(b) Use the data and the numpy library to compute the following:
i. What are the 25th, 50th, and 75th percentiles of the “education-num” field?
ii. What is the probability that an adult makes more than $50k given that their education-num
is within each of the ranges defined by the above percentiles (from 0 to the 25th percentile,
from the 25th to the 50th, etc.)?
iii. Plot how the probability that a person makes more than $50k, and less than $50k, changes
with their years of education.
iv. What is the covariance between the number of hours worked per week and education-num?
v. Use the pandas.DataFrame.boxplot functionality to create a box-and-whisker plot which
illustrates the spread of hours worked among adults who make more than $50k and among
those who make less than $50k.
vi. Use the pandas.DataFrame.boxplot functionality to create a box-and-whisker plot which
illustrates the spread of hours worked among adults, grouped by native country and by
whether they make more or less than $50k.
vii. Create a table where entry (x, y) contains the conditional probability
P(A random adult has level of education x|they have level of education y).
viii. Create a table where entry (x, y) contains the conditional probability of having marital status
x given that they have occupation y.
ix. For each occupation, what is the conditional probability that a person makes more than
$50k, and less than $50k, given that they work in that occupation?
x. Plot how the probability that a person makes more than $50k, and less than $50k, changes
with the number of hours that they work per week.
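As a starting point for the computations in part (b), the following is a minimal sketch, not a reference solution. It assumes the raw adult.data file has no header row, uses ", " as the separator, and marks missing values with "?". The column names follow the UCI documentation, except "income", which is a name chosen here for the label column. All helper function names are hypothetical, and the small demo frame is made-up data used only to exercise the functions.

```python
import numpy as np
import pandas as pd

# Column names assumed from the UCI documentation for the Adult data set;
# "income" is a name chosen here for the label column (">50K" or "<=50K").
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]


def load_adult(path):
    """Read a local copy of adult.data (assumed: no header, ', ' separators)."""
    return pd.read_csv(path, header=None, names=COLUMNS,
                       skipinitialspace=True, na_values="?")


def education_percentiles(df):
    """25th, 50th, and 75th percentiles of education-num."""
    return np.percentile(df["education-num"], [25, 50, 75])


def prob_high_income_by_range(df):
    """P(income > 50K | education-num in (lo, hi]) for each percentile range."""
    edu = df["education-num"]
    high = df["income"].str.startswith(">50K")
    q25, q50, q75 = (float(q) for q in education_percentiles(df))
    edges = [0.0, q25, q50, q75, float(edu.max())]
    probs = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_range = (edu > lo) & (edu <= hi)
        if in_range.any():  # skip empty ranges (e.g. repeated percentiles)
            probs[(lo, hi)] = float(high[in_range].mean())
    return probs


def hours_education_covariance(df):
    """Sample covariance between hours-per-week and education-num."""
    return float(np.cov(df["hours-per-week"], df["education-num"])[0, 1])


def marital_given_occupation(df):
    """Table whose entry (x, y) is P(marital status x | occupation y)."""
    return pd.crosstab(df["marital-status"], df["occupation"],
                       normalize="columns")


# Tiny made-up frame, NOT real Adult data; it only exercises the helpers.
demo = pd.DataFrame({
    "education-num": [7, 9, 9, 10, 10, 12, 13, 14, 16],
    "hours-per-week": [20, 40, 40, 40, 45, 40, 50, 50, 60],
    "marital-status": ["Never-married", "Married", "Married",
                       "Never-married", "Married", "Divorced",
                       "Married", "Married", "Divorced"],
    "occupation": ["Sales", "Sales", "Tech", "Tech", "Sales",
                   "Tech", "Tech", "Sales", "Sales"],
    "income": ["<=50K", "<=50K", ">50K", "<=50K", ">50K",
               "<=50K", ">50K", ">50K", ">50K"],
})

print(education_percentiles(demo))
print(prob_high_income_by_range(demo))
print(hours_education_covariance(demo))
print(marital_given_occupation(demo))
```

With a downloaded copy of the data, `load_adult("adult.data")` (the path is an assumption) would replace the demo frame. The range boundaries here are treated as half-open intervals (lo, hi], which is one reasonable reading of "within the ranges"; a different convention changes which rows sit on a boundary.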
3. Answer the following questions using the fundamentals of probability.
(a) If A and B are independent, show that Ā and B, A and B̄, and Ā and B̄ are each independent
(where Ā denotes the complement of A).
(b) Suppose we send 30% of our products to company A and 70% of our products to company B.
Company A reports that 5% of the products it receives are defective and company B reports that 4% of
the products it receives are defective. For each probability below, compute the precise value by hand, and
also write a short Python script to simulate the above scenario and estimate each probability by
empirically examining the rates of each event.
i. Find the probability that a product is sent to company A and it is defective.
ii. Find the probability that a product is sent to company A and it is not defective.
iii. Find the probability that a product is sent to company B and it is defective.
iv. Find the probability that a product is sent to company B and it is not defective.
(c) Show that, for events A and B, P(A|B) > P(A) implies P(B|A) > P(B).
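The simulation requested in part (b) could be structured along the following lines. This is a sketch under one assumption: the reported defect rates are read as conditional probabilities, P(defective | sent to A) = 0.05 and P(defective | sent to B) = 0.04. The function name, sample size, and seed are illustrative choices. The exact values come from P(company and status) = P(company) · P(status | company).

```python
import random


def simulate(n_products=200_000, seed=0):
    """Estimate the joint probability of (company, defective) by sampling."""
    rng = random.Random(seed)
    counts = {("A", True): 0, ("A", False): 0,
              ("B", True): 0, ("B", False): 0}
    for _ in range(n_products):
        # 30% of products go to company A, the remaining 70% to company B.
        company = "A" if rng.random() < 0.30 else "B"
        # Assumed reading: defect rates are conditional on the company.
        defect_rate = 0.05 if company == "A" else 0.04
        defective = rng.random() < defect_rate
        counts[(company, defective)] += 1
    return {event: count / n_products for event, count in counts.items()}


# Exact joint probabilities computed by hand for comparison.
exact = {
    ("A", True): 0.30 * 0.05,   # sent to A and defective
    ("A", False): 0.30 * 0.95,  # sent to A and not defective
    ("B", True): 0.70 * 0.04,   # sent to B and defective
    ("B", False): 0.70 * 0.96,  # sent to B and not defective
}

estimated = simulate()
for event in exact:
    print(event, "exact:", round(exact[event], 4),
          "estimated:", round(estimated[event], 4))
```

The estimates should approach the exact values as `n_products` grows; the fixed seed only makes a run repeatable.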
3 Submission
Zip your source files and upload them to the assignment page on MyClasses. Be sure to include all source
files, properly documented; a README file describing the program and how it works; and answers to the
discussion questions above.