Assignment 2: Data and Distributions

CSDS 313: Introduction to Data Analysis 

Problem 1
The purpose of this exercise is to investigate how different distributions can have similar statistics
and/or visualizations. Suppose you are given a normal distribution N(µ, σ). We would like to
estimate a uniform distribution U(a, b) (i.e., the range of the distribution is [a, b]) with identical
statistics to the given normal distribution. These statistics are specified as follows:
(i) Find the parameters (a and b) of a uniform distribution in terms of µ and σ such that the mean
and standard deviation of the uniform distribution are the same as those of the given normal distribution.
(ii) Find the parameters (a and b) of a uniform distribution in terms of µ and σ such that the
25th and 75th percentile points of the uniform distribution and the given normal distribution
are the same. Assume you can compute the inverse cumulative distribution function $\Phi^{-1}(p, \mu, \sigma)$
of a normal distribution N(µ, σ) for any 0 ≤ p ≤ 1. See probit function for more information.
Hint: You should estimate the parameters a and b of the uniform distribution by simply using
$\Phi^{-1}(p, \mu, \sigma)$.
For parts (i) and (ii) separately, obtain a uniform distribution U(a, b) as a function of µ and σ, i.e.,
find a = f_a(µ, σ) and b = f_b(µ, σ). Then, estimate the parameters of the uniform distributions U1(a1, b1)
and U2(a2, b2) corresponding to parts (i) and (ii) for the normal distribution N(µ = 2, σ = 5).
Simulate 10 000 data points from each of the U1(a1, b1), U2(a2, b2), and N(2, 5) distributions separately.
Visualize the 3 simulated distributions using histograms, error bars, and boxplots. Compare and
comment on how the obtained uniform distributions are similar or dissimilar to the given normal
distribution. Also, compare and comment on how they are similar or dissimilar to each other.
Note that you can compute the probit function $\Phi^{-1}(p, \mu, \sigma)$ as follows:
• MATLAB: norminv function.
• Python: norm.ppf function in scipy.stats package.
• R: qnorm function.
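The Python route can be sketched as follows. This is a minimal illustration rather than a reference solution: it assumes σ denotes the standard deviation (as the problem statement implies) and uses the standard facts that U(a, b) has mean (a + b)/2, standard deviation (b − a)/√12, and quartiles a + (b − a)/4 and a + 3(b − a)/4. The helper names uniform_from_moments and uniform_from_quartiles, and the random seed, are made up for this example.

```python
import numpy as np
from scipy.stats import norm


def uniform_from_moments(mu, sigma):
    """Return (a, b) so that U(a, b) has mean mu and standard deviation sigma."""
    # sd of U(a, b) is (b - a) / sqrt(12), so the half-width is sqrt(3) * sigma
    half_width = np.sqrt(3.0) * sigma
    return mu - half_width, mu + half_width


def uniform_from_quartiles(mu, sigma):
    """Return (a, b) so that U(a, b) shares the quartiles of N(mu, sigma)."""
    q1 = norm.ppf(0.25, loc=mu, scale=sigma)
    q3 = norm.ppf(0.75, loc=mu, scale=sigma)
    iqr = q3 - q1  # the middle 50% of U(a, b) covers half of its range
    return q1 - iqr / 2.0, q3 + iqr / 2.0


mu, sigma, n = 2.0, 5.0, 10_000
rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

a1, b1 = uniform_from_moments(mu, sigma)
a2, b2 = uniform_from_quartiles(mu, sigma)

samples = {
    "N": rng.normal(mu, sigma, n),
    "U1": rng.uniform(a1, b1, n),
    "U2": rng.uniform(a2, b2, n),
}
for name, x in samples.items():
    q25, q75 = np.percentile(x, [25, 75])
    print(f"{name}: mean={x.mean():.2f} sd={x.std(ddof=1):.2f} "
          f"q25={q25:.2f} q75={q75:.2f}")
```

The samples dictionary can be passed to any plotting library to produce the histograms, error bars, and boxplots requested above.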
Problem 2
For this exercise, we will use two datasets that are provided with the assignment:
• The file “airport routes.csv” contains the number of available routes of 3409 airports all
around the world (as of February 2017). Each row indicates an airport (identified with a
3-letter code) and the number of routes. For example, “CLE, 81” indicates that Cleveland
Hopkins International Airport has outgoing flights to 81 different airports. See data source
for more information.
• The file “movie votes.csv” contains the average rating (between 1 and 10) of 4392 movies in
the TMDb database, sorted in descending order. Each row contains a movie name and the average
TMDb vote of that movie. For example, "The Godfather", 8.4 and "Interstellar", 8.1.
See data source for more information.
For each of these datasets, consider the following models:
(a) Suppose the given data points follow a power law distribution. Estimate the corresponding α
parameter. You can use the maximum likelihood estimator from Newman’s notes on power laws.
(b) Suppose the given data points follow an exponential distribution.
Estimate the corresponding λ parameter.
(c) Suppose the given data points follow a uniform distribution.
Estimate the corresponding range parameters [a, b] of the uniform distribution.
(d) Suppose the given data points follow a normal distribution.
Estimate the corresponding µ and σ parameters.
For each of these datasets separately, compare the models you estimated in parts (a) to (d). Which
distribution do you think the data follows and why? Explain. For each model, generate random
data samples drawn from the respective distribution. Use visualizations of the empirical data and
the data you generate to support your conclusions.
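The four estimates in parts (a) to (d) can be sketched with the standard maximum likelihood formulas. This is an illustrative sketch, not the required solution: the function names are made up here, the power-law fit uses the continuous-data formula α = 1 + n / Σ ln(x_i / x_min), and taking x_min to be the sample minimum is an assumption that you may want to revisit (the formula also requires strictly positive data). The synthetic samples at the end are stand-ins used only to check the estimators; they are not the assignment datasets.

```python
import numpy as np


def fit_power_law(x, x_min=None):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x_i / x_min))."""
    x = np.asarray(x, dtype=float)
    if x_min is None:
        x_min = x.min()  # assumption: the power law holds over the whole sample
    tail = x[x >= x_min]
    return 1.0 + tail.size / np.log(tail / x_min).sum()


def fit_exponential(x):
    """MLE of the rate parameter: lambda = 1 / sample mean."""
    return 1.0 / np.mean(x)


def fit_uniform(x):
    """MLE of the range [a, b]: the sample minimum and maximum."""
    return float(np.min(x)), float(np.max(x))


def fit_normal(x):
    """MLE of mu and sigma (np.std with ddof=0 is the ML estimate of sigma)."""
    return float(np.mean(x)), float(np.std(x))


# Sanity check on synthetic data with known parameters (not the real datasets).
rng = np.random.default_rng(0)
true_alpha, true_lambda = 2.5, 0.5
u = rng.random(50_000)
power_law_sample = (1.0 - u) ** (-1.0 / (true_alpha - 1.0))  # inverse transform, x_min = 1
exponential_sample = rng.exponential(scale=1.0 / true_lambda, size=50_000)

alpha_hat = fit_power_law(power_law_sample)
lambda_hat = fit_exponential(exponential_sample)
print(f"alpha_hat={alpha_hat:.3f} lambda_hat={lambda_hat:.3f}")
```

Once the parameters are fitted to a real dataset, the same NumPy generator can draw samples from each fitted model for the visual comparison described above.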
Problem 3
Recall the rocket problem from exercise 3: You are working as chief data scientist at a rocket
production company. You know that your company’s competitor is assigning integer IDs to their
rockets. In other words, if the competitor produced M rockets, there is a rocket with ID i for all
1 ≤ i ≤ M. Your company’s intelligence was able to collect the IDs of n rockets produced by the
competitor and these IDs are 1 ≤ x1 ≤ x2 ≤ ... ≤ xn. You can assume that the IDs collected by the
intelligence represent a uniform sampling of the M IDs.
(i) What is the maximum likelihood estimator for M? Simulate the rockets and intelligence reports
to show whether the maximum likelihood estimator is an unbiased estimator. (Hint: make sure to
choose a large M and a large number of trials for your simulation.)
(ii) Let $\hat{M}_{MVU} = x_n \left( \frac{n+1}{n} \right) - 1$. Let $\hat{M}_{MEAN} = 2 \left( \sum_{i=1}^{n} x_i / n \right) - 1$. Simulate the rockets and
intelligence reports to show which of the above unbiased estimators ($\hat{M}_{MVU}$ or $\hat{M}_{MEAN}$) has
the lower variance.
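One possible shape for such a simulation is sketched below. It assumes the n IDs are drawn uniformly without replacement from 1..M, which is one reading of "uniform sampling", and it uses the largest observed ID x_n as the maximum likelihood estimate, the standard result under that assumption. The function name simulate_estimators and the default values of M, n, the number of trials, and the seed are arbitrary choices for illustration.

```python
import numpy as np


def simulate_estimators(M=10_000, n=20, trials=5_000, seed=0):
    """Return per-trial estimates of M for the three estimators."""
    rng = np.random.default_rng(seed)
    mle = np.empty(trials)
    mvu = np.empty(trials)
    mean_est = np.empty(trials)
    for t in range(trials):
        # assumption: IDs are sampled uniformly without replacement from 1..M
        ids = rng.choice(M, size=n, replace=False) + 1
        x_n = ids.max()
        mle[t] = x_n                        # largest observed ID
        mvu[t] = x_n * (n + 1) / n - 1      # x_n (n + 1) / n - 1
        mean_est[t] = 2 * ids.mean() - 1    # 2 * (sample mean) - 1
    return {"MLE": mle, "MVU": mvu, "MEAN": mean_est}


results = simulate_estimators()
for name, est in results.items():
    print(f"{name}: mean={est.mean():.1f} variance={est.var(ddof=1):.1f}")
```

Comparing each estimator's average to the true M addresses the bias question in part (i), and comparing the printed variances addresses part (ii).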
