AST 325/326 (Fall 2022) Lab #1
Lab #1: Basic Photon Statistics with Python
Lab report is due at 4pm on Oct 3 (Monday). Submission on Quercus.
1. Overview and Goals
This document provides a description of the activities that we will explore in the 1st lab. Because we
use the 1st lab to introduce many new skills, including Python programming (potentially on a Linux
or JupyterHub platform) for data analysis and visualization as well as document writing using LaTeX,
this document is a straightforward guide that you can follow sequentially.
The primary goal of the 1st lab is to learn how to conduct simple statistical analysis and data
visualization using the Python programming language. (Check the calendar of the TA tutorials/office
hours for tutorials on how to start Python programming.) This simple lab will serve as a stepping stone
to more advanced analyses in the forthcoming labs. (For those students who already have Python
programming experience, this will be a very easy assignment. More challenging and advanced
assignments will come later.)
Schedule: The report is due at 4pm on Oct 3, to be submitted on Quercus. See §5 for detailed steps.
References (available online and on the course website as well as in TA tutorials)
Python
On the class web page on Quercus, you can find the “Introductory Document for Computing
Resources” by the Programmer Supporter TA. The following web pages can be very useful for
learning Python:
https://docs.scipy.org/doc/scipy/getting_started.html
https://numpy.org/ (on numerical computing in Python)
https://matplotlib.org/ (on visualization in Python)
LaTeX document writing: You are required to write your lab reports using LaTeX in this course.
The Overleaf web page at https://www.overleaf.com/ provides a service that you can use for free.
Note that LaTeX is the standard document-preparation system in astronomy and many other fields of
science and engineering.
Linux: For those who use Python on Linux (including JupyterHub), it is important to
understand basic Linux commands.
2. Basic Statistical Parameters and Data Visualization
Let’s assume that you have conducted a series of astronomical observations to estimate the distance to
a nearby star. You repeated the same observation 30 times and obtained the following results – the
numbers are the distances (in parsecs) to the star obtained in each trial, whereas the numbers in the
parentheses are the 1-σ measurement uncertainties (= the uncertainty at the confidence level of 68.3%,
assuming that the measurements follow a Gaussian distribution).
Distance Measurements:
38.91 (1.41) 37.14 (0.36) 38.19 (0.69) 41.03 (3.53) 34.86 (2.64)
37.33 (0.17) 35.16 (2.34) 37.96 (0.46) 36.93 (0.57) 40.41 (2.91)
29.50 (8.00) 37.33 (0.17) 41.84 (4.34) 37.53 (0.03) 34.12 (3.38)
34.11 (3.39) 37.94 (0.44) 34.43 (3.07) 36.68 (0.82) 41.31 (3.81)
39.61 (2.11) 35.48 (2.02) 34.98 (2.52) 39.05 (1.55) 39.62 (2.12)
37.96 (0.46) 39.02 (1.52) 37.47 (0.03) 33.76 (3.74) 36.51 (0.99)
The first thing you want to do with the data is to examine the distribution of the measurements. The
left panel in Figure 1 shows all 30 measurements plotted in sequence, while the right panel
shows a histogram of the data. Do you see the difference? Which is better?
Figure 1. (Left) Distribution of the measured distances. (Right) Histogram of the measured distances.
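For reference, a plot like Figure 1 can be produced with a short script along the lines of the sketch below. This is only an illustrative sketch, not the code used to make Figure 1; the variable names, figure size, and binning are arbitrary choices.

# Sketch (not the code used for Figure 1): plot the 30 distance measurements
# in sequence (left) and as a histogram (right).
import numpy as np
import matplotlib.pyplot as plt

dist = np.array([38.91, 37.14, 38.19, 41.03, 34.86, 37.33, 35.16, 37.96,
                 36.93, 40.41, 29.50, 37.33, 41.84, 37.53, 34.12, 34.11,
                 37.94, 34.43, 36.68, 41.31, 39.61, 35.48, 34.98, 39.05,
                 39.62, 37.96, 39.02, 37.47, 33.76, 36.51])  # distances (pc)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(np.arange(dist.size), dist, 'o')   # measurements in sequence
ax1.set_xlabel('Measurement number')
ax1.set_ylabel('Distance (pc)')
ax2.hist(dist, bins=10)                     # histogram of the same data
ax2.set_xlabel('Distance (pc)')
ax2.set_ylabel('Number of measurements')
plt.savefig('figure1_sketch.png')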
The next step is to calculate a statistical parameter that best represents the measured distance. We often
use the mean for this and the standard deviation (= STDDEV) for its 1-σ uncertainty. From the data above, I
have the following mean and STDDEV:

Mean ± STDDEV = 37.21 ± 2.65 (pc)

using the following definition of STDDEV (= σ):

σ = √[ Σ_i (x_i − x̄)² / (N − 1) ]

where x_i, x̄, and N represent the measured values, the mean, and the number of measurements, respectively.
However, each measurement has a different uncertainty, and the measurements with a smaller
uncertainty should be more reliable than those with a larger one. Using weighted-mean statistics (in
which the measurements with a smaller uncertainty are given more weight), I now have the following
(weighted) mean and STDDEV:

(Weighted) Mean ± STDDEV = 37.50 ± 0.02 (pc)

You can confirm that the weighted mean has a much smaller (i.e., more reliable) uncertainty.
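The numbers above can be reproduced with a few lines of NumPy. The sketch below assumes the standard inverse-variance weighting (weights w_i = 1/σ_i²) for the weighted mean and quotes 1/√(Σ w_i) as its uncertainty; the text does not write these formulas out explicitly, so treat the weighting choice as an illustrative assumption.

# Sketch: unweighted and inverse-variance-weighted statistics for the
# 30 distance measurements.  The weighting scheme (w_i = 1/sigma_i**2) is
# assumed here; it is the standard choice but is not written out in the text.
import numpy as np

dist = np.array([38.91, 37.14, 38.19, 41.03, 34.86, 37.33, 35.16, 37.96,
                 36.93, 40.41, 29.50, 37.33, 41.84, 37.53, 34.12, 34.11,
                 37.94, 34.43, 36.68, 41.31, 39.61, 35.48, 34.98, 39.05,
                 39.62, 37.96, 39.02, 37.47, 33.76, 36.51])   # distances (pc)
sig = np.array([1.41, 0.36, 0.69, 3.53, 2.64, 0.17, 2.34, 0.46,
                0.57, 2.91, 8.00, 0.17, 4.34, 0.03, 3.38, 3.39,
                0.44, 3.07, 0.82, 3.81, 2.11, 2.02, 2.52, 1.55,
                2.12, 0.46, 1.52, 0.03, 3.74, 0.99])          # 1-sigma errors (pc)

mean = np.mean(dist)
stddev = np.std(dist, ddof=1)               # ddof=1 gives the 1/(N-1) definition
print('Mean +/- STDDEV = {:.2f} +/- {:.2f} pc'.format(mean, stddev))

w = 1.0 / sig**2                            # inverse-variance weights (assumed)
wmean = np.sum(w * dist) / np.sum(w)        # weighted mean
werr = 1.0 / np.sqrt(np.sum(w))             # uncertainty of the weighted mean
print('Weighted mean = {:.2f} +/- {:.2f} pc'.format(wmean, werr))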
3. Poisson Distribution and Gaussian Distribution
Often astronomical signals (e.g., photons detected with an astronomical detector based on the photoelectric
effect) follow the Poisson probability distribution given below:
P(x; µ) = (µ^x / x!) e^(−µ)   (Poisson distribution)
where P(x; µ) is the probability that x events will be measured within an interval given a mean of µ events.
Note that the Poisson distribution is a discrete (as opposed to continuous) distribution and cannot take a
negative mean value. As seen in Figure 2 below, the Poisson distribution is asymmetric
when the mean value is small, but it becomes more symmetric for larger mean values.
Figure 2. Poisson probability distribution for different mean values.
The Poisson distribution has the following unique property:

STDDEV = √(Mean)   (for the Poisson distribution)

in which the standard deviation is determined by the mean value.
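You can convince yourself of this property with a quick simulation, for example the sketch below; the chosen mean (12) and sample size (100000) are arbitrary illustrative values.

# Sketch: check STDDEV = sqrt(Mean) with simulated Poisson counts.
# The mean (12) and sample size (100000) are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(lam=12, size=100000)   # simulated Poisson counts
print('mean       =', counts.mean())
print('stddev     =', counts.std(ddof=1))
print('sqrt(mean) =', np.sqrt(counts.mean()))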
Now let’s assume that you measured the number of incoming photons from an astronomical source per
second 30 consecutive times as below. The expected mean value is 12 photons per second.
Measured Photon Count Rates (= Number of incoming photons per second)
13 17 18 14 11 8 21 18 9 12
9 17 14 6 10 16 16 11 10 12
8 20 14 10 14 17 13 16 12 10
First, let’s calculate the mean and standard deviation as follows:
Mean ± STDDEV = 13.2 ± 3.8 (counts/second),
and, since √13.2 ≈ 3.6, you can see that the above relation STDDEV = √(Mean) for a Poisson
distribution (approximately) applies to these measurements.
Figure 3 below compares the histogram (solid line) of the 30 measured photon count rates with the
expected Poisson distribution (dashed line) with µ = 12, calculated using the Poisson distribution
formula above. Do you think the data follow the Poisson distribution with µ = 12? (Note that the dashed
line in the figure was calculated by taking the probabilities from a Poisson distribution with µ = 12 and
multiplying them by the number of measurements, so that they can be compared with the measurement histogram.)
Figure 3. (Solid line) Histogram of the photon count rates; (Dashed line) Expected Poisson distribution with µ = 12.
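A comparison like Figure 3 can be sketched as follows; this is not the code used to make Figure 3, and the binning and plot styling are arbitrary choices.

# Sketch of the Figure 3 comparison (not the code used to make Figure 3):
# histogram of the 30 measured count rates overlaid with the Poisson
# probabilities for mu = 12, scaled by the number of measurements.
import numpy as np
import matplotlib.pyplot as plt
from math import factorial

counts = np.array([13, 17, 18, 14, 11,  8, 21, 18,  9, 12,
                    9, 17, 14,  6, 10, 16, 16, 11, 10, 12,
                    8, 20, 14, 10, 14, 17, 13, 16, 12, 10])

mu = 12.0
x = np.arange(0, 31)                                     # possible count values
pmf = np.array([mu**int(k) * np.exp(-mu) / factorial(int(k)) for k in x])

plt.hist(counts, bins=np.arange(-0.5, 31.5, 1.0),
         histtype='step', label='measured counts')        # solid histogram
plt.plot(x, pmf * counts.size, '--',
         label='Poisson, mu = 12 (scaled)')                # scaled expectation
plt.xlabel('Photon count rate (counts/s)')
plt.ylabel('Number of measurements')
plt.legend()
plt.savefig('figure3_sketch.png')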
In comparison, the Gaussian probability distribution, which is often called the “normal distribution” and
which we are more familiar with, takes the following form
P(x; µ, σ) = [1 / (σ √(2π))] exp[ −(1/2) ((x − µ)/σ)² ]   (Gaussian distribution)
and it is a continuous distribution whose mean can be negative. As in Figure 4 below, the
Gaussian distribution is a symmetric distribution wherein 68.3% of the entire probability lies within
(Mean) ± (STDDEV). Many random variables (e.g., random noise) follow the Gaussian probability
distribution. In contrast to the Poisson distribution, the standard deviation is not determined by the
mean value in a Gaussian distribution.
Figure 4. Gaussian probability distribution. (Wikipedia)
Note that for a large mean value, a Poisson distribution becomes very similar to a Gaussian
distribution (see Figure 2).
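The following sketch illustrates this convergence by overplotting a Poisson distribution with a large mean and a Gaussian with the same mean and σ = √(mean); the mean value of 50 is an arbitrary illustrative choice.

# Sketch: a Poisson distribution with a large mean compared with a Gaussian
# of the same mean and sigma = sqrt(mean).  The mean of 50 is an arbitrary
# illustrative choice.
import numpy as np
import matplotlib.pyplot as plt
from math import factorial

mu = 50.0
k = np.arange(0, 101)
poisson = np.array([mu**int(n) * np.exp(-mu) / factorial(int(n)) for n in k])

sigma = np.sqrt(mu)
gauss = np.exp(-0.5 * ((k - mu) / sigma)**2) / (sigma * np.sqrt(2.0 * np.pi))

plt.plot(k, poisson, 'o', label='Poisson (mean = 50)')
plt.plot(k, gauss, '-', label='Gaussian (mean = 50, sigma = sqrt(50))')
plt.xlabel('x')
plt.ylabel('Probability')
plt.legend()
plt.savefig('poisson_vs_gaussian_sketch.png')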
4. Linear Least Squares Fitting
One of the primary skills we need to learn is “linear least squares fitting”. Observations and
experimental measurements are noisy and limited in number, so the model parameters of interest
cannot simply be read off the data. Instead, we use an equation to model a set of data and compare
the observed values with the fitted values from the model. These differences are referred to as
residuals. The term “least squares” refers to minimizing the sum of the squares of the residuals to
determine the best-fit model for the observed data set.
In this lab we will first focus on linear least squares, where the model is a straight line, but we will
later generalize the least-squares method to other, non-linear functions. It is important that in this lab
you do not use a canned least-squares routine: you must write your own least-squares routine.
4.1 Fitting a Straight Line
Suppose that we have a set of N observations (x_i, y_i) where we believe that the measured value, y,
depends linearly on x, i.e.,

y_i = m x_i + c.

For example, if a body is moving with constant velocity, what are the speed (m) and the initial
position (c) of the object?
Given our data, what is the best estimate of m and c? Assume that the independent variable, x_i, is
known exactly, and that the dependent variable, y_i, is drawn from a Gaussian probability distribution
with constant standard deviation, σ_i = σ = const. Under these circumstances the most likely values
of m and c are those corresponding to the straight line with the minimum total squared deviation, i.e.,
the quantity

S(m, c) = Σ_i (y_i − m x_i − c)²

is minimized when m and c have their most likely values. Figure 5 shows a typical deviation.
The best values of m and c are found by solving the simultaneous equations

∂S/∂m = 0   and   ∂S/∂c = 0.

Evaluating the derivatives yields

m Σ x_i² + c Σ x_i = Σ x_i y_i
m Σ x_i + c N = Σ y_i.

This set of equations can conveniently be expressed compactly in matrix form,

( Σ x_i²   Σ x_i ) ( m )   ( Σ x_i y_i )
( Σ x_i    N     ) ( c ) = ( Σ y_i     )
and then solved by multiplying both sides by the inverse,

( m )   ( Σ x_i²   Σ x_i )⁻¹ ( Σ x_i y_i )
( c ) = ( Σ x_i    N     )   ( Σ y_i     )

The inverse can be computed analytically, or in Python it is trivial to compute the inverse numerically,
as follows.
Figure 5: Example data with a least squares fit to a straight line. A typical deviation from the straight line is illustrated.
4.2 Example Python Script
# Test least squares fitting by simulating some data.
import numpy as np
import matplotlib.pyplot as plt
nx = 20 # Number of data points
m = 1.0 # Gradient
c = 0.0 # Intercept
x = np.arange(nx,dtype=float) # Independent variable
y = m * x + c # dependent variable
# Generate Gaussian errors
sigma = 1.0 # Measurement error
np.random.seed(1) # init random no. generator
errors = sigma*np.random.randn(nx) # Gaussian distributed errors
ye = y + errors # Add the noise
plt.plot(x,ye,'o',label='data')
plt.xlabel('x')
plt.ylabel('y')
# Construct the matrices
ma = np.array([ [np.sum(x**2), np.sum(x)],[np.sum(x), nx ] ] )
mc = np.array([ [np.sum(x*ye)],[np.sum(ye)]])
# Compute the gradient and intercept
mai = np.linalg.inv(ma)
print('Test matrix inversion gives identity', np.dot(mai, ma))
md = np.dot(mai,mc) # matrix multiply is dot
# Overplot the best fit
mfit = md[0,0]
cfit = md[1,0]
plt.plot(x, mfit*x + cfit)
plt.axis('scaled')
plt.text(5,15,'m = {:.3f}\nc = {:.3f}'.format(mfit,cfit))
plt.savefig('lsq1.png')
See the figure below for the output of this program.
Figure 6. Least squares straight-line fit to the simulated data. The true values are m = 1 and c = 0.
4.3 Error Propagation
What are the uncertainties in the slope and the intercept? To begin the process of error propagation we
need the inverse matrix,

( Σ x_i²   Σ x_i )⁻¹          (  N       −Σ x_i  )
( Σ x_i    N     )   = (1/Δ)  ( −Σ x_i    Σ x_i² ),   where Δ = N Σ x_i² − (Σ x_i)²,

so that we can compute analytic expressions for m and c,

m = (N Σ x_i y_i − Σ x_i Σ y_i) / Δ,
c = (Σ x_i² Σ y_i − Σ x_i Σ x_i y_i) / Δ.
The analysis of error propagation shows that if z = z(x_1, x_2, …, x_N) and the individual measurements x_i
are uncorrelated (they have zero covariance), then the standard deviation of the quantity z is given by

σ_z² = Σ_i (∂z/∂x_i)² σ_i².
If the data points were correlated, then we would have to use the covariance matrix. The diagonal elements of
this matrix are the variances σ_ii², and the off-diagonal elements are the covariances σ_ij² = ⟨x_i x_j⟩ − ⟨x_i⟩⟨x_j⟩.
Thus, ignoring any possible covariance,

σ_m² = Σ_i (∂m/∂y_i)² σ_i².

The expression for the derivative of the gradient, m, is

∂m/∂y_j = (N x_j − Σ_i x_i) / Δ,

because ∂y_i/∂y_j = δ_ij, where δ_ij is the Kronecker delta. If we assume that the measurement error is the same
for each measurement (σ_i = σ), then

σ_m² = N σ² / Δ.
Similarly, for the intercept, c,

∂c/∂y_j = (Σ_i x_i² − x_j Σ_i x_i) / Δ,

and hence

σ_c² = σ² Σ_i x_i² / Δ.
If we do not know the standard deviation σ a priori, the best estimate is derived from the deviations from
the fit, i.e.,

σ² ≈ (1/(N − 2)) Σ_i (y_i − m x_i − c)².

Previously, when we computed the standard deviation, the mean was unknown and had to be estimated
from the data themselves; hence the Bessel correction factor of 1/(N − 1), because there are N − 1 degrees
of freedom. In the case of the straight-line fit there are two unknowns (m and c), so there are N − 2 degrees
of freedom.
5. Lab #1 Assignment
Below is the list of steps that you need to follow for the Lab #1 assignment. (Before you jump to them,
however, I recommend that you repeat the calculations and plots above using your own Python code as
an exercise.) Using basic built-in functions in Python (e.g., numpy.mean, numpy.std, etc.) is allowed if
not specified otherwise.
1. [10 pts] Reproduce Figure 3 using your own Python code. Your plot does not have to be
exactly the same as Figure 3 in every detail, but it must show a histogram compared with the
expected Poisson distribution.
2. In the following website,
https://drive.google.com/drive/folders/1OxiihoBAsvf05nLN-ZRv6D9RFJZFkqKn?usp=sharing
you can download the two files that are assigned to you. (The file names contain your last name and
Login ID.) One file is named “*Small.txt”, while the other is “*Large.txt.” Each file has 1000
measurements of photon count rates from a star. The difference between “Small” and “Large”
is the integration time: the latter has a much longer integration time than the former, so the count
rates in the “Large” file are larger than those in the “Small” file. (If there is no file for you, you need
to use the file named “Noname-noname*.”)
3. [10 pts] Read the two files in your Python code and calculate the mean and standard deviation
of the measurements recorded in each file (so you will have values for “Small” and “Large”
files respectively). Are they consistent with what you expect from a Poisson distribution?
4. [15 pts] Plot the measurements to examine their distributions. First, simply plot the
measurements in sequence and then as a histogram (like Figure 1 above). These analyses
and plots need to be done separately for the “Small” and “Large” measurements. By glancing at the
plots, can you roughly estimate the mean and standard deviation of the two data sets?
Compare the histogram that you created for the “Small” data with what is expected from a
Poisson distribution, as in Figure 3 above.
5. [15 pts] We learned in class that when the expected average is large, a Poisson distribution
becomes very similar to a Gaussian distribution. Let’s confirm this. First, comparing the
histograms that you created for the “Small” and “Large” measurements, do you think the
“Large” measurements look more like a Gaussian distribution than the “Small” measurements
(or vice versa)? Next, overplot the Gaussian distribution expected from the mean and
standard deviation that you calculated in Step 3 on the histograms that you created in Step 4.
Does the Gaussian distribution for the “Large” measurements look more similar to the data
than that for the “Small” measurements?
6. [30 pts] Calculation of the Hubble constant using the straight-line linear least squares fitting. In
the folder linked below
https://drive.google.com/drive/folders/1khjxHLR_PM0RaWpy-JfNtXQCuRI0Zlh9?usp=sharing
you can find a file named “[Your Last Name]-[Your Login ID]-Hubble.txt.” (If there is no
file for you, you need to use “Noname-noname-Hubble.txt.”) The file has two columns: the
first column is a list of the distances (in units of mega-parsecs [Mpc]) of 13 galaxies in the range
of 12000 Mpc; the second column gives the observed receding velocities of the galaxies in
units of km s⁻¹. Using the data in the file, calculate the Hubble constant with its uncertainty by
applying the linear least squares fitting method. You need to use your own code for this
assignment; in other words, you are NOT allowed to use a built-in Python library for the
fitting.
7. [20 pts for the quality of your report] Prepare your report in LaTeX and submit it
in PDF format. The Python code that you developed for the analyses and plots needs to be
attached to your report in the same PDF file. Clearly explain your results and show plots in
your report. You also need to specify in your report the names of the data files that you used.
This is a practical assignment which always takes longer than you expect. So, start early!