Describing distributions of data
Assignment Overview
There are a variety of conventional ways to visualize data - tables, histograms, bar graphs, etc. The purpose is
always to examine the distribution of variables related to your research question. For each variable you will
create a plot, follow it with a table of summary statistics (for quantitative variables) or a frequency and
proportion table (for categorical variables), and then write a summary paragraph that brings it all together.
Instructions
• Use the template provided: [RMD].
• Completely describe 2 categorical and 2 quantitative variables using
– A table of summary statistics,
– An appropriate plot with a title and axis labels,
– A short paragraph description in complete English sentences.
Guidance
• What is the trend in the data? What exactly does the chart show? (Use the chart title to help you
answer this question)
• Describe the shape:
– Symmetry/Skewness - Is it symmetric, skewed right, or skewed left?
– Modality - Is it uniform, unimodal, or bimodal?
• Describe the spread:
– Variability - What is the approximate range of the data (x-axis)?
– Does the variable have a lot of variability (visually, did the participants give many different
responses or mainly just one)?
• Describe the center: What is the mean/median/midpoint of the data? (Pick one or two). Don’t
• Describe the outliers (note: there may not be any for every graph):
– Are there any outliers for the variable?
– If yes, are these true outliers or false (due to data management or input error) outliers?
• Reread your explanation for context, grammar, spelling and common sense.
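The checklist above can be backed up with a few numbers before writing the paragraph. The following is a minimal sketch, not a required method: it assumes the mpg data set from the ggplot2 package as a stand-in for your own data, and it uses the common 1.5 × IQR rule to flag possible outliers. The object names (x, center, spread, fences, outliers) are illustrative only.

```r
# Assumption: the mpg data from ggplot2 stands in for your own cleaned data.
mpg <- ggplot2::mpg
x <- mpg$cty  # city miles per gallon, a quantitative variable

# Center: report the mean and median; comparing them gives a rough hint about skew.
center <- c(mean = mean(x), median = median(x))

# Spread: range, standard deviation, and interquartile range.
spread <- c(min = min(x), max = max(x), sd = sd(x), iqr = IQR(x))

# Possible outliers: values more than 1.5 * IQR beyond the quartiles.
q <- quantile(x, c(0.25, 0.75))
fences <- c(lower = q[[1]] - 1.5 * IQR(x), upper = q[[2]] + 1.5 * IQR(x))
outliers <- x[x < fences["lower"] | x > fences["upper"]]

print(round(center, 1))
print(round(spread, 1))
print(sort(outliers))
```

Any values this flags still need a judgment call: check whether each one is a plausible measurement or a data entry error before describing it.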
Submission
1. Upload the final PDF to the 04 Univariate Graphing folder in Google Drive with the file name
univ_graphing_userid.pdf by the due date.
Example
This example uses the mpg data set from the ggplot2 package.
library(sjPlot) # For plotting using the sjp.frq() function
library(ggplot2) # For plotting using ggplot() function
library(knitr) # To make nice tables
library(descr) # For plotting using the freq() function
mpg <- ggplot2::mpg # you would load() your clean data here
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) # options to suppress warnings and messages
Example of a basic-level answer for a categorical variable
This example shows a draft style plot, direct computer output showing/copied. Poor grammar and/or sentence
structure, no attempt at explaining what the variable means, extra unnecessary or incorrect information
included. Typos.
class
freq(mpg$class)
[Plot: default bar chart of mpg$class frequencies, count axis from 0 to 60]
## mpg$class
## Frequency Percent
## 2seater 5 2.137
## compact 47 20.085
## midsize 41 17.521
## minivan 11 4.701
## pickup 33 14.103
## subcompact 35 14.957
## suv 62 26.496
## Total 234 100.000
theres more suvs than compacts. 2% are 2seaters. there are 5 2seaters 47 cmpact 41 midize 11
minivans 33 pickups 35% subcompacts, 62 suv and 234 total cars.
Example of a proficient-level answer for a categorical variable
This example has a cleaned up plot, full English sentences, useful text formatting of variable names and
levels. Explained what the variable was named and what it measured.
The class variable from the mpg data set is a categorical variable that describes the type of vehicle
being measured. Some levels of this categorical variable include compact, pickup and suv.
set_theme(base = theme_classic())
sjp.frq(mpg$class)
[Plot: bar chart of class with count and percent labelled on each bar: 2seater 5 (2.1%), compact 47 (20.1%),
midsize 41 (17.5%), minivan 11 (4.7%), pickup 33 (14.1%), subcompact 35 (15.0%), suv 62 (26.5%); x-axis: class]
SUVs are the most frequently reported type of car, making up over one-quarter
(26.5%) of the cars in this data set with n=62 cars represented. The least represented type is the
2seater with n=5 (2.1%) records.
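A frequency and proportion table makes it easy to check statements like these against the data. The sketch below is one possible way to build such a table in base R, again assuming the mpg data from ggplot2. The name freq_table and its column names are illustrative, not part of the template.

```r
mpg <- ggplot2::mpg  # assumption: stands in for your own cleaned data

counts <- table(mpg$class)             # frequency of each level
percents <- prop.table(counts) * 100   # share of all records, in percent

freq_table <- data.frame(
  class = names(counts),
  n = as.vector(counts),
  percent = round(as.vector(percents), 1)
)

# Sort so the most common level comes first and the least common comes last.
freq_table <- freq_table[order(-freq_table$n), ]
print(freq_table, row.names = FALSE)
```

Passing freq_table to knitr::kable() in the R Markdown file should render it as a formatted table rather than raw console output.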
Example of a basic-level answer for a quantitative variable
No English description provided, no verbal explanation of what information was gained from these plots.
ggplot(mpg, aes(cty)) + geom_histogram()
[Plot: default histogram of cty; x-axis: cty, y-axis: count]
summary(mpg$cty)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 14.00 17.00 16.86 19.00 35.00
Example of a proficient-level answer for a quantitative variable
This example uses a histogram with overlaid density curve, and a boxplot to understand the shape, location
and to look for outliers. Table of summary statistics present in a nicely formatted way, digits rounded
appropriately. Plot cleaned up with appropriate axis labels and title.
The cty variable records the miles per gallon (mpg) achieved during city driving. This is a
quantitative numeric variable.
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
ggplot(mpg, aes(x=cty)) + geom_histogram(aes(y=after_stat(density)),
fill="grey", binwidth = 2) +
geom_density() + xlab("MPG") +
ggtitle("City miles per gallon (MPG)")
[Plot: histogram with overlaid density curve, titled "City miles per gallon (MPG)"; x-axis: MPG, y-axis: density]
boxplot(mpg$cty)
[Plot: boxplot of cty, axis from 10 to 30]
kable(t(c(summary(mpg$cty), sd=sd(mpg$cty))), digits=1)
Min.  1st Qu.  Median  Mean  3rd Qu.  Max.  sd
9     14       17      16.9  19       35    4.3
City MPG ranges from 9 to 35. The distribution is unimodal and slightly skewed right, with a mean of
16.9 close to the median of 17 and a standard deviation of 4.3 mpg. The boxplot indicates that
there are at least 4 upper-end outliers achieving a city MPG above approximately 28.