Case Study 2 - Analyzing data from MovieLens
Data Science with R
Introduction
In this case study we will look at the movies data set from MovieLens. It contains data about users and how
they rate movies.
Problem 1: Importing the MovieLens data set and merging it into a single data frame
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mlData <- distinct(mlData)
print(colnames(mlData))
## [1] "user_id" "movie_title" "genre" "rating" "release_date"
## [6] "age" "gender" "occupation"
Report some basic details of the data you collected. For example:
• How many movies have an average rating over 4.5 overall?
– I found there were 11 movies with a mean rating over 4.5.
mlData_aggregates <- mlData %>%
group_by(movie_title) %>%
summarise(mean_rating = mean(rating, na.rm = TRUE))
high_mean_rating_count <- mlData_aggregates %>%
filter(mean_rating > 4.5) %>%
nrow()
mlData_aggregates %>%
filter(mean_rating > 4.5) %>%
head(high_mean_rating_count)
## # A tibble: 11 x 2
## movie_title mean_rating
## <fct> <dbl>
## 1 "Aiqing wansui (1994)" 5
## 2 "Entertaining Angels: The Dorothy Day Story (1996)" 5
## 3 "Great Day in Harlem, A (1994)" 5
## 4 "Marlene Dietrich: Shadow and Light (1996) " 5
## 5 "Pather Panchali (1955)" 4.62
## 6 "Prefontaine (1997)" 5
## 7 "Saint of Fort Washington, The (1993)" 5
## 8 "Santa with Muscles (1996)" 5
## 9 "Someone Else's America (1995)" 5
## 10 "Star Kid (1997)" 5
## 11 "They Made Me a Criminal (1939)" 5
• How many movies have an average rating over 4.5 among men?
– I found there were 18 movies with a mean rating over 4.5 among men.
mlData_aggregates <- mlData %>%
group_by(movie_title) %>%
filter(gender == "M") %>%
summarise(mean_rating_men = mean(rating, na.rm = TRUE)) %>%
full_join(mlData_aggregates)
## Joining, by = "movie_title"
high_mean_men_count <- mlData_aggregates %>%
filter(mean_rating_men > 4.5) %>%
nrow()
mlData_aggregates %>%
arrange(desc(mean_rating_men)) %>%
select(-mean_rating) %>%
head()
## # A tibble: 6 x 2
## movie_title mean_rating_men
## <fct> <dbl>
## 1 Aiqing wansui (1994) 5
## 2 Delta of Venus (1994) 5
## 3 Entertaining Angels: The Dorothy Day Story (1996) 5
## 4 Great Day in Harlem, A (1994) 5
## 5 Leading Man, The (1996) 5
## 6 Letter From Death Row, A (1998) 5
• How many movies have an average rating over 4.5 among women?
– I found by using a similar approach to the above that there were 16 movies with a mean rating
over 4.5 among women.
## Joining, by = "movie_title"
## # A tibble: 6 x 2
## movie_title mean_rating_women
## <fct> <dbl>
## 1 Everest (1998) 5
## 2 Faster Pussycat! Kill! Kill! (1965) 5
## 3 Foreign Correspondent (1940) 5
## 4 Maya Lin: A Strong Clear Vision (1994) 5
## 5 Mina Tannenbaum (1994) 5
## 6 Prefontaine (1997) 5
• Let us order by mean rating but keep men/women mean rating columns for comparison. Note that
some movies were not rated by both men and women.
## # A tibble: 10 x 4
## movie_title mean_rating_wom~ mean_rating_men mean_rating
## <fct> <dbl> <dbl> <dbl>
## 1 "Prefontaine (1997)" 5 5 5
## 2 "Someone Else's America (1995)" 5 NA 5
## 3 "Aiqing wansui (1994)" NA 5 5
## 4 "Entertaining Angels: The Dorot~ NA 5 5
## 5 "Great Day in Harlem, A (1994)" NA 5 5
## 6 "Marlene Dietrich: Shadow and L~ NA 5 5
## 7 "Saint of Fort Washington, The ~ NA 5 5
## 8 "Santa with Muscles (1996)" NA 5 5
## 9 "Star Kid (1997)" NA 5 5
## 10 "They Made Me a Criminal (1939)" NA 5 5
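The code for the comparison table above is not shown. One way to build the overall, men's, and women's means in a single pass is sketched below. It is only an illustration: `toy_ratings` and `mean_or_na` are hypothetical names, the values are made up, and the gender codes "M" and "F" are assumed from the standard MovieLens files.

```r
library(dplyr)

# Toy ratings standing in for mlData (hypothetical values; gender assumed "M"/"F").
toy_ratings <- data.frame(
  movie_title = c("A", "A", "A", "B", "B", "C"),
  gender      = c("M", "F", "M", "M", "M", "F"),
  rating      = c(5, 4, 3, 2, 4, 5)
)

# Hypothetical helper: return NA (not NaN) when a group gave no ratings.
mean_or_na <- function(x) if (length(x) == 0) NA_real_ else mean(x)

gender_means <- toy_ratings %>%
  group_by(movie_title) %>%
  summarise(
    mean_rating       = mean(rating),
    mean_rating_men   = mean_or_na(rating[gender == "M"]),
    mean_rating_women = mean_or_na(rating[gender == "F"])
  ) %>%
  arrange(desc(mean_rating))

gender_means
```

Computing all three columns inside one `summarise()` avoids the repeated `full_join()` calls, and movies rated by only one gender still appear with `NA` in the other column.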
• How many movies have a median rating over 4.5 among men over age 30?
– I found there were 47 movies with a median rating over 4.5 among men over 30.
mlData_aggregates <- mlData %>%
group_by(movie_title) %>%
filter(gender == "M", age > 30) %>%
summarise(median_rating_men30plus = median(rating)) %>%
full_join(mlData_aggregates)
## Joining, by = "movie_title"
high_median_men30plus_count <- mlData_aggregates %>%
filter(median_rating_men30plus > 4.5) %>%
nrow()
mlData_aggregates %>%
arrange(desc(median_rating_men30plus)) %>%
select(movie_title, median_rating_men30plus) %>%
head(10)
## # A tibble: 10 x 2
## movie_title median_rating_men30plus
## <fct> <dbl>
## 1 Aiqing wansui (1994) 5
## 2 Anna (1996) 5
## 3 Aparajito (1956) 5
## 4 Big Sleep, The (1946) 5
## 5 Casablanca (1942) 5
## 6 Citizen Kane (1941) 5
## 7 Close Shave, A (1995) 5
## 8 Delta of Venus (1994) 5
## 9 Entertaining Angels: The Dorothy Day Story (1996) 5
## 10 Faithful (1996) 5
• How many movies have a median rating over 4.5 among women over age 30?
– I found using a similar approach to the above that there were 70 movies with a median rating over 4.5
among women over 30.
## Joining, by = "movie_title"
## [1] 70
## # A tibble: 10 x 2
## movie_title median_rating_women30plus
## <fct> <dbl>
## 1 Amateur (1994) 5
## 2 Angel Baby (1995) 5
## 3 Bent (1997) 5
## 4 Best Men (1997) 5
## 5 Big Lebowski, The (1998) 5
## 6 Blade Runner (1982) 5
## 7 Brassed Off (1996) 5
## 8 Braveheart (1995) 5
## 9 Casablanca (1942) 5
## 10 Cats Don't Dance (1997) 5
• For comparison, order by median rating but keep men/women over 30 median rating columns. Note
here that some movies were not rated by both men over 30 and women over 30.
## Joining, by = "movie_title"
## # A tibble: 10 x 4
## movie_title median_rating median_rating_men~ median_rating_wome~
## <fct> <dbl> <dbl> <dbl>
## 1 Aiqing wansui (1994) 5 5 NA
## 2 Aparajito (1956) 5 5 NA
## 3 Casablanca (1942) 5 5 5
## 4 Citizen Kane (1941) 5 5 4
## 5 Close Shave, A (1995) 5 5 5
## 6 Entertaining Angels: Th~ 5 5 NA
## 7 Faust (1994) 5 5 NA
## 8 Godfather, The (1972) 5 5 4
## 9 Great Day in Harlem, A ~ 5 5 NA
## 10 Hugo Pool (1997) 5 1 NA
• What are the ten most “popular” movies?
– Perhaps we might consider a movie popular if it has both a high mean and median rating. I found
there were many films with a mean rating of 5 and many films with a median rating of 5. Upon
finding the intersection of these two sets, it turned out that there were 10 films with both a mean
rating of 5 and a median rating of 5. Without considering rating count, we could propose that the
top ten most popular films are these 10 films with median and mean rating of 5.
mlData_popular <- mlData_aggregates %>%
filter(
(mean_rating == 5) & (median_rating == 5)
)
mlData_popular %>%
select(movie_title, median_rating, mean_rating) %>%
head(nrow(mlData_popular))
## # A tibble: 10 x 3
## movie_title median_rating mean_rating
## <fct> <dbl> <dbl>
## 1 "Aiqing wansui (1994)" 5 5
## 2 "Entertaining Angels: The Dorothy Day Story (1996)" 5 5
## 3 "Great Day in Harlem, A (1994)" 5 5
## 4 "Marlene Dietrich: Shadow and Light (1996) " 5 5
## 5 "Prefontaine (1997)" 5 5
## 6 "Saint of Fort Washington, The (1993)" 5 5
## 7 "Santa with Muscles (1996)" 5 5
## 8 "Someone Else's America (1995)" 5 5
## 9 "Star Kid (1997)" 5 5
## 10 "They Made Me a Criminal (1939)" 5 5
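The ranking above deliberately ignores how often each film was rated. A variant that requires a minimum number of ratings before a film can count as popular might look like the sketch below. The cutoff `min_count` and the `toy_aggregates` data are assumptions for illustration and are not part of the case study.

```r
library(dplyr)

# Hypothetical per-movie aggregates standing in for mlData_aggregates.
toy_aggregates <- data.frame(
  movie_title  = c("A", "B", "C", "D"),
  mean_rating  = c(5.0, 4.4, 4.1, 4.6),
  rating_count = c(2, 300, 150, 40)
)

min_count <- 50  # assumed cutoff, not taken from the case study

popular <- toy_aggregates %>%
  filter(rating_count >= min_count) %>%  # drop rarely rated films first
  arrange(desc(mean_rating)) %>%         # then rank by mean rating
  head(10)

popular
```

With this cutoff, film "A" (mean 5 from only 2 ratings) no longer outranks films whose high means rest on hundreds of ratings.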
• Make some conjectures about how easy various groups are to please!
– Question: Does the mean rating of all films depend on the age of the reviewer?
– Answer: It appears not, at least without grouping by further characteristics.
[Figure: Average Film Rating vs. Critic Age (subtitle: Age does not strongly affect overall film ratings). x-axis: Critic Age; y-axis: Mean Film Rating.]
• Question: Do some film genres just generally receive higher ratings than other genres? Do some film
genres perform well with certain groups but poorly with other groups?
• Answer: The highest rated genre (using mean rating of all films within each genre) is Film-Noir with a
mean rating of about 3.92 while the lowest rated genre is Fantasy with a mean rating of about 3.22.
• Answer: Usually there is not much difference between ratings by men vs women but men tend to enjoy
Film-Noir more than women while women enjoy Musicals more than men.
• Answer: Children (critics aged less than 16) enjoy war, sci-fi, and animation films the most.
## Joining, by = "genre"
## Joining, by = "genre"
[Figure: Considering Mean Film Rating by Genre (subtitle: The top 3 highest and 3 lowest rated genres). x-axis: Film Genre (Fantasy, Horror, Childrens, Drama, War, Film-Noir); y-axis: Mean Film Rating.]
## # A tibble: 5 x 3
## genre mean_rating_men mean_rating_women
## <fct> <dbl> <dbl>
## 1 Film-Noir 3.97 3.74
## 2 Mystery 3.67 3.56
## 3 Western 3.64 3.51
## 4 Musical 3.47 3.64
## 5 Childrens 3.32 3.43
[Figure: Highest average rated genres among children. x-axis: Film Genre (Thriller, Western, Romance, Horror, Crime, Animation, Sci-Fi, War); y-axis: Mean Film Rating.]
• Question: How do critic ratings depend on how old a film is?
• Answer: Apparently they do not, but there appears to be more variance in ratings among newer films.
[Figure: Mean Film Ratings by Release Date. x-axis: Release Date; y-axis: Mean Rating.]
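Because most release dates are distinct days, averaging by release year may make any trend easier to read than one point per date. The sketch below shows one way to do that in base R. It assumes `release_date` holds "YYYY-MM-DD" strings (possibly stored as a factor); the `toy` data frame and its values are hypothetical.

```r
# Hypothetical rows standing in for mlData; release_date is assumed to be
# "YYYY-MM-DD" text stored as a factor.
toy <- data.frame(
  release_date = factor(c("1995-01-01", "1995-08-14", "1996-03-08",
                          "1996-12-25", "1942-01-01")),
  rating       = c(4, 3, 5, 2, 5)
)

# Make the conversion explicit: factor -> character -> Date -> integer year.
toy$release_year <- as.integer(
  format(as.Date(as.character(toy$release_date)), "%Y")
)

# Mean rating per release year, sorted by year.
yearly <- aggregate(rating ~ release_year, data = toy, FUN = mean)
yearly
```

Plotting `yearly` on a numeric year axis would also avoid printing one axis label per distinct date.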
• Question: How does occupation relate to mean ratings?
• Answer: The occupations that tend to rate movies the highest are the unemployed, doctors, lawyers,
educators, and artists. Since the unemployed are the easiest to please we might consider focusing on
this group. However, the unemployed may not have as much money to spend on movies, so consider
next what types of films are best liked by lawyers.
[Figure: Easiest Types of Occupations to Entertain (subtitle: The 5 highest mean movie ratings when grouped by occupation). x-axis: Occupation (artist, educator, doctor, lawyer, none); y-axis: Mean Movie Rating Assigned.]
[Figure: Favorite Genres of Lawyer. x-axis: Genre (Adventure, Musical, Horror, Documentary, Mystery, War, Film-Noir); y-axis: Mean Rating.]
Problem 2: Expand our investigation to histograms
An obvious issue with any inferences drawn from Problem 1 is that we did not consider how
many times a movie was rated.
• Plot a histogram of the ratings of all movies.
[Figure: Movie Ratings (subtitle: Rating distribution among all observations). x-axis: Critic Rating; y-axis: Frequency.]
• Plot a histogram of the number of ratings each movie received.
## Joining, by = "movie_title"
[Figure: Movie Rating Frequencies Distribution. x-axis: Number of Times Rated; y-axis: Frequency.]
[Figure: Movie Rating Frequencies Distribution (subtitle: Frequencies for movies rated 500 times or fewer). x-axis: Number of Times Rated; y-axis: Frequency.]
• Plot a histogram of the average rating for each movie.
• Plot a histogram of the average rating for movies which are rated more than 100 times.
– Notice that when we include movies with 100 or fewer ratings, there are more mean ratings on the
ends of the distribution. So when we reduce the dataset to just films with more than 100 ratings,
the distribution of rating means tends to have lower variance.
– Generally speaking, a high mean rating is more trustworthy for a movie with a high number
(> 100) of critic ratings than for a movie with a low number (<= 100) of critic ratings. The
reason is that infrequently rated movies are more likely to have a very high or very low mean,
while frequently rated films are more likely to have a moderate mean rating (between 2 and 4).
So a high mean for a frequently rated film is unlikely to change much as more ratings arrive,
while a high mean for an infrequently rated film may be due to chance.
[Figure: Distribution of Mean Rating of Films. x-axis: Mean Rating; y-axis: Frequency.]
[Figure: Distribution of Mean Rating of Films (subtitle: Considering films with more than 100 critic ratings). x-axis: Mean Rating; y-axis: Frequency.]
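One common way to act on the point that means based on few ratings are less trustworthy is a damped (shrunken) mean, which pulls each film's mean toward the global mean and pulls harder when the film has few ratings. This technique is not used in the case study. The sketch below is an illustration only: `damped_mean`, the pseudo-count `m = 25`, and the global mean of 3.5 are all assumed.

```r
# Damped mean: behaves like adding m imaginary ratings equal to the global mean.
# m is a hypothetical tuning constant; larger m means stronger shrinkage.
damped_mean <- function(mean_rating, rating_count, global_mean, m = 25) {
  (rating_count * mean_rating + m * global_mean) / (rating_count + m)
}

global_mean <- 3.5  # assumed value for illustration

few_ratings  <- damped_mean(5.0, rating_count = 2,   global_mean)  # about 3.61
many_ratings <- damped_mean(4.4, rating_count = 300, global_mean)  # about 4.33

c(few = few_ratings, many = many_ratings)
```

Under these assumptions the film with a perfect mean from 2 ratings ranks below the film with a 4.4 mean from 300 ratings, which matches the intuition described above.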
• Make some conjectures about the distribution of ratings!
– Question: We saw that movies with a large number of ratings or few ratings may tend to have
more extreme results. Do films with a large number of ratings do better or worse than those with
a moderate number of ratings? What about films with very few ratings?
– Answer: It looks like films that are rated often tend to have a higher mean rating compared to the
whole dataset while films that have very few ratings have a lower mean rating compared to the
whole dataset.
count_deciles = quantile(mlData_aggregates$rating_count, c(.1, .2, .3, .4, .5, .6, .7, .8, .9))
count_deciles
## 10% 20% 30% 40% 50% 60% 70% 80% 90%
## 2.0 6.0 12.0 22.0 42.0 65.6 108.0 188.4 363.9
oft_rated_films_df <- mlData %>%
group_by(movie_title) %>%
summarise(
rating_count = n(),
median_rating = median(rating),
mean_rating = mean(rating)
) %>%
filter(rating_count > count_deciles[9])
head(oft_rated_films_df)
## # A tibble: 6 x 4
## movie_title rating_count median_rating mean_rating
## <fct> <int> <dbl> <dbl>
## 1 2001: A Space Odyssey (1968) 1036 4 3.97
## 2 Abyss, The (1989) 604 4 3.59
## 3 African Queen, The (1951) 608 4 4.18
## 4 Air Force One (1997) 862 4 3.63
## 5 Aladdin (1992) 876 4 3.81
## 6 Alien (1979) 1164 4 4.03
rarely_rated_films_df <- mlData %>%
group_by(movie_title) %>%
summarise(
rating_count = n(),
median_rating = median(rating),
mean_rating = mean(rating)
) %>%
filter(rating_count < count_deciles[3])
head(rarely_rated_films_df)
## # A tibble: 6 x 4
## movie_title rating_count median_rating mean_rating
## <fct> <int> <dbl> <dbl>
## 1 1-900 (1994) 5 3 2.6
## 2 3 Ninjas: High Noon At Mega Mountain (~ 10 1 1
## 3 8 Heads in a Duffel Bag (1997) 4 4 3.25
## 4 8 Seconds (1994) 4 4 3.75
## 5 A Chef in Love (1996) 8 4 4.12
## 6 Á köldum klaka (Cold Fever) (1994) 2 3 3
mean(oft_rated_films_df$mean_rating)
## [1] 3.631668
mean(rarely_rated_films_df$mean_rating)
## [1] 2.621521
mean(mlData_aggregates$mean_rating)
## [1] 3.078132
Problem 3: Correlation: Men versus women
Let us look more closely at the relationship between the pieces of data we have.
• Make a scatter plot of men versus women and their mean rating for every movie.
• Make a scatter plot of men versus women and their mean rating for movies rated more than 200 times.
• Compute the correlation coefficient between the ratings of men and women.
– When we compare mean ratings between men and women while including movies with 100 or fewer
ratings, the correlation between mean rating among men and mean rating among women appears
positive but not very strong for prediction. The correlation coefficient in this case is 0.5149489.
When considering movies with more than 100 ratings the correlation is stronger with a correlation
coefficient in this case of 0.8042434.
– Considering movies with more than 100 ratings, the relationship between mean men rating and
mean women rating is linear for the most part. This is more true near the mean of the mean
ratings (about 3.5) where a rating of about 3.5 among men corresponds to a mean rating of about
3.5 among women. The relation appears not quite linear for high and low mean ratings.
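The code behind the two correlation coefficients is not shown. A minimal sketch of the computation, with and without a rating-count filter, is given below. The data frame `toy_aggregates` and the helper `gender_cor` are hypothetical, and the column names are assumed to match those used elsewhere in this report.

```r
# Hypothetical per-movie aggregates; NA marks a movie one gender never rated.
toy_aggregates <- data.frame(
  mean_rating_men   = c(3.0, 4.0, 5.0, 2.0, NA, 4.5),
  mean_rating_women = c(3.2, 3.9, 1.0, NA, 4.0, 4.4),
  rating_count      = c(250, 400, 3, 5, 2, 180)
)

# Correlation of men's and women's means over movies rated more than
# min_count times, skipping pairs where either mean is missing.
gender_cor <- function(df, min_count = 0) {
  kept <- df[df$rating_count > min_count, ]
  cor(kept$mean_rating_men, kept$mean_rating_women,
      use = "pairwise.complete.obs")
}

cor_all  <- gender_cor(toy_aggregates)                   # every movie
cor_freq <- gender_cor(toy_aggregates, min_count = 100)  # frequently rated only

c(all = cor_all, frequent = cor_freq)
```

In this toy data the rarely rated movies carry the disagreement between the two groups, so the filtered correlation is higher, the same direction of effect reported above.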
## Warning: Removed 217 rows containing missing values (geom_point).
[Figure: Relationship Between Ratings by Women vs. Men (subtitle: Mean rating comparison for each film). x-axis: Mean Rating Among Men; y-axis: Mean Rating Among Women; color: Number of Ratings.]
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
[Figure: Relationship Between Ratings by Women vs. Men (subtitle: Mean rating comparison among films with over 200 ratings). x-axis: Mean Rating Among Men; y-axis: Mean Rating Among Women; color: Number of Ratings.]
• Conjecture under what circumstances the rating given by one gender can be used to predict the rating
given by the other gender.
– Question: Are men and women more similar when they are younger or older?
mlData_aggregates <- mlData %>%
group_by(movie_title) %>%
filter(gender == "M", age > 40) %>%
summarise(mean_rating_men40plus = mean(rating)) %>%
full_join(mlData_aggregates)
## Joining, by = "movie_title"
mlData_aggregates <- mlData %>%
group_by(movie_title) %>%
# Assumes the standard MovieLens coding "M"/"F"; filtering on "W" would match no rows
filter(gender == "F", age > 40) %>%
summarise(mean_rating_women40plus = mean(rating)) %>%
full_join(mlData_aggregates)
## Joining, by = "movie_title"
mlData_aggregates <- mlData %>%
group_by(movie_title) %>%
filter(gender == "M", age < 30) %>%
summarise(mean_rating_men30minus = mean(rating)) %>%
full_join(mlData_aggregates)
## Joining, by = "movie_title"
mlData_aggregates <- mlData %>%
group_by(movie_title) %>%
# Assumes the standard MovieLens coding "M"/"F"; filtering on "W" would match no rows
filter(gender == "F", age < 30) %>%
summarise(mean_rating_women30minus = mean(rating)) %>%
full_join(mlData_aggregates)
## Joining, by = "movie_title"
old_scatterplot <- mlData_aggregates %>%
ggplot(aes(x = mean_rating_men40plus, y = mean_rating_women40plus)) +
geom_point(aes(color = rating_count), alpha = .5) +
labs(
title = "Relationship Between Ratings by Women vs. Men Over 40",
x = "Mean Rating Among Men",
y = "Mean Rating Among Women",
color = "Number of Ratings"
)
old_scatterplot
## Warning: Removed 1662 rows containing missing values (geom_point).
[Figure: Relationship Between Ratings by Women vs. Men Over 40. x-axis: Mean Rating Among Men; y-axis: Mean Rating Among Women; color: Number of Ratings.]
young_scatterplot <- mlData_aggregates %>%
ggplot(aes(x = mean_rating_men30minus, y = mean_rating_women30minus)) +
geom_point(aes(color = rating_count), alpha = .5) +
labs(
title = "Relationship Between Ratings by Women vs. Men Under 30",
x = "Mean Rating Among Men",
y = "Mean Rating Among Women",
color = "Number of Ratings"
)
young_scatterplot
## Warning: Removed 1662 rows containing missing values (geom_point).
[Figure: young_scatterplot, rendered with the title "Relationship Between Ratings by Women vs. Men Over 40". x-axis: Mean Rating Among Men; y-axis: Mean Rating Among Women; color: Number of Ratings.]
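# The per-group means are NA for movies a group never rated, so skip
# incomplete pairs; without `use`, cor() returns NA.
gender_mean_corr_old = cor(
mlData_aggregates$mean_rating_men40plus,
mlData_aggregates$mean_rating_women40plus,
use = "pairwise.complete.obs"
)
gender_mean_corr_young = cor(
mlData_aggregates$mean_rating_men30minus,
mlData_aggregates$mean_rating_women30minus,
use = "pairwise.complete.obs"
)
# Print both coefficients so the two age groups can be compared
c(old = gender_mean_corr_old, young = gender_mean_corr_young)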
Problem 4: Open Ended Question: Business Intelligence
• From the exploration, I would suggest marketing films to lawyers, doctors, and educators of both
genders. If we have discovered anything from this dataset, it is that men and women really do not
differ significantly in their preferences and trying to make business decisions based on this factor is
not recommended. Consider marketing Film-Noir and War Movies. If the film is released and not well
received by the first few critics, elicit ratings from more critics. Generally, by increasing the number of
ratings, the film is likely to improve its overall mean and median ratings.