Exploratory Data Analysis

Now that we have our tidied dataset, we can conduct some exploratory analyses to answer the 3 following questions:

How has the distribution of movies that pass the Bechdel Test changed over time?
How do movies who pass differing levels of each Bechdel Test vary in terms of their budget?
How do movies who pass differing levels of each Bechdel Test vary in terms of their revenue (profit and ROI)?
Do movies that pass vs. fail the Bechdel Test differ in terms of fan or critic ratings?

Distribution of Bechdel Test Scores

First, we want to examine the distribution of movies according to Bechdel Test dimension. The code chunk below summarizes the proportion of movies in movies_df according to their Bechdel Test Score, as well as the proportion of movies that pass or fail the Bechdel test.

movies_df %>% 
  group_by(pass_bechdel) %>% 
  summarise(N = n()) %>% 
  mutate(Proportion = N/sum(N)) %>% 
  rename("Passes Bechdel Test" = pass_bechdel) %>% 
  knitr::kable(digits = 3) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Passes Bechdel Test	N	Proportion
FALSE	849	0.473
TRUE	945	0.527

movies_df %>% 
  group_by(bechdel_score) %>% 
  summarise(N = n()) %>% 
  mutate(Proportion = N/sum(N)) %>% 
  rename("Bechdel Test Criterion" = bechdel_score) %>% 
  knitr::kable(digits = 3) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Bechdel Test Criterion	N	Proportion
Don’t talk to each other	514	0.287
Dubious	142	0.079
Less than 2 women	141	0.079
Only talk about men	194	0.108
Passes Bechdel	803	0.448

The tables indicate slightly over half of the movies in the data pass the Bechdel Test, with 52.7% of movies passing and 47.3% of movies failing. Examining these categories in more detail, we can see that among those movies that don’t pass the Bechdel Test, most movies have women that don’t talk to each other. Among movies that pass the Bechdel Test, we can see only a small proportion of these movies have a “dubious” or debatable pass.

Bechdel Test Scores by decade

Next, we want to explore whether the number of movies passing the Bechdel Test have increased over time. The code chunk below plots the distribution of movies according to their Bechdel score by decade.

seventies_df = movies_df %>% 
  filter(decade == "1970-1979") %>% 
  group_by(decade, bechdel_score) %>% 
  summarise(N = n()) %>% 
  mutate(Proportion = N/sum(N))

eighties_df = movies_df %>% 
  filter(decade == "1980-1989") %>% 
  group_by(decade, bechdel_score) %>% 
  summarise(N = n()) %>% 
  mutate(Proportion = N/sum(N)) 

nineties_df = movies_df %>% 
  filter(decade == "1990-1999") %>% 
  group_by(decade, bechdel_score) %>% 
  summarise(N = n()) %>% 
  mutate(Proportion = N/sum(N))

thousands_df = movies_df %>% 
  filter(decade == "2000-2009") %>% 
  group_by(decade, bechdel_score) %>% 
  summarise(N = n()) %>% 
  mutate(Proportion = N/sum(N))

tens_df = movies_df %>% 
  filter(decade == "2010-2013") %>% 
  group_by(decade, bechdel_score) %>%  
  summarise(N = n()) %>%
  mutate(Proportion = N/sum(N))

decades = bind_rows(seventies_df, eighties_df, nineties_df, thousands_df, tens_df)

decades %>% 
  plot_ly(x = ~decade, y = ~Proportion*100, type = "bar", color = ~bechdel_score, colors = "RdYlGn") %>% 
  layout(barmode = "stack",
         title = "Distribution of movies passing Bechdel Test Criteria by decade",
         xaxis = list(title = "Decade"),
         yaxis = list(title = "Percentage (%)"))

This chart shows the breakdown of our sample of films stratified by decade. Looking at the plot, we can see that since 1970, an increasing proportion of movies pass the Bechdel Test, however this level has seemed to plateau around 54-55% since the 2000’s. This may be partially attributed to a smaller sample of films in the 2010s. Despite this, the proportion of movies that decisively pass the Bechdel Test (‘Dubious’ means contributors were skeptical about whether the films in question passed the test) remains under 50%.

Distribution of budgets according to Bechdel criteria

Next, we want to explore how movie budgets may vary according to the primacy of women’s roles in movies. Grouping by Bechdel Test score, we can compute the distribution of movie budget, adjusted to 2013 inflation.

movies_df %>% 
  mutate(pass_bechdel = ifelse(pass_bechdel == TRUE, "Pass", "Fail")) %>% 
  plot_ly(x = ~budget_2013, y = ~pass_bechdel, type = "box", text = ~title, color = ~pass_bechdel, colors = c("red4", "green4")) %>% 
  layout(xaxis = list(title = "Movie budget ($)"),
         yaxis = list(title = "Bechdel Test"), 
         showlegend = FALSE)

movies_df %>% 
  plot_ly(x = ~budget_2013, y = ~bechdel_score, type = "box", text = ~title, color = ~bechdel_score, colors = c("darkred","orange1", "green4")) %>% 
  layout(xaxis = list(title = "Movie budget ($)"),
         yaxis = list(title = "Bechdel Criterion"),
         showlegend = FALSE, 
         title = "Movie budgets (2013-adjusted) stratified by Bechdel Test criteria")

This chart shows the range of movie budgets stratified by criteria of the Bechdel Test. Movies that pass the Bechdel Test appear to have slightly smaller budgets than movies that don’t pass when we stratify on a binary basis. We notice that film budgets are fairly right-skewed, indicating we will need to log-transform this variable in subsequent analyses. When stratifying by the categorical Bechdel score, the budgets are not significantly different from each other, however, it appears that the interquartile range for movies that firmly pass the Bechdel Test is slightly lower than movies that don’t firmly pass the Bechdel Test. These numbers may suggest that Hollywood puts more money behind male-only films than films in which women talk to each other, however, this will be further explored in subsequent analyses

Distribution of revenue according to Bechdel criteria

Additionally, we want to explore how movie revenues may vary according to the primacy of women’s roles in movies. We can examine this using the profit and ROI variables in our data.

Profits

Starting with profit, which is the difference between intgross_2013 (2013-adjusted international gross revenue) and budget_2013 (2013-adjusted budget), we can plot the range of profits for each movie, stratified by their Bechdel Test criteria.

movies_df %>% 
  mutate(pass_bechdel = ifelse(pass_bechdel == TRUE, "Pass", "Fail")) %>% 
  plot_ly(x = ~profit, y = ~pass_bechdel, type = "box", text = ~title, color = ~bechdel_score, colors = c("darkred","orange1", "green4")) %>% 
  layout(boxmode = "group", 
         xaxis = list(title = "Profit ($)"),
         yaxis = list(title = "Bechdel Criterion"),
         showlegend = FALSE,
         title = "Movie profits stratified by Bechdel Test criteria")

Looking at the resulting plot, we see that the distribution of profits between movies passing vs. failing the Bechdel test are quite similar, with movies that fail the Bechdel Test reeling in slightly higher profits. The profits overall appear quite right-skewed, with some outliers bringing in similar amounts of profit (e.g. Star Wars and the Titanic). Given the skewed-ness of this outcome, we will need to log-transform this variable in subsequent analyses. When we stratify further by each dimension of the Bechdel Test, there are no notable differences in the profits between Bechdel criteria.

ROI

Next, let’s examine ROI, which is the ratio of profit to budget_2013. Because ROI is heavily right-skewed, we will log-transform this variable to better visualize differences in the ROI distribution between movies of differing Bechdel criteria.

movies_df %>% 
  mutate(pass_bechdel = ifelse(pass_bechdel == TRUE, "Pass", "Fail")) %>% 
  plot_ly(x = ~log(ROI), y = ~pass_bechdel, type = "box", text = ~title, color = ~bechdel_score, colors = c("darkred","orange1", "green4")) %>% 
  layout(boxmode = "group", 
         xaxis = list(title = "Log(Return on Investment)"),
         yaxis = list(title = "Bechdel Criterion"),
         showlegend = FALSE,
         title = "Movie ROI stratified by Bechdel Test criteria")

Examining the log(ROI) across Bechdel dimensions, the range of financial performance does not appear to differ substantially, as the median and IQR are approximately the same across all categories. Therefore, despite having slightly lower budgets, we can see that the primacy of women’s roles in movies do not negatively impact movies’ financial performance, and therefore refutes claims that films with stronger female characters perform any worse at the box office than films without them.

Bechdel Test by IMDB Rating

Although financial performance is an important indicator of success for films, we may also want to consider how well movies are received by fans and critics depending on the primacy of female roles. We can plot IMDB ratings (fan ratings) against Metacritic Scores (aggregate scores based on critic reviews) and stratify by Bechdel Score.

movies_df %>% 
  plot_ly(x = ~imdb_rating, y = ~metascore/10, color = ~bechdel_score, type = "scatter", 
          mode = "markers", colors = c("darkred","orange1", "green4"), text = ~title, alpha = 0.6) %>% 
  layout(yaxis = list(title = list(text = "Metascore", standoff = 5), gridcolor = "lightgray"),
         xaxis = list(title = "IMDB Rating"), gridcolor = "gray",
         title = "Movie IMDB and Metacritic Scores stratified by Bechdel Test criteria")

Based on the scatterplot, we do not see any discernable clusters that indicate movies that pass the Bechdel test perform better or worse than movies that do not pass the Bechdel Test. However, we do notice that there films with high outlying fan and critic scores appear to fail the Bechdel Test. Additionally, we see that there are more films passing the Bechdel Test with higher critic ratings relative to fan ratings, and more films that fail the Bechdel Test with lower critic ratings relative to fan ratings. This could perhaps be inherent biases held by fans to be more critical of films with female-dominant roles, while critics may be more likely to isolate their own biases while judging films.