Female Representation in Film: Does Girl Power Translate to Success?

Topic

An analysis of movies released between 1970 and 2013 that pass the Bechdel test, examining their association with movie budget, profit, and other factors.

Motivation

The Bechdel test is an indicator of the active presence of women in the entire field of film and other fiction, and calls attention to gender inequality in fiction. Some studies indicate that films that pass the test perform better financially while receiving a lower budget, compared to those that do not. Thus, through this project, we intend:

  • To examine the number of movies that pass the Bechdel test over time (1970 - 2013)
  • To examine the relationship between female representation in movies and movie budget
  • To examine the relationship between female representation in movies and movie profit

Inspiration

Female representation is growing day by day in every sector, with particular attention given to it in film. We, a group of movie nerdgirls, wanted to look for progressive movies that signify higher representation of women or in which women are completely centered in the story. The Bechdel Test has been a simple and popular measure of feminism in works of fiction. The rules of the test were first defined back in 1985 in a comic strip featuring two queer women who could not find a movie that

  • Had at least two women in it
  • who talked to each other,
  • about something other than a man.
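The three rules can be read as a cascade of checks, where a movie is classified by the first rule it fails. The sketch below is only an illustration of that logic: the function `bechdel_category` and its argument names are hypothetical, and the labels are the category names used in the data cleaning step of this report. It does not cover the "Dubious" category, which reflects disagreement among raters rather than a rule.

```r
# Hypothetical helper, not part of the original analysis.
# n_women:          number of women in the movie
# women_talk:       TRUE if at least two women talk to each other
# topic_not_a_man:  TRUE if they talk about something other than a man
bechdel_category <- function(n_women, women_talk, topic_not_a_man) {
  if (n_women < 2) return("Less than 2 women")
  if (!women_talk) return("Don't talk to each other")
  if (!topic_not_a_man) return("Only talk about men")
  "Passes Bechdel"
}

bechdel_category(1, FALSE, FALSE)  # fails the first rule
bechdel_category(3, TRUE, FALSE)   # fails the third rule
bechdel_category(2, TRUE, TRUE)    # passes all three rules
```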

We came across articles talking about the importance of these tests in the industry, and some previous studies that used this test to determine if the movie had significant representation for female character(s). Using this information, we further wanted to explore the prioritization of these movies in the industry and their success through an examination of their budget, profits, and ratings.

Research Questions

Initially, our motivation for this project was prompted by the following questions:

  • How many movies from 1970-2013 pass the Bechdel test and has the number changed over time?
  • What is the association between the movies that pass the Bechdel test and the movie’s budget?
  • What is the association between the movies that pass the Bechdel test and the movie’s profit?

As we delved into the data, we further wanted to explore:

  • What is the association between movies that pass the Bechdel test and their review ratings?

Data Sources

  • FiveThirtyEight: The FiveThirtyEight Bechdel dataset contains data about 1794 films released from 1970 to 2013. We used this data to examine the relationship between the prominence of women in a film and that film’s budget and profits. The dataset consists of variables including title, year, budget, domestic and international gross revenue, budget and revenues adjusted for 2013 inflation, and the Bechdel test (evaluated both as binary pass/fail and categorically evaluating the three criteria).

  • Bechdel Test: This database consists of a list of 9802 movies released between 1874 and 2022 (as of Dec 8th, 2022), and results for whether they pass or fail the Bechdel test. Variables include title of the film, date when the film was added, reviews and comments, and IMDB ratings. The data can be accessed through an API.

  • The-Numbers: This is a film industry data website that tracks box office revenue in a systematic and algorithmic way. The budget and revenue variables for the movies come from this website. The data can be accessed through an API.

  • IMDb: The Internet Movie Database (IMDb) is an online database containing information and statistics about movies, TV shows, and video games as well as actors, directors, and other film industry professionals. We used the information about the year of release, runtime, genre, language and rating of the films.

  • Tidy Tuesday Project: This is the main source for our dataset. The project provides weekly releases of raw datasets for users to wrangle and analyse. The data itself originates from FiveThirtyEight, covers movies from 1970 to 2013, and merges data from the other sources: IMDb, The-Numbers.com, and BechdelTest.com.

Data Cleaning

We imported the following datasets from the Tidy Tuesday project using read_csv from the readr library. The data were initially split into two files: movies.csv and raw_bechdel.csv.

raw_bechdel = readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/raw_bechdel.csv')
movies = readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-03-09/movies.csv')

We mainly used movies.csv for our analysis. It includes movies from 1970 to 2013 and contains the following variables: year, imdb, title, test, clean_test, binary, budget, domgross, intgross, code, budget_2013, domgross_2013, intgross_2013, period_code, decade_code, imdb_id, plot, rated, response, language, country, writer, metascore, imdb_rating, director, released, actors, genre, awards, runtime, type, poster, imdb_votes, error.

For data cleaning,

  • We checked the variable types and converted the following variables from character to numeric form as required: domgross, intgross, domgross_2013, intgross_2013, using the base R as.numeric function.

  • The ‘binary’ variable in character form was converted to logical form to indicate whether the movie had passed or failed the Bechdel test.

  • The numeric ‘decade_code’ variable was re-coded from the year of release into a factor with 5 levels: 1970-1979, 1980-1989, 1990-1999, 2000-2009, and 2010-2013. It was later renamed ‘decade’.

  • Re-coded and re-leveled the clean_test variable into 5 levels corresponding to the Bechdel Test criteria:

    • 0 = Less than 2 women
    • 1 = Women don’t talk to each other
    • 2 = Women only talk about men
    • 3 = Dubious (uncertain whether it passes Bechdel Test)
    • 4 = Passes Bechdel Test
  • Created an award_winner variable that is TRUE if the movie has won at least 1 Oscar, Golden Globe, or BAFTA award

  • Split the genre character variable into dummy-coded variables indicating whether the movie falls under each of 21 different genres (each movie is assigned up to three genres)

  • Created a logical variable pass_bechdel that indicates whether the movie passes or fails the Bechdel Test based on the bechdel_score variable; movies scored as Dubious are counted as passing

  • Created profit and ROI variables for movies. We used worldwide box office revenue instead of domestic revenue.

    • profit = intgross_2013 – budget_2013
    • ROI = profit / budget_2013
  • Renamed key variables to relevant names, and removed extra variables
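As a quick worked example of the profit and ROI definitions above, the sketch below uses made-up figures (not taken from the dataset). A movie with a $50 million budget and $150 million worldwide gross has a profit of $100 million and an ROI of 2, meaning it earned two dollars of profit per dollar of budget.

```r
# Hypothetical figures in 2013 dollars, for illustration only
budget_2013   <- 50e6
intgross_2013 <- 150e6

profit <- intgross_2013 - budget_2013   # worldwide gross minus budget
ROI    <- profit / budget_2013          # profit per dollar of budget

profit
ROI
```

Under these definitions an ROI of 0 means the movie broke even, and an ROI of -1 means the whole budget was lost.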

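library(tidyverse)  # dplyr, tidyr, stringr, and forcats functions used below
library(plotly)     # interactive plots used in the EDA

movies_df = movies %>% 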
  mutate(test = as.factor(test), 
         clean_test = fct_recode(clean_test, 
                                 "Less than 2 women" = "nowomen", 
                                 "Don't talk to each other" = "notalk",
                                 "Only talk about men" = "men", 
                                 "Dubious" = "dubious",
                                 "Passes Bechdel" = "ok"),
         clean_test = fct_relevel(clean_test, c("Less than 2 women", "Don't talk to each other", 
                                                "Only talk about men", "Dubious", "Passes Bechdel")),
         binary = ifelse(clean_test == "Dubious" | clean_test == "Passes Bechdel", TRUE, FALSE), 
         domgross = as.numeric(domgross),
         intgross = as.numeric(intgross),
         domgross_2013 = as.numeric(domgross_2013),
         intgross_2013 = as.numeric(intgross_2013),
         decade_code = case_when(year >= 1970 & year < 1980 ~ "1970-1979",
                            year >= 1980 & year < 1990 ~ "1980-1989",
                            year >= 1990 & year < 2000 ~ "1990-1999",
                            year >= 2000 & year < 2010 ~ "2000-2009",
                            year >= 2010 & year < 2020 ~ "2010-2013"),
         decade_code = as.factor(decade_code),
         title = str_replace(title, "&#39;", "'"),
         title = str_replace(title, "&amp;", "&"),
         title = str_replace(title, "&agrave;", "à"),
         title = str_replace(title, "&aring;", "å"),
         title = str_replace(title, "&auml;", "ä"), 
         runtime = as.numeric(str_replace(runtime, " min", "")),
         award_winner = ifelse(str_detect(awards, "Won") & (str_detect(awards, "Golden Globe") | 
                                                              str_detect(awards, "Oscar") | str_detect(awards, "BAFTA")), TRUE, FALSE), 
         profit = intgross_2013 - budget_2013,
         ROI = profit/budget_2013) %>% 
  separate(genre, into = c("g1", "g2", "g3"), sep = ", ") %>% 
  rename("pass_bechdel" = binary,
         "bechdel_score" = clean_test,
         "decade" = decade_code) %>% 
  mutate(
    action = ifelse(g1 == "Action" | g2 == "Action" | g3 == "Action", TRUE, FALSE),
    adventure = ifelse(g1 == "Adventure" | g2 == "Adventure" | g3 == "Adventure", TRUE, FALSE),
    animation = ifelse(g1 == "Animation" | g2 == "Animation" | g3 == "Animation", TRUE, FALSE),
    biography = ifelse(g1 == "Biography" | g2 == "Biography" | g3 == "Biography", TRUE, FALSE),
    comedy = ifelse(g1 == "Comedy" | g2 == "Comedy" | g3 == "Comedy", TRUE, FALSE),
    crime = ifelse(g1 == "Crime" | g2 == "Crime" | g3 == "Crime", TRUE, FALSE),
    documentary = ifelse(g1 == "Documentary" | g2 == "Documentary" | g3 == "Documentary", TRUE, FALSE),
    drama = ifelse(g1 == "Drama" | g2 == "Drama" | g3 == "Drama", TRUE, FALSE),
    family = ifelse(g1 == "Family" | g2 == "Family" | g3 == "Family", TRUE, FALSE),
    fantasy = ifelse(g1 == "Fantasy" | g2 == "Fantasy" | g3 == "Fantasy", TRUE, FALSE),
    history = ifelse(g1 == "History" | g2 == "History" | g3 == "History", TRUE, FALSE),
    horror = ifelse(g1 == "Horror" | g2 == "Horror" | g3 == "Horror", TRUE, FALSE),
    music = ifelse(g1 == "Music" | g2 == "Music" | g3 == "Music", TRUE, FALSE),
    musical = ifelse(g1 == "Musical" | g2 == "Musical" | g3 == "Musical", TRUE, FALSE),
    mystery = ifelse(g1 == "Mystery" | g2 == "Mystery" | g3 == "Mystery", TRUE, FALSE),
    romance = ifelse(g1 == "Romance" | g2 == "Romance" | g3 == "Romance", TRUE, FALSE),
    sci_fi = ifelse(g1 == "Sci-Fi" | g2 == "Sci-Fi" | g3 == "Sci-Fi", TRUE, FALSE),
    sport = ifelse(g1 == "Sport" | g2 == "Sport" | g3 == "Sport", TRUE, FALSE),
    thriller = ifelse(g1 == "Thriller" | g2 == "Thriller" | g3 == "Thriller", TRUE, FALSE),
    war = ifelse(g1 == "War" | g2 == "War" | g3 == "War", TRUE, FALSE),
    western = ifelse(g1 == "Western" | g2 == "Western" | g3 == "Western", TRUE, FALSE)
  ) %>% 
  mutate(across(action:western, ~replace_na(., FALSE))) %>%
  select(year, title, bechdel_score, pass_bechdel, budget_2013:intgross_2013, decade, imdb_id, metascore, imdb_rating, award_winner, runtime, profit, ROI, action:western) 
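The genre dummy variables above compare each of the three split columns against an exact label. A more compact way to get the same kind of indicator is sketched below in base R on a made-up data frame; the helper `has_genre` is hypothetical and not part of the original pipeline. The sketch splits the genre string into labels and tests for exact membership, which matters because a plain substring search for "Music" would also match "Musical".

```r
# Toy data; titles and genre strings are made up for illustration
toy <- data.frame(
  title = c("A", "B", "C"),
  genre = c("Action, Adventure", "Musical", "Drama, Music, Romance"),
  stringsAsFactors = FALSE
)

# Split each genre string into its individual labels
genre_list <- strsplit(toy$genre, ", ", fixed = TRUE)

# Hypothetical helper: TRUE when the exact label is among a movie's genres
has_genre <- function(label) {
  vapply(genre_list, function(g) label %in% g, logical(1))
}

toy$music   <- has_genre("Music")
toy$musical <- has_genre("Musical")
toy$action  <- has_genre("Action")
toy
```

With this approach a movie with a missing genre string gets FALSE for every genre, which mirrors the replace_na step in the pipeline above.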

skimr::skim(movies_df) %>% 
  knitr::kable(digits = 3) %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) %>% 
  kableExtra::scroll_box(width = "100%", height = "300px")
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace factor.ordered factor.n_unique factor.top_counts logical.mean logical.count numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character title 0 1.000 1 83 0 1768 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
character imdb_id 0 1.000 6 8 0 1794 0 NA NA NA NA NA NA NA NA NA NA NA NA NA
factor bechdel_score 0 1.000 NA NA NA NA NA FALSE 5 Pas: 803, Don: 514, Onl: 194, Dub: 142 NA NA NA NA NA NA NA NA NA NA
factor decade 0 1.000 NA NA NA NA NA FALSE 5 200: 840, 201: 438, 199: 337, 198: 125 NA NA NA NA NA NA NA NA NA NA
logical pass_bechdel 0 1.000 NA NA NA NA NA NA NA NA 0.527 TRU: 945, FAL: 849 NA NA NA NA NA NA NA NA
logical award_winner 202 0.887 NA NA NA NA NA NA NA NA 0.119 FAL: 1402, TRU: 190 NA NA NA NA NA NA NA NA
logical action 0 1.000 NA NA NA NA NA NA NA NA 0.247 FAL: 1351, TRU: 443 NA NA NA NA NA NA NA NA
logical adventure 0 1.000 NA NA NA NA NA NA NA NA 0.200 FAL: 1435, TRU: 359 NA NA NA NA NA NA NA NA
logical animation 0 1.000 NA NA NA NA NA NA NA NA 0.060 FAL: 1686, TRU: 108 NA NA NA NA NA NA NA NA
logical biography 0 1.000 NA NA NA NA NA NA NA NA 0.043 FAL: 1717, TRU: 77 NA NA NA NA NA NA NA NA
logical comedy 0 1.000 NA NA NA NA NA NA NA NA 0.312 FAL: 1234, TRU: 560 NA NA NA NA NA NA NA NA
logical crime 0 1.000 NA NA NA NA NA NA NA NA 0.144 FAL: 1535, TRU: 259 NA NA NA NA NA NA NA NA
logical documentary 0 1.000 NA NA NA NA NA NA NA NA 0.002 FAL: 1790, TRU: 4 NA NA NA NA NA NA NA NA
logical drama 0 1.000 NA NA NA NA NA NA NA NA 0.412 FAL: 1055, TRU: 739 NA NA NA NA NA NA NA NA
logical family 0 1.000 NA NA NA NA NA NA NA NA 0.060 FAL: 1687, TRU: 107 NA NA NA NA NA NA NA NA
logical fantasy 0 1.000 NA NA NA NA NA NA NA NA 0.100 FAL: 1614, TRU: 180 NA NA NA NA NA NA NA NA
logical history 0 1.000 NA NA NA NA NA NA NA NA 0.027 FAL: 1745, TRU: 49 NA NA NA NA NA NA NA NA
logical horror 0 1.000 NA NA NA NA NA NA NA NA 0.095 FAL: 1623, TRU: 171 NA NA NA NA NA NA NA NA
logical music 0 1.000 NA NA NA NA NA NA NA NA 0.021 FAL: 1757, TRU: 37 NA NA NA NA NA NA NA NA
logical musical 0 1.000 NA NA NA NA NA NA NA NA 0.012 FAL: 1773, TRU: 21 NA NA NA NA NA NA NA NA
logical mystery 0 1.000 NA NA NA NA NA NA NA NA 0.080 FAL: 1651, TRU: 143 NA NA NA NA NA NA NA NA
logical romance 0 1.000 NA NA NA NA NA NA NA NA 0.133 FAL: 1556, TRU: 238 NA NA NA NA NA NA NA NA
logical sci_fi 0 1.000 NA NA NA NA NA NA NA NA 0.114 FAL: 1590, TRU: 204 NA NA NA NA NA NA NA NA
logical sport 0 1.000 NA NA NA NA NA NA NA NA 0.016 FAL: 1766, TRU: 28 NA NA NA NA NA NA NA NA
logical thriller 0 1.000 NA NA NA NA NA NA NA NA 0.169 FAL: 1490, TRU: 304 NA NA NA NA NA NA NA NA
logical war 0 1.000 NA NA NA NA NA NA NA NA 0.016 FAL: 1765, TRU: 29 NA NA NA NA NA NA NA NA
logical western 0 1.000 NA NA NA NA NA NA NA NA 0.007 FAL: 1782, TRU: 12 NA NA NA NA NA NA NA NA
numeric year 0 1.000 NA NA NA NA NA NA NA NA NA NA 2.002552e+03 8.980000e+00 1970.0 1998.000 2005.000 2.009000e+03 2.013000e+03 ▁▁▂▅▇
numeric budget_2013 0 1.000 NA NA NA NA NA NA NA NA NA NA 5.546461e+07 5.491864e+07 8632.0 16068917.500 36995786.000 7.833790e+07 4.614359e+08 ▇▂▁▁▁
numeric domgross_2013 18 0.990 NA NA NA NA NA NA NA NA NA NA 9.517478e+07 1.259653e+08 899.0 20546593.750 55993640.500 1.216784e+08 1.771683e+09 ▇▁▁▁▁
numeric intgross_2013 11 0.994 NA NA NA NA NA NA NA NA NA NA 1.978380e+08 2.835079e+08 899.0 33232603.500 96239640.000 2.414790e+08 3.171931e+09 ▇▁▁▁▁
numeric metascore 378 0.789 NA NA NA NA NA NA NA NA NA NA 5.900100e+01 1.723800e+01 11.0 47.000 60.000 7.200000e+01 1.000000e+02 ▁▅▇▆▂
numeric imdb_rating 202 0.887 NA NA NA NA NA NA NA NA NA NA 6.760000e+00 9.630000e-01 2.1 6.200 6.800 7.400000e+00 9.300000e+00 ▁▁▅▇▂
numeric runtime 203 0.887 NA NA NA NA NA NA NA NA NA NA 1.109890e+02 2.091300e+01 55.0 97.000 107.000 1.220000e+02 2.710000e+02 ▅▇▁▁▁
numeric profit 11 0.994 NA NA NA NA NA NA NA NA NA NA 1.421113e+08 2.547185e+08 -120328632.0 5802238.500 54966804.000 1.645158e+08 3.024172e+09 ▇▁▁▁▁
numeric ROI 11 0.994 NA NA NA NA NA NA NA NA NA NA 5.197000e+00 2.081800e+01 -1.0 0.346 1.657 4.042000e+00 4.305170e+02 ▇▁▁▁▁

The resulting dataset movies_df contains 1794 observations and 36 variables, with information about each movie’s Bechdel test score, budget, revenue, genre, and ratings.

EDA

We looked at the overall distribution of movies from 1970 to 2013 that passed or failed the Bechdel test, and distribution of movies as per Bechdel test scores.

movies_df %>% 
  group_by(pass_bechdel) %>% 
  summarise(N = n()) %>% 
  mutate(Proportion = N/sum(N)) %>% 
  rename("Passes Bechdel Test" = pass_bechdel) %>% 
  knitr::kable(digits = 3) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Passes Bechdel Test N Proportion
FALSE 849 0.473
TRUE 945 0.527
movies_df %>% 
  group_by(bechdel_score) %>% 
  summarise(N = n()) %>% 
  mutate(Proportion = N/sum(N)) %>% 
  rename("Bechdel Test Criterion" = bechdel_score) %>% 
  knitr::kable(digits = 3) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
Bechdel Test Criterion N Proportion
Less than 2 women 141 0.079
Don’t talk to each other 514 0.287
Only talk about men 194 0.108
Dubious 142 0.079
Passes Bechdel 803 0.448

Over half of the movies in the data (52.7%) passed the Bechdel Test. Stratifying by Bechdel score showed that a small proportion of the passing movies had a dubious passing status. Among the movies that failed the test, most had women who didn’t talk to each other.

We conducted a further exploratory data analysis to answer our main questions:

1. How has the distribution of movies that pass the Bechdel Test changed over time?

# Count movies per decade and Bechdel category, then compute each category's share within its decade
decades = movies_df %>% 
  group_by(decade, bechdel_score) %>% 
  summarise(N = n(), .groups = "drop_last") %>% 
  mutate(Proportion = N/sum(N))

decades %>% 
  plot_ly(x = ~decade, y = ~Proportion*100, type = "bar", color = ~bechdel_score, colors = "RdYlGn") %>% 
  layout(barmode = "stack",
         title = "Distribution of movies passing Bechdel Test Criteria by decade",
         xaxis = list(title = "Decade"),
         yaxis = list(title = "Percentage (%)"))

The proportion of movies that passed the Bechdel test increased over the decades. The last period (2010-13) covers only four years and a smaller sample of movies than the 2000s, and the increase appears to have leveled off since the 2000s. Also, the proportion of movies that decisively passed the test is still below 50%.

2. How is passing the Bechdel test associated with movie budget?

movies_df %>% 
  mutate(pass_bechdel = ifelse(pass_bechdel == TRUE, "Pass", "Fail")) %>% 
  plot_ly(x = ~budget_2013, y = ~pass_bechdel, type = "box", text = ~title, color = ~pass_bechdel, colors = c("red4", "green4")) %>% 
  layout(xaxis = list(title = "Movie budget ($)"),
         yaxis = list(title = "Bechdel Test"), 
         showlegend = FALSE)
movies_df %>% 
  plot_ly(x = ~budget_2013, y = ~bechdel_score, type = "box", text = ~title, color = ~bechdel_score, colors = c("darkred","orange1", "green4")) %>% 
  layout(xaxis = list(title = "Movie budget ($)"),
         yaxis = list(title = "Bechdel Criterion"),
         showlegend = FALSE, 
         title = "Movie budgets (2013-adjusted) stratified by Bechdel Test criteria")

Movies that passed the Bechdel test appeared to have smaller budgets than movies that did not. When stratified by the categorical Bechdel score, the budget distributions did not differ markedly from each other. This possibly suggests that Hollywood puts more money behind male-only films than films in which women talk to each other.

3. How is passing the Bechdel test associated with movie profit?

movies_df %>% 
  mutate(pass_bechdel = ifelse(pass_bechdel == TRUE, "Pass", "Fail")) %>% 
  plot_ly(x = ~profit, y = ~pass_bechdel, type = "box", text = ~title, color = ~bechdel_score, colors = c("darkred","orange1", "green4")) %>% 
  layout(boxmode = "group", 
         xaxis = list(title = "Profit ($)"),
         yaxis = list(title = "Bechdel Criterion"),
         showlegend = FALSE,
         title = "Movie profits stratified by Bechdel Test criteria")

Movies failing the Bechdel test seemed to make slightly more profit than those that passed, but the distributions of profit for the two groups were not appreciably different. Upon stratifying, there were no notable differences in profit between Bechdel criteria.

4. How is passing the Bechdel test associated with ROI?

We also examined ROI, the ratio of profit to budget (both adjusted to 2013 dollars). Because ROI was heavily right-skewed, we log-transformed it to visualize differences in the ROI distribution between movies with different Bechdel criteria. Movies with zero or negative ROI have no finite logarithm and therefore do not appear in this plot.

movies_df %>% 
  mutate(pass_bechdel = ifelse(pass_bechdel == TRUE, "Pass", "Fail")) %>% 
  plot_ly(x = ~log(ROI), y = ~pass_bechdel, type = "box", text = ~title, color = ~bechdel_score, colors = c("darkred","orange1", "green4")) %>% 
  layout(boxmode = "group", 
         xaxis = list(title = "Log(Return on Investment)"),
         yaxis = list(title = "Bechdel Criterion"),
         showlegend = FALSE,
         title = "Movie ROI stratified by Bechdel Test criteria")
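One caveat of plotting log(ROI) is that the logarithm is only finite for positive values, while ROI in this dataset goes as low as -1. The sketch below uses made-up ROI values to show how many observations would be lost. It also shows one possible alternative that this report does not use: log(1 + ROI), which equals the log of gross revenue divided by budget and is finite whenever revenue is positive.

```r
# Hypothetical ROI values: a total loss, a partial loss, break-even, and two gains
roi <- c(-1, -0.4, 0, 1.5, 20)

# log() of a negative number is NaN and log(0) is -Inf
log_roi <- suppressWarnings(log(roi))
usable  <- is.finite(log_roi)

sum(!usable)   # values that cannot be placed on a log(ROI) axis

# Alternative (an assumption, not the report's method): log(1 + ROI)
log1p_roi <- log1p(roi)
sum(!is.finite(log1p_roi))   # only the total loss (ROI = -1) is lost
```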

Although movies that pass the Bechdel test had slightly lower budgets, the primacy of women’s roles did not appear to hurt movies’ financial performance. This contradicts the claim that films with stronger female characters perform worse at the box office than films without them.

5. How is passing the Bechdel test associated with IMDb ratings?

We wanted to see how the fans received the movies that passed the Bechdel test. We plotted IMDB ratings (fan ratings) against Metacritic Scores (aggregate scores based on critic reviews) and stratified by Bechdel Score.

movies_df %>% 
  plot_ly(x = ~imdb_rating, y = ~metascore/10, color = ~bechdel_score, type = "scatter", 
          mode = "markers", colors = c("darkred","orange1", "green4"), text = ~title, alpha = 0.6) %>% 
  layout(yaxis = list(title = list(text = "Metascore", standoff = 5), gridcolor = "lightgray"),
         xaxis = list(title = "IMDB Rating", gridcolor = "gray"),
         title = "Movie IMDB and Metacritic Scores stratified by Bechdel Test criteria")

There was no appreciable difference in ratings between the movies that pass and fail the Bechdel test. However, more films that passed the Bechdel Test had higher critic ratings relative to fan ratings, and more films that failed the Bechdel Test had lower critic ratings relative to fan ratings. One potential reason is that fans may hold inherent biases that make them more critical of films with female-dominant roles, while critics may be more likely to set aside their own biases when judging films.
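The comparison of critic and fan ratings can also be summarised numerically. The sketch below, on made-up ratings, puts Metascore on the same 0 to 10 scale as the IMDB rating (as in the plot) and averages the critic-minus-fan gap within each Bechdel group. The column `gap` and the toy values are assumptions for illustration, not results from the data.

```r
# Toy ratings, made up for illustration
toy <- data.frame(
  pass_bechdel = c(TRUE, TRUE, FALSE, FALSE),
  imdb_rating  = c(6.5, 7.0, 7.5, 6.0),
  metascore    = c(72, 68, 70, 55)
)

# Positive gap: critics rated the movie higher than fans did
toy$gap <- toy$metascore / 10 - toy$imdb_rating

# Mean gap for movies that fail (FALSE) and pass (TRUE) the test
mean_gap <- tapply(toy$gap, toy$pass_bechdel, mean)
mean_gap
```

Applied to movies_df, a larger mean gap among passing movies would be consistent with the pattern described above.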

Linear Models

To test and quantify the strength of association, we fit separate linear regression models of budget, profit, IMDB rating, and Metascore on the categorical Bechdel score.

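For these linear models, we used "Less than 2 women" as the reference category. The intercept is therefore the estimated mean outcome for movies that fail every dimension of the Bechdel test, and each coefficient is the difference between one of the other categories and that reference. In particular, the "Passes Bechdel" coefficient compares movies that pass all dimensions of the Bechdel test with movies that fail all of them.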
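To make the role of the reference category concrete, here is a minimal sketch on a made-up two-group dataset (the outcome values are invented). With R's default treatment coding, the intercept equals the mean of the reference group and the coefficient equals the difference between the other group's mean and the reference mean.

```r
# Toy data with two Bechdel categories; the first level is the reference
toy <- data.frame(
  score = factor(
    c("Less than 2 women", "Less than 2 women", "Passes Bechdel", "Passes Bechdel"),
    levels = c("Less than 2 women", "Passes Bechdel")
  ),
  y = c(10, 12, 7, 9)
)

fit <- lm(y ~ score, data = toy)
coef(fit)

# Group means for comparison with the coefficients
group_means <- tapply(toy$y, toy$score, mean)
group_means
```

Here the reference mean is 11 and the other group's mean is 8, so the intercept is 11 and the "Passes Bechdel" coefficient is -3.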

1. Linear regression of Budget vs. Categorical Bechdel Score

Because the budget was heavily right-skewed, we log-transformed it to bring its distribution closer to normal.

log_budget = movies_df %>% ggplot(aes(x = log(budget_2013))) + geom_histogram(alpha = 0.8, color = "white") + 
  labs(
    x = "Log budget ($, 2013-adjusted)",
    y = "Count",
    title = "Distribution of movie budgets")
ggplotly(log_budget)

We quantitatively tested the association of the log of budget with categorical Bechdel score using linear regression model:

log_budget= movies_df %>%
  lm(log(budget_2013)~bechdel_score, data=.)%>% 
  broom::tidy()
  
knitr::kable(log_budget, digits=3)
term estimate std.error statistic p.value
(Intercept) 17.341 0.116 150.109 0.000
bechdel_scoreDon’t talk to each other 0.154 0.130 1.183 0.237
bechdel_scoreOnly talk about men -0.226 0.152 -1.488 0.137
bechdel_scoreDubious -0.078 0.163 -0.478 0.633
bechdel_scorePasses Bechdel -0.310 0.125 -2.475 0.013
  • Based on the parameter estimate of -0.310 with a p-value of 0.013, passing all 3 Bechdel dimensions had a negative relationship with movie budget, i.e., movies that passed all 3 Bechdel dimensions had lower budgets than movies that failed every Bechdel dimension.

  • Movies that failed every Bechdel dimension had an average budget of $33,963,942 (back-transformed from the log scale, so this is a geometric mean)

  • Movies that passed all 3 Bechdel dimensions had an average budget of $24,910,750 (likewise a geometric mean)
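The dollar figures above come from back-transforming the log-scale coefficients. The sketch below reproduces that arithmetic from the rounded values in the regression table, so the results differ slightly from the figures quoted above, which presumably used unrounded estimates. It also expresses the "Passes Bechdel" coefficient as a percentage difference.

```r
# Coefficients copied from the regression table above (log scale, rounded)
intercept   <- 17.341   # reference group: "Less than 2 women"
coef_passes <- -0.310   # "Passes Bechdel" relative to the reference

# Back-transform to dollars
budget_fail_all <- exp(intercept)
budget_pass_all <- exp(intercept + coef_passes)

# A log-scale difference of b corresponds to a (exp(b) - 1) * 100 percent change
pct_difference <- (exp(coef_passes) - 1) * 100

round(budget_fail_all)
round(budget_pass_all)
round(pct_difference, 1)
```

Read this way, movies that pass all three dimensions had budgets roughly 27% lower than movies that fail every dimension.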

2. Linear regression of Profit vs. categorical Bechdel score

Because profit was heavily right-skewed, we log-transformed it to bring its distribution closer to normal. Before the logarithm was applied, 1 was added to each value to avoid taking the logarithm of 0; this was later subtracted when calculating the results. Movies with negative profit still have no defined logarithm, so under R's default handling of missing values they are excluded from the model.

profit = movies_df %>% ggplot(aes(x = log(profit + 1))) + geom_histogram(alpha = 0.8, color = "white") + 
  labs(
    x = "Log(profit + 1)",
    y = "Count",
    title = "Distribution of Profits")
ggplotly(profit)

We quantitatively tested the association of profit with the categorical Bechdel score using a linear regression model of log profit on the Bechdel categories. The model shown here does not include budget as a covariate.

profit = movies_df %>% 
  lm(log(profit+1)~bechdel_score, data =.) %>% 
  broom::tidy()

knitr::kable(profit, digits=3)
term estimate std.error statistic p.value
(Intercept) 18.139 0.160 113.117 0.000
bechdel_scoreDon’t talk to each other 0.080 0.179 0.447 0.655
bechdel_scoreOnly talk about men -0.012 0.208 -0.057 0.955
bechdel_scoreDubious 0.071 0.224 0.317 0.751
bechdel_scorePasses Bechdel -0.282 0.173 -1.625 0.104
  • Based on the parameter estimate of -0.282, movies that passed all 3 dimensions of the Bechdel test had lower profits than movies that failed all dimensions of the Bechdel test. However, since the p-value of 0.104 is greater than 0.05, the difference was not statistically significant.

  • Movies that failed all dimensions of the Bechdel test made an average profit of $75,526,942 (back-transformed from the log scale, so this is a geometric mean)

  • Movies that passed all dimensions of the Bechdel test made an average profit of $57,082,034 (likewise a geometric mean)
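Adding 1 before taking the log handles a profit of exactly 0, but not losses: the log of a negative number is NaN, and lm() drops such rows under its default na.action. The sketch below, on made-up profits with a hypothetical grouping variable, shows how to count the affected rows and confirm how many observations the model actually used.

```r
# Toy profits (made up): one loss, one break-even, four gains
toy <- data.frame(
  profit = c(-5e6, 0, 2e6, 8e6, 3e7, 1e8),
  group  = factor(c("a", "b", "a", "b", "a", "b"))
)

# Count rows whose log(profit + 1) is undefined
log_profit <- suppressWarnings(log(toy$profit + 1))
n_undefined <- sum(is.nan(log_profit))
n_undefined

# lm() silently excludes those rows (default na.action = na.omit)
fit <- suppressWarnings(lm(log(profit + 1) ~ group, data = toy))
nobs(fit)
```

Running the same check on movies_df would show how many loss-making movies are left out of the profit models.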

3. Linear regression of IMDB rating vs. categorical Bechdel score

imdb = movies_df %>% ggplot(aes(x = imdb_rating)) + geom_histogram(alpha = 0.8, color = "white") + 
  labs(
    x = "IMDB rating",
    y = "Count",
    title = "Distribution of IMDB")
ggplotly(imdb)

This graph showed an approximately normal distribution.

We quantitatively tested the association of IMDB rating with categorical Bechdel scores using linear regression

imdb = movies_df %>% 
  lm(imdb_rating~bechdel_score, data=.) %>% 
  broom::tidy()
knitr::kable(imdb, digits=3)
term estimate std.error statistic p.value
(Intercept) 6.944 0.085 81.848 0.000
bechdel_scoreDon’t talk to each other -0.032 0.095 -0.335 0.738
bechdel_scoreOnly talk about men -0.074 0.113 -0.653 0.514
bechdel_scoreDubious -0.208 0.121 -1.723 0.085
bechdel_scorePasses Bechdel -0.344 0.092 -3.729 0.000
  • Based on the parameter estimate of -0.344 with a p-value of 0.00019, passing all 3 dimensions of the Bechdel test had a negative relationship with IMDB ratings.

  • Movies that passed all dimensions of the Bechdel test received 0.34 points lower IMDB ratings than movies that failed all dimensions of the Bechdel test.

4. Linear regression of Metascore vs. categorical Bechdel score

meta = movies_df %>% ggplot(aes(x = metascore)) + geom_histogram(alpha = 0.8, color = "white") + 
  labs(
    x = "Meta score",
    y = "Count",
    title = "Distribution of Metascore")
ggplotly(meta)

The graph showed an approximately normal distribution.

We quantitatively tested the association of Metascore with categorical Bechdel scores using linear regression.

meta = movies_df %>% 
  lm(metascore~bechdel_score, data=.) %>% 
  broom::tidy()
knitr::kable(meta, digits=3)
term estimate std.error statistic p.value
(Intercept) 58.730 1.636 35.897 0.000
bechdel_scoreDon’t talk to each other 1.306 1.841 0.709 0.478
bechdel_scoreOnly talk about men -0.591 2.155 -0.274 0.784
bechdel_scoreDubious 2.032 2.324 0.874 0.382
bechdel_scorePasses Bechdel -0.464 1.775 -0.262 0.794
  • Based on the parameter estimate of -0.464, passing all dimensions of the Bechdel test had a negative relationship with Metascore. However, since the p-value of 0.794 is greater than 0.05, the difference was not statistically significant.

  • Movies that passed all dimensions of the Bechdel test received 0.46 points less in Metascore than movies that failed all dimensions of the Bechdel test.

Stepwise Regression

We used stepwise regression to check whether the Bechdel score was an influential predictor of movie performance as measured by profit, budget_2013, and review ratings. The setup of these models follows the methodology used for the linear regression models, including log-transforming profit and budget_2013 to reduce their right skew. For the review rating variable, we had a choice between the Metacritic scores and the IMDB scores. Because imdb_rating was statistically significantly associated with the Bechdel test in the linear regression analysis, we used this variable rather than metascore.

We used the stepAIC function from the MASS package, a stepwise algorithm that searches for the model with the lowest AIC. First, we tested both the binary and categorical variables for the Bechdel test in the algorithm. The binary variable was more likely to be selected as part of the final model because it uses fewer degrees of freedom, so it became the variable used in the final models. Initially, we also included return on investment (ROI) as an outcome variable, but it gave the same results as the profit variable, so we removed it from the final analyses.
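As a minimal sketch of how this selection works, the example below runs MASS::stepAIC on R's built-in mtcars data, which stands in for movies_df purely for illustration. The binary variable am plays the role of a yes/no predictor such as pass_bechdel. The point is the mechanics: start from a full model, let the algorithm add or drop terms to lower AIC, then check which terms survive.

```r
# Full model on a stand-in dataset (mtcars ships with R)
full_fit <- lm(mpg ~ am + wt + hp + qsec + drat, data = mtcars)

# Stepwise search in both directions, minimising AIC
step_fit <- MASS::stepAIC(full_fit, direction = "both", trace = FALSE)

# Terms retained in the final model
kept <- attr(terms(step_fit), "term.labels")
kept

# Was the binary predictor of interest retained?
"am" %in% kept

# The selected model should not have a higher AIC than the full model
AIC(step_fit) <= AIC(full_fit)
```

In the models that follow, the analogous check is whether pass_bechdelTRUE appears among the terms of the final table.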

We first assessed the Bechdel test as a potential predictor of profit. Other potential predictors included in the model were budget_2013, imdb_rating, and all genre variables.

# Stepwise regression for profit
## logProfit ~ Bechdel (binary) + logBudget + IMDB + Genre
modelprofit = lm(log(profit +1) ~ pass_bechdel + log(budget_2013) + imdb_rating + action + adventure + animation + biography + comedy + crime + documentary + drama + family + fantasy + history + horror + music + musical + mystery + romance + sci_fi + sport + thriller + war + western, data = movies_df)
stepprofit <- MASS::stepAIC(modelprofit, direction = "both", trace = FALSE) %>% broom::tidy()
knitr::kable(stepprofit, digits = 3)
term estimate std.error statistic p.value
(Intercept) 4.205 0.599 7.020 0.000
log(budget_2013) 0.646 0.030 21.721 0.000
imdb_rating 0.395 0.041 9.539 0.000
adventureTRUE 0.453 0.101 4.463 0.000
biographyTRUE -0.338 0.179 -1.886 0.060
dramaTRUE -0.369 0.085 -4.330 0.000
horrorTRUE 0.547 0.128 4.264 0.000
musicTRUE 0.481 0.262 1.834 0.067
romanceTRUE 0.197 0.107 1.848 0.065
sci_fiTRUE -0.210 0.113 -1.851 0.064
westernTRUE -2.685 0.496 -5.412 0.000

The second metric we assessed as a measure of movie success was movie reviews as measured by imdb_rating. Other potential predictors in the model formula were budget_2013, profit, and all genre variables.

# Stepwise regression for imdb ratings
## IMDB ~ Bechdel (binary) + logBudget + profit + Genre
## (the response imdb_rating was removed from the predictors; R drops it from the right-hand side anyway)
modelIMDB = lm(imdb_rating ~ pass_bechdel + log(budget_2013) + profit + action + adventure + animation + biography + comedy + crime + documentary + drama + family + fantasy + history + horror + music + musical + mystery + romance + sci_fi + sport + thriller + war + western, data = movies_df)
stepIMDB <- MASS::stepAIC(modelIMDB, direction = "both", trace = FALSE) %>% broom::tidy()
knitr::kable(stepIMDB, digits = 3)
term                estimate std.error statistic   p.value
(Intercept)            8.427     0.312    26.991     0.000
pass_bechdelTRUE      -0.255     0.043    -5.943     0.000
log(budget_2013)      -0.080     0.018    -4.450     0.000
profit                 0.000     0.000    14.631     0.000
actionTRUE            -0.418     0.060    -6.930     0.000
adventureTRUE         -0.212     0.065    -3.266     0.001
animationTRUE          0.320     0.092     3.477     0.001
biographyTRUE          0.336     0.102     3.299     0.001
comedyTRUE            -0.412     0.055    -7.490     0.000
crimeTRUE              0.140     0.062     2.268     0.023
dramaTRUE              0.199     0.056     3.526     0.000
familyTRUE            -0.462     0.089    -5.218     0.000
fantasyTRUE           -0.116     0.070    -1.655     0.098
horrorTRUE            -0.684     0.081    -8.438     0.000
musicTRUE             -0.236     0.139    -1.703     0.089
romanceTRUE           -0.139     0.065    -2.138     0.033
thrillerTRUE          -0.242     0.063    -3.834     0.000
warTRUE                0.233     0.156     1.499     0.134

As budget is a significant predictor of both profit and IMDB ratings, we also assessed whether passing the Bechdel test is an influential predictor of a movie’s budget. Other potential predictors we included in the model were runtime and all genre variables.

# Stepwise regression for budget
## logBudget ~ Bechdel (binary) + runtime + Genre
modelbudget = lm(log(budget_2013) ~ pass_bechdel + runtime + action + adventure + animation + biography + comedy + crime + documentary + drama + family + fantasy + history + horror + music + musical + mystery + romance + sci_fi + sport + thriller + war + western, data = movies_df)
stepbudget <- MASS::stepAIC(modelbudget, direction = "both", trace = FALSE) %>%  broom::tidy()
knitr::kable(stepbudget, digits = 3)
term                estimate std.error statistic   p.value
(Intercept)           13.521     0.186    72.873     0.000
runtime                0.026     0.001    17.882     0.000
actionTRUE             0.817     0.076    10.814     0.000
adventureTRUE          0.622     0.082     7.613     0.000
animationTRUE          1.300     0.124    10.492     0.000
biographyTRUE          0.326     0.134     2.433     0.015
comedyTRUE             0.287     0.073     3.917     0.000
crimeTRUE              0.173     0.080     2.157     0.031
documentaryTRUE       -1.750     0.539    -3.246     0.001
dramaTRUE             -0.247     0.071    -3.474     0.001
familyTRUE             0.677     0.115     5.890     0.000
fantasyTRUE            0.549     0.093     5.904     0.000
musicalTRUE            0.531     0.237     2.239     0.025
mysteryTRUE            0.483     0.101     4.771     0.000
romanceTRUE            0.285     0.084     3.388     0.001
sci_fiTRUE             0.424     0.090     4.726     0.000
sportTRUE              0.571     0.208     2.751     0.006
thrillerTRUE           0.331     0.083     3.976     0.000

The stepwise procedure retained the Bechdel test as a predictor only for imdb_rating, not for profit or budget_2013. In that model, movies that passed the Bechdel test were rated about 0.26 points lower on IMDB, on average, than movies that failed, holding the other selected predictors constant. Based on this, passing the Bechdel test can be considered a neutral to negative factor for a movie’s potential success.
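One way to check programmatically whether a term survived selection is to inspect the terms of the model that stepAIC returns, before it is passed to broom::tidy(). The sketch below uses simulated data rather than movies_df. The data frame `sim` and its columns are assumptions: a strong predictor, a pure-noise predictor, and an unrelated logical flag.

```r
# Illustrative only: simulated data, not the project's movies_df.
set.seed(42)
n <- 300
sim <- data.frame(
  x_strong = rnorm(n),                   # truly related to the outcome
  x_noise  = rnorm(n),                   # unrelated to the outcome
  flag     = rep(c(TRUE, FALSE), n / 2)  # unrelated logical predictor
)
sim$y <- 1 + 2 * sim$x_strong + rnorm(n)

full     <- lm(y ~ x_strong + x_noise + flag, data = sim)
selected <- MASS::stepAIC(full, direction = "both", trace = FALSE)

# Names of the predictors that remain in the selected model.
kept <- attr(terms(selected), "term.labels")
kept
"flag" %in% kept   # TRUE only if the flag survived selection
```

A term that is absent from `kept` was dropped because removing it lowered the AIC, which is the sense in which the Bechdel variable was not selected for profit or budget_2013.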

Discussion

First, we explored the proportion of movies in our dataset that passed the Bechdel test and whether that proportion changed over time. We saw that 52.7% of the movies in the data passed the Bechdel test. The proportion of movies that passed also increased from decade to decade.
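The per-decade proportion can be computed along the following lines. This is a sketch on a few invented rows: pass_bechdel matches the variable used in the models, while the `year` column and the values themselves are assumptions.

```r
# Illustrative only: eight invented rows standing in for movies_df.
toy <- data.frame(
  year         = c(1974, 1979, 1985, 1988, 1993, 1997, 2004, 2011),
  pass_bechdel = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
)

# Bucket each release year into its decade (1974 -> 1970, 1985 -> 1980, ...).
toy$decade <- floor(toy$year / 10) * 10

# The mean of a logical vector is the share of TRUE values, i.e. the pass rate.
by_decade <- aggregate(pass_bechdel ~ decade, data = toy, FUN = mean)
by_decade
```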

For further analysis, we measured the success of these movies using their budget, profit, and IMDb ratings. Looking at these associations, we saw that movies that passed the test had lower budgets and made lower profits than those that failed the test. Additionally, there was no appreciable difference in IMDb ratings between movies that passed and movies that failed the test. On stratifying, more of the films that passed the Bechdel test had higher critic ratings than general audience ratings, and more of the films that failed the Bechdel test had lower critic ratings than audience ratings. This may suggest that general audiences hold inherent biases that make them more critical of films with female-dominant roles, while critics may be more likely to set these biases aside when judging films.

We found that a movie passing the Bechdel test plays an influential role in predicting IMDb ratings, but not profits or budget. Movies that pass the Bechdel test, on average, have lower ratings than movies that fail.

In conclusion, while not all results were statistically significant, our analysis pointed to a similar pattern, where passing the Bechdel test had a negative influence on our measures of success.

Superheroes Bonus

The Superheroes dataset was obtained using the SelectorGadget Chrome extension.
We created a custom IMDB list containing all released movies from the Marvel Cinematic Universe, as well as one for the DC Extended Universe. Several DC superhero movies that are not in the Extended Universe were also added. Using SelectorGadget, we pulled the title and year variables.
Data on budget and on domestic and international gross were pulled from Wikipedia using SelectorGadget.
Data on the Bechdel test results for each movie were taken from our movies_df data frame and from bechdeltest.com.

The datasets obtained from each website were cleaned and merged to create the Superheroes_finally.xlsx file used for the Superheroes dashboard.
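SelectorGadget yields CSS selectors, which can then be used for scraping in R. The sketch below shows one way to do that with the rvest package, which is an assumption on our part since the report does not name the scraping library. The HTML snippet, class names, and titles are invented stand-ins for a real list page.

```r
library(rvest)

# Invented markup standing in for a downloaded list page.
page <- minimal_html('
  <div class="item"><h3 class="title">Example Hero</h3><span class="year">(2008)</span></div>
  <div class="item"><h3 class="title">Example Hero 2</h3><span class="year">(2010)</span></div>
')

# The selectors below are hypothetical; SelectorGadget would supply real ones.
titles <- page %>% html_elements(".item .title") %>% html_text2()
years  <- page %>% html_elements(".item .year") %>% html_text2()

superheroes <- data.frame(
  title = titles,
  year  = as.integer(gsub("[^0-9]", "", years))  # strip the parentheses
)
superheroes
```

For a live page, `minimal_html()` would be replaced by `read_html()` with the page URL, and the selectors would be whatever SelectorGadget reports for that page.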

The dashboard contains two graphs. The first is a bar plot of the proportion of movies that pass the Bechdel test in the DC universe and in the Marvel universe. Marvel has a higher percentage of movies that pass the test.
The second graph is a scatter plot of the movies that pass the test over the years. It shows that the cluster of passing movies is larger after 2016 than before 2016, and that no movie passed the test before 2008. This trend suggests that the number of superhero movies passing the Bechdel test is increasing.
Finally, we added a search table to the dashboard so that users can easily look up which movies passed the Bechdel test and which failed.
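The merge and the bar plot's underlying proportions can be sketched as follows. All rows are invented, and the `universe` column name is an assumption. Only pass_bechdel mirrors a variable named in the report.

```r
# Illustrative only: invented titles, universes, and Bechdel results.
heroes <- data.frame(
  title    = c("Film A", "Film B", "Film C", "Film D"),
  universe = c("Marvel", "Marvel", "DC", "DC")
)
bechdel <- data.frame(
  title        = c("Film A", "Film B", "Film C", "Film D"),
  pass_bechdel = c(TRUE, TRUE, TRUE, FALSE)
)

# A left join keeps every scraped movie, even one with no Bechdel record.
merged <- merge(heroes, bechdel, by = "title", all.x = TRUE)

# Share of passing movies per universe, the quantity shown in the bar plot.
prop_pass <- aggregate(pass_bechdel ~ universe, data = merged, FUN = mean)
prop_pass
```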



Members:

  • sc4934
  • sm5134
  • rp3022
  • sm4956
  • my2731