Theory and Hypotheses
We hypothesize that DC and Marvel movies are significantly distinguishable in style and topics. We believe that the two studios' movies create different experiences for audiences, so differences in audience reviews can be used to demonstrate differences between the movies.
Essentially, this is a binary classification problem in which the studio (DC or Marvel) is the dependent variable, with two classes, and the audience reviews are the independent variables. It can also be treated as a clustering problem: assuming that DC and Marvel movies focus on different topics, such as democracy, feminism, self-growth, and collaboration, we can distinguish them with topic models.
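The classification framing above can be illustrated with a minimal sketch. It uses a bag-of-words multinomial Naive Bayes classifier with add-one smoothing, chosen here only because it is short enough to write with the standard library; it is not necessarily the classifier used in this project. The four training strings are invented placeholders, not reviews from the collected data, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train_naive_bayes(reviews, labels):
    """Count words per studio label and documents per label."""
    word_counts = {label: Counter() for label in set(labels)}
    doc_counts = Counter(labels)
    for text, label in zip(reviews, labels):
        word_counts[label].update(tokenize(text))
    vocabulary = set().union(*word_counts.values())
    return word_counts, doc_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log posterior for the given text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for word in tokenize(text):
            if word in vocabulary:  # skip words never seen in training
                # add-one (Laplace) smoothed log likelihood
                score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented placeholder texts (NOT taken from the collected reviews).
toy_reviews = [
    "batman returns to gotham",
    "superman leaves metropolis",
    "avengers assemble in new york",
    "thor returns to asgard",
]
toy_labels = ["DC", "DC", "Marvel", "Marvel"]

model = train_naive_bayes(toy_reviews, toy_labels)
print(predict(model, "batman in gotham"))  # DC
print(predict(model, "thor in asgard"))    # Marvel
```

In this framing the label is the single dependent variable and the word counts derived from each review are the independent variables.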
Data
- collected data with a web crawler
- text preprocessing
- preliminary analysis (time series, rating difference, box office difference, etc.)
Data collection
We collected the reviews and basic information for each movie from the Rotten Tomatoes website with a web crawler. Pipelines were built in Python to collect the data automatically. The Requests, BeautifulSoup, and re packages were used to request and parse the web pages. To avoid overloading the Rotten Tomatoes server, we set an interval between requests. Finally, the results were written to CSV files for further analysis.
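A crawler of the kind described above might look roughly like the following sketch. It is not the project's actual pipeline: the function names, the two-second default delay, and the choice to extract only the page title are assumptions made for illustration. The third-party packages (Requests and BeautifulSoup) are imported lazily so that the pacing logic can be exercised without them.

```python
import csv
import time
from typing import Callable, Iterable, List


def fetch_with_requests(url: str) -> str:
    """Download one page and return its HTML (needs the requests package)."""
    import requests  # third-party: pip install requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_title(html: str) -> str:
    """Return the page's <title> text (needs the beautifulsoup4 package)."""
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else ""


def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_with_requests,
    parse: Callable[[str], str] = extract_title,
    delay_seconds: float = 2.0,  # assumed value; the source does not state the interval
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[str]]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    rows = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # fixed pause so the server is not flooded
        rows.append([url, parse(fetch(url))])
    return rows


def write_csv(rows: List[List[str]], path: str) -> None:
    """Write the crawled rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["url", "title"])
        writer.writerows(rows)
```

Calling `crawl` with a list of movie page URLs and passing the result to `write_csv` would produce one CSV row per page. Passing `fetch`, `parse`, and `sleep` as arguments keeps the request pacing separate from the network and parsing code, which makes the loop easy to check offline.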
First, we obtained lists of DC and Marvel movie titles and used the Rotten Tomatoes movie search to collect the link to each movie's home page. Second, we collected the basic information for each movie from its home page: critics consensus, critic score, audience score, number of critic reviews, number of audience reviews, movie abstract, rating, genre, director, writer, in-theaters date, on-disc/streaming date, box office, runtime, and studio. Third, we collected all available audience reviews. Each review contains the review text, a star rating, and a date. For each movie, at most 51 pages (about 1,000 reviews) are available on Rotten Tomatoes. In total, we collected 40,385 reviews for 46 DC movies and 21,186 reviews for 21 Marvel movies (61,571 reviews across 67 movies).
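The third step, paging through audience reviews up to the 51-page limit, can be sketched as below. The `Review` record mirrors the three fields named above (text, star, date) plus the movie title; `fetch_page` is a hypothetical callable standing in for the real request-and-parse code, and storing the star rating as a float is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

MAX_REVIEW_PAGES = 51  # page limit per movie reported in the text above


@dataclass
class Review:
    movie: str
    text: str
    star: float  # assumed numeric; the source only says each review has a "star"
    date: str


def collect_reviews(
    movie: str,
    fetch_page: Callable[[str, int], List[Dict[str, str]]],
    max_pages: int = MAX_REVIEW_PAGES,
) -> List[Review]:
    """Walk review pages 1..max_pages, stopping at the first empty page."""
    reviews: List[Review] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(movie, page)
        if not batch:
            break  # no more reviews for this movie
        for item in batch:
            reviews.append(
                Review(movie, item["text"], float(item["star"]), item["date"])
            )
    return reviews
```

Stopping at the first empty page avoids requesting pages that do not exist for movies with few reviews, while the cap keeps the loop bounded for popular ones.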