Text Analysis of Customer Reviews
Customer feedback is one of the most direct indicators of a product's performance, and it increasingly affects buyers’ decisions. Customer reviews have therefore become a valuable source for companies to analyze their products. There is plenty of academic research on analyzing customer reviews \citep{Nadali_2010}\cite{Bagheri_2013}\cite{Doan_2016}\cite{Mehto_2016}; however, these studies largely focus on sentiment analysis of reviews \cite{Ordenes_2014}. Although it is crucial to identify sentiment in feedback, textual reviews contain more specific information. One relevant study by \citet{Xu_2011} uses Amazon customer reviews to extract and analyze comparative relations between products. The authors propose that text analysis of customer reviews can help companies discover potential risks and improve their product designs and marketing strategies. Building on this idea, we perform text analysis of movie reviews to identify potential comparative relations and differences between DC and Marvel movies.
Theory and Hypotheses
Our hypothesis is that DC and Marvel movies are significantly distinguishable in their styles and topics. We believe that DC and Marvel movies create different experiences for audiences, so differences in their reviews can be used to demonstrate differences between the movies themselves.
This can be framed as a binary classification problem, where the studio label (DC or Marvel) is the dependent variable with two classes and the audience reviews are the independent variables. It can also be framed as a clustering problem, in which we distinguish DC and Marvel movies with topic models, assuming they focus on different topics such as democracy, feminism, self-growth, and collaboration.
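The classification framing can be illustrated with a minimal sketch. The classifier is not specified here, so multinomial Naive Bayes over word counts is an assumption chosen for brevity, and the toy reviews are invented purely for illustration; they are not drawn from the collected data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a review and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of reviews
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            n_words = sum(counts.values())
            # Log prior plus the smoothed log likelihood of each known word.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocab:
                    score += math.log((counts[word] + 1) / (n_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical toy reviews, invented for illustration only.
train_texts = [
    "dark gritty and serious tone",
    "brooding hero in a dark city",
    "funny team banter and colorful action",
    "lighthearted humor with a fun team",
]
train_labels = ["DC", "DC", "Marvel", "Marvel"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("a dark and serious hero")
```

In this framing the review text supplies the features and the studio is the label; how well the label can be predicted on held-out reviews then serves as one measure of how distinguishable the two studios are.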
Data
Data collection
We collected the reviews and the basic information of the movies from the Rottentomatoes website with a web crawler. Pipelines were built in Python to collect the website data automatically. The Requests, BeautifulSoup, and re packages were used to request and parse the web pages. To avoid overloading the Rottentomatoes server, we set a time interval between requests.
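A throttled crawl of this kind might look like the following sketch. To stay runnable without extra packages it uses only the standard library (\texttt{urllib} and \texttt{re}) in place of Requests and BeautifulSoup; the function names, the two-second default interval, the user-agent string, and the example URL are all assumptions, not the pipeline's actual values.

```python
import re
import time
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Download one page and return its decoded HTML."""
    request = Request(url, headers={"User-Agent": "review-research-crawler"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(urls, fetch=fetch_html, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests to limit server load."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # fixed interval between consecutive requests
        pages[url] = fetch(url)
    return pages


def extract_title(html):
    """Pull the <title> text out of a page with a regular expression."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# Offline demonstration with a stand-in fetcher (hypothetical URL, no network).
fake_pages = {"https://example.com/m/a": "<html><title> Movie A </title></html>"}
pages = crawl(list(fake_pages), fetch=fake_pages.get, delay_seconds=0)
titles = [extract_title(html) for html in pages.values()]
```

Passing the fetcher and the sleep function as parameters keeps the throttling logic testable offline; a real run would rely on the default \texttt{fetch\_html} and a nonzero interval.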
We first obtained a list of DC and Marvel movies and then requested and collected the corresponding homepage links from Rottentomatoes. From each link, we collected the basic information of the movie, including the critics consensus, critic score, audience score, count of critic reviews, count of audience reviews, movie abstract, rating, genre, director, writer, in-theater date, on-disc/streaming date, box office, runtime, and studio. Then, we collected all the audience review data, with each review containing the review text, star rating, and date. For each movie, a maximum of 51 pages (about 1000 reviews) is available on Rottentomatoes. In total, 40385 reviews for 46 DC movies and 21186 reviews for 21 Marvel movies were collected and written to CSV files for further analysis.
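As a rough sketch of the CSV output step, the review records could be serialized as shown below. The review fields (text, star, date) follow the description above; the \texttt{movie} and \texttt{studio} columns, the column names, and the sample row are assumptions added for illustration.

```python
import csv
import io

# Column layout is an assumption; only text, star, and date are named in the text.
REVIEW_FIELDS = ["movie", "studio", "text", "star", "date"]


def write_reviews(rows, handle):
    """Write review dictionaries to an open text handle as CSV with a header."""
    writer = csv.DictWriter(handle, fieldnames=REVIEW_FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def read_reviews(handle):
    """Read the CSV back into a list of dictionaries (all values are strings)."""
    return [dict(row) for row in csv.DictReader(handle)]


# Hypothetical record; the comma and quotes show that free text is escaped safely.
rows = [
    {
        "movie": "Example Movie",
        "studio": "DC",
        "text": 'Great, "fun" ride',
        "star": "4.5",
        "date": "2017-06-02",
    }
]

buffer = io.StringIO(newline="")
write_reviews(rows, buffer)
buffer.seek(0)
loaded = read_reviews(buffer)
```

Using the \texttt{csv} module rather than manual string joining matters here because review text routinely contains commas, quotes, and line breaks.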
Preliminary analysis
DC and Marvel movies are similarly popular among audiences. However, based on our preliminary analysis (Fig.~\ref{652341}), Marvel movies have higher and more stable scores. The average score for Marvel is 83.5 with a standard deviation of 7.4, while DC has an average score of 67.3 with a standard deviation of 22.5. In terms of box office, Marvel movies average 330 million dollars, which is double the DC average.
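The summary statistics above are per-studio means and standard deviations. A minimal sketch of that computation follows; the scores are placeholder values rather than the collected data, and the use of the sample (rather than population) standard deviation is an assumption.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return the mean and sample standard deviation of a list of scores."""
    return {"mean": mean(scores), "std": stdev(scores)}


# Placeholder scores for two hypothetical studios, not the collected data.
scores_by_studio = {
    "studio_a": [80, 85, 90],
    "studio_b": [40, 70, 100],
}

summary = {studio: summarize(scores) for studio, scores in scores_by_studio.items()}
```

A smaller standard deviation at a similar or higher mean is what "more stable" refers to in the comparison above.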