Methods

Term Frequency Analysis 

We first preprocessed the review text by stemming words, removing punctuation, numbers, and stopwords, and building a document-feature matrix (DFM) for the Marvel reviews and for the DC reviews. We then computed the cosine similarity, Euclidean distance, and Manhattan distance between the two to measure the overall difference in term frequency, and extracted the 350 most frequent terms from each. To highlight the differences between the two sets of top terms, we removed the terms that appear in both.
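The comparison described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the analysis: it assumes the reviews are already tokenized and stemmed, and the function names (`term_counts`, `distances`, `distinctive_top_terms`) and the toy reviews are hypothetical.

```python
import math
from collections import Counter


def term_counts(docs):
    """Aggregate token counts over a list of already-tokenized documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


def distances(a, b):
    """Return cosine similarity, Euclidean distance, and Manhattan distance
    between two term-count vectors aligned on their shared vocabulary."""
    vocab = sorted(set(a) | set(b))
    x = [a.get(t, 0) for t in vocab]
    y = [b.get(t, 0) for t in vocab]
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(sum(p * p for p in x))
    norm_y = math.sqrt(sum(q * q for q in y))
    cosine = dot / (norm_x * norm_y)
    euclidean = math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))
    manhattan = sum(abs(p - q) for p, q in zip(x, y))
    return cosine, euclidean, manhattan


def distinctive_top_terms(a, b, n):
    """Take the n most frequent terms of each group and drop the shared ones."""
    top_a = [t for t, _ in a.most_common(n)]
    top_b = [t for t, _ in b.most_common(n)]
    shared = set(top_a) & set(top_b)
    return ([t for t in top_a if t not in shared],
            [t for t in top_b if t not in shared])


# Toy, made-up reviews (already stemmed and tokenized).
marvel_docs = [["good", "iron", "man", "fun"], ["great", "fun", "iron"]]
dc_docs = [["good", "batman", "dark"], ["great", "dark", "batman"]]

marvel = term_counts(marvel_docs)
dc = term_counts(dc_docs)
cosine, euclidean, manhattan = distances(marvel, dc)
only_marvel, only_dc = distinctive_top_terms(marvel, dc, n=5)
print(cosine, euclidean, manhattan)
print(only_marvel, only_dc)
```

In the analysis itself `n` would be 350. The two distances are computed on raw counts here, so they grow with corpus size, while cosine similarity does not.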

Classification Models

We used three models to classify reviews as Marvel or DC: Naive Bayes, Support Vector Machine, and Random Forest. The predictors are each review's row of the document-feature matrix, and the label (DC or Marvel) is the outcome to predict. We used an 80%/20% train-test split for each model and compared the models' performance on the test sets.
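As an illustration of this setup, the sketch below runs an 80%/20% split and a multinomial Naive Bayes classifier with add-one smoothing, using only the Python standard library. It is an assumption-laden toy, not the implementation behind the reported results: the helper names, the random seed, and the ten made-up reviews are all hypothetical, and the Support Vector Machine and Random Forest models are not shown.

```python
import math
import random
from collections import Counter, defaultdict


def train_test_split(rows, test_share=0.2, seed=0):
    """Shuffle (tokens, label) rows and split them into train and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(round(len(rows) * (1 - test_share)))
    return rows[:cut], rows[cut:]


def fit_naive_bayes(train):
    """Collect the counts needed by a multinomial Naive Bayes classifier."""
    class_docs = Counter()
    class_terms = defaultdict(Counter)
    vocab = set()
    for tokens, label in train:
        class_docs[label] += 1
        class_terms[label].update(tokens)
        vocab.update(tokens)
    return class_docs, class_terms, vocab


def predict(model, tokens):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_docs, class_terms, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total_terms = sum(class_terms[label].values())
        score = math.log(class_docs[label] / total_docs)
        for t in tokens:
            if t in vocab:  # terms unseen in training are ignored
                score += math.log((class_terms[label][t] + 1)
                                  / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, test):
    """Share of test reviews whose predicted label matches the true label."""
    hits = sum(predict(model, tokens) == label for tokens, label in test)
    return hits / len(test)


# Toy, made-up reviews (already stemmed and tokenized).
reviews = [
    (["good", "fun", "humor"], "Marvel"), (["great", "fun", "team"], "Marvel"),
    (["fun", "humor", "team"], "Marvel"), (["good", "humor", "team"], "Marvel"),
    (["great", "fun", "humor"], "Marvel"),
    (["good", "dark", "gritti"], "DC"), (["great", "dark", "tone"], "DC"),
    (["dark", "gritti", "tone"], "DC"), (["good", "gritti", "tone"], "DC"),
    (["great", "dark", "gritti"], "DC"),
]

train, test = train_test_split(reviews)
model = fit_naive_bayes(train)
test_accuracy = accuracy(model, test)
print(len(train), len(test), test_accuracy)
```

Evaluating only on the held-out 20% keeps the comparison between models fair, since none of them has seen those reviews during training.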

Re-process Data

The preceding analyses showed that most of the differences between Marvel and DC reviews come from movie names, character names, and actor or actress names, for example Iron Man, Batman, and Steve Rogers. To extract information that reflects more fundamental differences between the movies, we removed the major movie, character, and actor names from the review data and re-ran all analyses and models on the reprocessed data.
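One way such a name filter could look is sketched below. The name list is a hypothetical three-entry sample taken from the examples in the text, not the list actually used, and the sketch assumes the names are removed from the raw text before tokenization so that multi-word names such as "Iron Man" are matched as a whole.

```python
import re

# Hypothetical sample; the full list of movie, character, and actor names
# used in the analysis is not reproduced here.
NAMES = ["Iron Man", "Batman", "Steve Rogers"]

# Longer names first, so a multi-word name is matched before any shorter
# name that might be contained in it.
NAME_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(n) for n in sorted(NAMES, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def strip_names(text):
    """Remove listed names from raw review text and tidy the whitespace."""
    without_names = NAME_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_names).strip()


cleaned = strip_names("Iron Man was fun, but Batman felt darker.")
print(cleaned)
```

Matching whole phrases avoids deleting ordinary words: "man" in "the man in the suit" is kept, whereas a token-level filter applied after stemming would remove it.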

Topic Models

We used Wordfish, LDA, and STM topic models to find out whether DC and Marvel movies focus on different themes or topics and, if so, which topics each emphasizes. These unsupervised models output weighted terms for each topic, as well as the proportion of each topic in each group of movies. We labeled each topic with its 5 most relevant words and then calculated the topic proportions for DC and Marvel reviews to compare the topics of their movies.
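The group-level comparison of topic proportions can be illustrated as follows. The sketch assumes that a fitted LDA or STM model has already produced one row of topic shares per review; it averages those rows within each group, which is one plausible reading of the ratio calculation and not necessarily the exact procedure used. The function name and the numbers are made up.

```python
from collections import defaultdict


def mean_topic_shares(doc_topics, labels):
    """Average per-document topic proportions within each label."""
    sums = {}
    counts = defaultdict(int)
    for shares, label in zip(doc_topics, labels):
        if label not in sums:
            sums[label] = [0.0] * len(shares)
        for k, share in enumerate(shares):
            sums[label][k] += share
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


# Made-up document-topic matrix: one row per review, one column per topic.
doc_topics = [[0.7, 0.3], [0.5, 0.5], [0.2, 0.8], [0.4, 0.6]]
labels = ["Marvel", "Marvel", "DC", "DC"]

group_shares = mean_topic_shares(doc_topics, labels)
for label, shares in group_shares.items():
    print(label, [round(s, 2) for s in shares])
```

A topic whose average share differs clearly between the two groups is a candidate theme that one studio's reviews emphasize more than the other's.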

Results

Cosine similarity, Euclidean distance, and Manhattan distance

The cosine similarity between the DC reviews and the Marvel reviews is 0.89 before the removal of movie and people names and 0.95 after. The Euclidean distance between the two decreased from 31373 to 21661 after removal, and the Manhattan distance decreased from 672959 to 579873. These results show a high degree of similarity in the terms used in Marvel and DC reviews, and indicate that the Marvel and DC names account for most of the differences.

Top frequent terms in DC reviews and Marvel reviews

Among the 20 most frequent terms extracted from the Marvel reviews and the DC reviews (Fig.\ref{835062}), many overlap, such as "good," "like," "great," "best," and "charact," which indicates that audiences tend to use many positive words when evaluating both DC and Marvel movies.