hypothesis is that the DC and Marvel movies have significantly distinguishable styles and discuss different topics. As a result, DC and Marvel movies create different experiences for the audience, so the reviews should reflect the differences between the movies.
Essentially, this is a binary classification problem. The dependent variable is the studio (DC or Marvel), and the independent variables are the audience reviews. That is, we can classify a movie as DC or Marvel based on its reviews. At the same time, this can be treated as a clustering problem: assuming the two studios discuss different topics, such as democracy, feminism, self-growth, and collaboration, we can distinguish DC and Marvel movies with topic models.
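To make the classification framing concrete, here is a minimal sketch of a multinomial Naive Bayes classifier that predicts the studio from review text. It uses only the Python standard library. The toy reviews and the function names (`train_nb`, `predict_nb`) are hypothetical and are not the project's actual data or code.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy reviews; the real data would come from the crawled corpus.
TRAIN = [
    ("dark gritty serious tone", "DC"),
    ("dark brooding hero slow", "DC"),
    ("gritty serious slow plot", "DC"),
    ("fun witty team humor", "Marvel"),
    ("fun colorful team banter", "Marvel"),
    ("witty humor colorful action", "Marvel"),
]


def train_nb(examples):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    doc_counts = Counter()              # label -> number of reviews
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log posterior for a review."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # Log prior from the share of training reviews with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train_nb(TRAIN)
print(predict_nb(model, "a fun and witty team movie"))
print(predict_nb(model, "dark and serious"))
```

In this sketch the review text is the only input and the studio label is the output, which matches the dependent/independent variable setup described above. A real run would need a held-out test set to estimate accuracy.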
Data
- collected data with a web crawler
- text preprocessing
- preliminary analysis (time series, rating difference, box office difference, etc.)
Methods
- Most frequent words: to make the differences stand out, we removed the words shared by both studios.
- Modeling: SVM, Naive Bayes (NB), and random forest (RF) were used to classify the movie type (DC or Marvel) from the review text. RF performed best based on the accuracy graph, and using negative reviews gave the highest accuracy.
- As expected, most of the frequent words and important features were movie, character, or people names, so we removed the major movie names and character/people names and ran everything again.
- Topic models
- Wordfish: the best model.
- LDA: produced significant results.
- LSA: performed worse than LDA and was abandoned.
- The Structural Topic Model (STM) yields 90 as the best number of topics with the algorithm of Lee and Mimno (2014), but 90 topics are not suitable for this case. When the number of topics is set to 5, the same as in the LDA model, 2 specific topics distinguish DC from Marvel, while the other 3 topics show only slight differences between the two. To our surprise, the topic that represents DC is characterized by "bad" and "comic", while the topic that represents Marvel is characterized by "good" and "fun".
- Word Embedding model: context analysis.
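
The frequent-word step in the list above (count words per studio, drop movie and character names, then drop words shared by both studios) can be sketched as follows. This is an illustrative sketch only: the toy reviews, the `NAMES` set, and the helper names are hypothetical, and stopword removal is omitted for brevity.

```python
from collections import Counter

# Hypothetical toy reviews and name list, for illustration only.
DC_REVIEWS = [
    "batman is dark and the plot is slow",
    "dark movie with great villain",
]
MARVEL_REVIEWS = [
    "thor is fun and the plot is great",
    "fun movie with witty humor",
]
NAMES = {"batman", "thor"}  # movie/character names to exclude


def word_counts(reviews, excluded):
    """Count tokens across reviews, skipping excluded words."""
    counts = Counter()
    for review in reviews:
        counts.update(t for t in review.lower().split() if t not in excluded)
    return counts


def distinctive_words(own, other, top_n=3):
    """Most frequent words in `own` that never appear in `other`."""
    unique = Counter({w: c for w, c in own.items() if w not in other})
    return unique.most_common(top_n)


dc = word_counts(DC_REVIEWS, NAMES)
marvel = word_counts(MARVEL_REVIEWS, NAMES)
print(distinctive_words(dc, marvel))
print(distinctive_words(marvel, dc))
```

Removing every shared word is a deliberately strict filter. A softer alternative would be to rank words by the difference in relative frequency between the two studios, but that is a design choice beyond what the notes describe.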