3.2 Dataset Preprocessing
Cleaning the data before generating model features is essential, particularly
for Reddit data, which comprises several distinct sets of data. The authors'
first step was to concatenate the N most recent titles or sentences for each
user, where N is the minimum number of posts required for the proposed model
to work correctly.
correctly. The time between two consecutive posts or comments made by
the same person was another type of data that the authors were able to
retrieve. After grouping the data, the authors removed all text-based
links, capitalised nothing, and created ”bag of words” representations
of the sequences. The authors then created a lemma for each term using
WordNetLemmatizer.
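A minimal sketch of this preprocessing pipeline, assuming hypothetical helper names. The lemmatization step is stubbed here with a toy lookup table purely for illustration; the authors use WordNetLemmatizer (which additionally requires the WordNet corpus to be installed):

```python
import re
from collections import Counter
from datetime import datetime

# Toy stand-in for WordNetLemmatizer, used so the sketch is self-contained.
TOY_LEMMAS = {"feelings": "feeling", "posts": "post", "was": "be"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")

def clean_and_vectorize(texts):
    """Strip links, lowercase, tokenize, lemmatize, and build bag-of-words counts."""
    bows = []
    for text in texts:
        text = URL_RE.sub("", text).lower()          # remove text-based links, lowercase
        tokens = re.findall(r"[a-z']+", text)        # crude word tokenization
        tokens = [TOY_LEMMAS.get(t, t) for t in tokens]  # lemmatize (stubbed)
        bows.append(Counter(tokens))                 # bag-of-words representation
    return bows

def inter_post_gaps(timestamps):
    """Seconds between consecutive posts/comments by the same user."""
    ts = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]

bows = clean_and_vectorize(
    ["My feelings today http://example.com", "Two posts was enough"]
)
gaps = inter_post_gaps(
    [datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 12, 30)]
)
```

In this sketch each user sequence becomes a `Counter` of lemmatized tokens, and `inter_post_gaps` yields the between-post intervals described above as a separate feature stream.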