3.1 Dataset Description
The Reddit posts and comments in this dataset come from 892 different users. Of these, 137 are labeled as depressed, and the remaining 755 serve as a control group. The Reddit API appears to limit retrieval to 1000 posts and 1000 comments per user, as shown in Figure 1. Both posts and comments are chronologically ordered, which is crucial given the goal of early depression identification. The collection was built by creating one XML file per user. Only users who have publicly stated that they have been diagnosed with depression are labeled as depressed. Users who merely participate in subreddits devoted to the topic are not labeled as depressed, since they may be there to learn more about depression because someone close to them is experiencing it. Each entry is identified by:
The dataset is split into 486 training subjects, 83 of whom are positive, and 406 test subjects, 54 of whom are positive.
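The counts reported above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch rather than part of the original pipeline: the `Split` class and its field names are hypothetical, and only the four counts (486, 83, 406, 54) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Hypothetical container for one partition of the dataset."""
    name: str
    subjects: int   # total users in the partition
    positive: int   # users labeled as depressed

    @property
    def control(self) -> int:
        # Users not labeled as depressed.
        return self.subjects - self.positive

    @property
    def positive_rate(self) -> float:
        # Fraction of users in the partition labeled as depressed.
        return self.positive / self.subjects


# Counts taken from the dataset description.
train = Split("train", subjects=486, positive=83)
test = Split("test", subjects=406, positive=54)

total_subjects = train.subjects + test.subjects
total_positive = train.positive + test.positive
total_control = train.control + test.control

for split in (train, test):
    print(
        f"{split.name}: {split.subjects} subjects, "
        f"{split.positive} positive ({split.positive_rate:.1%}), "
        f"{split.control} control"
    )
print(f"total: {total_subjects} subjects, {total_positive} positive, {total_control} control")
```

The totals agree with the 892 users and 137 positive users reported for the full collection. Positive users make up roughly 17% of the training set and 13% of the test set, so the classes are imbalanced in both partitions.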