\section{Feature description}
\label{sec:features}

In this section, we describe the features employed to train and test our classifier. Starting from the information elements presented in the previous section, we define four main categories of features: linguistic, vocabulary, meta, and thread. As further discussed in Section \ref{sec:related_work}, we do not exploit any user-related features (e.g., reputation score) because such information is generally not available in legacy web forums. Furthermore, user-related content changes frequently over time and, thus, the proper use of user-related features would require real-time data analysis.

The overall set of 22 features is reported in Table \ref{tab:features}, arranged by category. Both the linguistic and vocabulary feature categories are concerned with readability. Specifically, linguistic features represent attributes of questions and answers and are intended to estimate their quality. Such linguistic features are called “shallow” because they measure readability through the “surface” properties of a text, such as the number of words and the average word length \cite{Pitler_2008}. As such, they are also computationally cheap. For example, the linguistic features include the length (in characters), the word count, and the average number of words per sentence of an answer. We also add contains hyperlinks to this category because, in our previous work on the factors affecting the probability of an answer being accepted, we found that the presence of links to external resources is positively related to the perception of its completeness \cite{MSR_2015}.
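To make the computation of these shallow features concrete, the following sketch shows one way to derive them from the raw text of an answer (the helper name, the HTML-stripping step, and the tokenization rules are illustrative assumptions of ours, not part of the model definition):

\begin{verbatim}
# Illustrative sketch: shallow linguistic features of a single answer.
# Tokenization and HTML stripping are simplifying assumptions.
import re

def extract_linguistic_features(answer_body):
    text = re.sub(r'<[^>]+>', ' ', answer_body)              # strip markup
    sentences = [s for s in re.split(r'[.!?]+\s+', text) if s.strip()]
    words = re.findall(r'\w+', text)
    return {
        'length': len(text),
        'word_count': len(words),
        'sentence_count': len(sentences),
        'longest_sentence': max((len(s) for s in sentences), default=0),
        'avg_words_per_sentence': len(words) / max(len(sentences), 1),
        'avg_chars_per_word':
            sum(len(w) for w in words) / max(len(words), 1),
        'contains_hyperlinks': bool(re.search(r'https?://', answer_body)),
    }
\end{verbatim}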

Besides the linguistic features, to estimate the readability of an answer we also employ two vocabulary features, the normalized log likelihood and the Flesch-Kincaid Grade. The normalized log likelihood (\(LL_{n}\), hereinafter), already employed in previous work \cite{Pitler_2008,Gkotsis_2014,Gkotsis_2015}, uses a probabilistic approach to measure to what extent the lexicon of an answer is distant from the vocabulary used by the whole forum community. Specifically, \(LL_{n}\) is defined as follows:

\[\label{eq:LL} LL = \sum_{w_{s}}C\left(w_{s}\right)\log\left(P\left(w_{s}|Voc\right)\right), \qquad LL_{n}=\frac{LL}{UC\left(s\right)}\]

Given \(s\), a sentence in an answer, \(P(w_{s}|Voc)\) is the probability of the word \(w_{s}\) occurring, according to the background corpus \(Voc\), and \(C\left(w_{s}\right)\) is the number of times the word \(w_{s}\) occurs in \(s\). \(LL\) is normalized by dividing it by \(UC\left(s\right)\), the number of unique words occurring in \(s\). The normalization is necessary to take into account answers of different lengths.
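As an illustration, the following sketch computes \(LL_{n}\) from a background word-frequency table (the add-one smoothing used to handle out-of-vocabulary words and the function signature are our own assumptions, not prescribed by definition (\ref{eq:LL})):

\begin{verbatim}
# Illustrative sketch of LL_n; smoothing and names are assumptions.
import math
from collections import Counter

def normalized_log_likelihood(answer_words, background_counts,
                              total_background):
    counts = Counter(answer_words)
    ll = 0.0
    for word, c in counts.items():
        # P(w | Voc): relative frequency in the background corpus,
        # with add-one smoothing to avoid log(0) for unseen words.
        p = (background_counts.get(word, 0) + 1) / \
            (total_background + len(background_counts))
        ll += c * math.log(p)
    unique_words = len(counts)                 # UC(s)
    return ll / unique_words if unique_words else 0.0
\end{verbatim}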

The Flesch-Kincaid Grade (F-K, hereinafter), defined in \cite{Kincaid_1975} and already used by Burel et al. in \cite{Burel_2012}, is a readability metric for English calculated as follows:

\[\label{eq:FKG} F\text{-}K_{p_{i}}(awps_{p_{i}}, asps_{p_{i}}) = 0.39~awps_{p_{i}} + 11.8~asps_{p_{i}} - 15.59\]

In the definition (\ref{eq:FKG}) above, for any given post \(p_{i}\), \(awps_{p_{i}}\) is the average number of words per sentence and \(asps_{p_{i}}\) is the average number of syllables per word.
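For clarity, a minimal implementation of the F-K computation is sketched below (the syllable counter is a rough vowel-group heuristic of ours; the original metric does not prescribe one):

\begin{verbatim}
# Illustrative sketch of the Flesch-Kincaid Grade in Eq. (FKG).
import re

def count_syllables(word):
    # Rough heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r'[.!?]+\s+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    awps = len(words) / len(sentences)          # avg words per sentence
    asps = sum(count_syllables(w) for w in words) / len(words)
    return 0.39 * awps + 11.8 * asps - 15.59
\end{verbatim}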

The meta category includes two features. The first, age, applies only to answers and measures the time elapsed since the question was posted. The second, rating score, is the score (i.e., the number of upvotes minus the number of downvotes) a post received from users and reflects its perceived usefulness. The thread category includes only the answer count, i.e., the number of answers to a question, a measure that reflects the popularity of a thread. Furthermore, as shown in the rightmost column of Table \ref{tab:features}, for all the feature categories except thread, we also assign ranks after computing the numeric values. In other words, for each question thread, we group all the answers, compute a feature, and then rank the values in ascending or descending order (in the following, we refer to this procedure as ranking). For instance, for the word count linguistic feature, the answer in the thread with the largest value ranks 1, the second largest ranks 2, and so on (i.e., descending order), because we assume that longer answers are more accurate and elaborate and, hence, have a larger chance of being accepted. For the age feature, instead, we assign rank 1 to the quickest answer (i.e., ascending order), because previous research has demonstrated that responsiveness plays a major role in getting answers accepted in Q\&A websites \cite{Mamykina_2011}. This ranking approach is inspired by the discretization of features reported by Gkotsis et al. \cite{Gkotsis_2014,Gkotsis_2015}, who found that feature discretization makes binary classifiers for best-answer prediction robust across different Q\&A sites, albeit all belonging to the same Stack Exchange platform. This is appealing to us because we use different datasets for training and testing our model.
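The following sketch illustrates the ranking procedure on a table of answers (the pandas data frame layout and the column names, e.g., question_id, are assumptions made for illustration):

\begin{verbatim}
# Illustrative sketch of per-thread ranking; column names are assumptions.
import pandas as pd

def rank_within_threads(df, feature, ascending):
    # Rank 1 goes to the smallest value when ascending (e.g., age)
    # and to the largest value when descending (e.g., word count).
    df[feature + '_rank'] = (df.groupby('question_id')[feature]
                               .rank(method='min', ascending=ascending))
    return df

# Example usage:
# answers = rank_within_threads(answers, 'word_count', ascending=False)
# answers = rank_within_threads(answers, 'age', ascending=True)
\end{verbatim}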

\begin{table}
\caption{Summary of the features in our model, arranged by category.}
\label{tab:features}
\centering
\begin{tabular}{lll}
\hline
Category & Feature & Also ranked \\
\hline
Linguistic & Length (in characters) & Yes \\
 & Word count & Yes \\
 & No. of sentences & Yes \\
 & Longest sentence (in characters) & Yes \\
 & Avg words per sentence & Yes \\
 & Avg chars per word & Yes \\
 & Contains hyperlinks & No \\
\hline
Vocabulary & \(LL_{n}\) & Yes \\
 & \(F\)-\(K\) & Yes \\
\hline
Meta & Age (hh:mm:ss) & Yes \\
 & Rating score (upvotes \(-\) downvotes) & Yes \\
\hline
Thread & Answer count & No \\
\hline
\end{tabular}
\end{table}

Finally, to conclude the description of how our classifier is built, we clarify our definition of “best answer”. In our study, as well as in Stack Overflow and Docusign, the best answer is the one marked as accepted by the original asker. However, sometimes the best answer in a thread is not actually the accepted one – i.e., the one found to be useful by the original asker – but rather the one that receives the most upvotes – i.e., the one found to be the most useful by the whole community. Therefore, we clarify that in our study we are not considering the “absolute best answer”, but rather the “fastest and good-enough answer” that provides a prompt and effective solution to the problem reported by the asker. This is because, on the one hand, we believe that the time dimension is a predictor that cannot be overlooked; hence, we included the meta feature age in our model. On the other hand, this is also the approach taken by Stack Overflow and, since it is the target platform of our migration experiment, we remain consistent with Stack Overflow's conceptualization of best answer.