Moving to Stack Overflow: Best-Answer Prediction in Legacy Developer Forums


Context: More and more developer communities are abandoning their legacy support forums and moving to Stack Overflow. The motivations are diverse, yet they typically include achieving faster response times and greater visibility through access to a modern and very successful infrastructure. One downside of migration, however, is that the history and the crowdsourced knowledge hosted at the previous site remain separated or even get lost if a community decides to completely abandon the legacy developer forum.
Goal: Adding to the body of evidence of existing research on best-answer prediction, here we show that, from a technical perspective, the content of existing developer forums might be automatically migrated to Stack Overflow, even though most forums do not allow marking a question as resolved, a distinctive feature of modern Q&A sites.
Method: We trained a binary classifier with data from Stack Overflow and then tested it with data scraped from the legacy forum of Docusign, a developer community that has recently completed the move.
Results: Our findings show that best answers can be predicted with good accuracy by relying only on shallow linguistic (text) features, such as answer length and number of sentences, combined with other features like answer upvotes and age, all of which can be easily computed in near real time.
Conclusions: Our results provide initial yet positive evidence towards the automatic migration of crowdsourced knowledge from legacy forums to modern Q&A sites.


Modern, community-based question answering (Q&A) sites like Stack Overflow descend from mailing lists and web-based discussion forums. As software grew in complexity, developers increasingly needed to seek support from experts outside their inner circles (Squire 2015), a trend that has been steadily growing over the last two decades. At first, mailing lists were useful because they allowed developers to archive and search the generated knowledge; still, searching the archived content required a separate web interface. Web-based discussion forums with integrated search then represented a step forward in both ease of use and efficiency, preventing the same questions from being asked over and over. Nonetheless, the inability to mark help requests as resolved made all that user-generated knowledge difficult to access because, in popular threads, a complete and useful answer is often located several pages away from the start. The ability to show the accepted answer at the top of a thread, combined with the psychological effects of gamification, is among the main factors that boosted the success of Q&A platforms like Quora and Yahoo! Answers (Grant 2013, Mamykina 2011).


Forum category Question threads Questions resolved (%) Answers Answers accepted (%)
Java 103 43 (41.75%) 383 11.23%
.Net 490 183 (37.35%) 1,660 11.02%
Ruby 156 39 (25.00%) 553 7.05%
PHP 144 53 (36.81%) 616 8.60%
Misc 679 155 (22.83%) 1,538 10.08%
Tot 1,572 473 (30.08%) 4,750 9.96%

As such, more and more developer communities have recently been abandoning their legacy support forums and moving to Stack Overflow (Squire 2015). Being the first and largest Q&A site of the Stack Exchange network, Stack Overflow is a community where millions of programmers ask questions and provide answers about software development on a daily basis. The motivations for moving are diverse, yet they typically include achieving faster response times, reaching out to a larger audience of experts, increasing visibility, and having free access to a modern and very successful infrastructure. To further encourage and facilitate the move, Stack Overflow even allows communities to define custom tags that identify threads from specific developer communities. Despite these evident benefits, one downside of moving to another platform is that the history and the crowdsourced knowledge generated through a legacy support forum remain separated from the new platform or, if a community decides to completely dismantle the forum, even get lost. For instance, based on the analysis of 21 developer support communities that moved to Stack Overflow between 2011 and 2014 (Squire 2015), we found that about 20% of them did not archive their content, thus causing a loss of knowledge that still appears in Google searches but turns out to be inaccessible. One might question the value of “old” knowledge hosted at legacy forums. Indeed, information obsolescence in knowledge platforms is a challenging research problem, i.e., even questions in Stack Exchange may become outdated. However, recent research by Anderson et al. (Anderson 2012) has observed the emergence of many conversations of long-lasting value in Stack Overflow, an effect of its shift from supporting the initial information-seeking scenario towards creating crowdsourced knowledge that remains valuable over time.

To date, none of the available modern Q&A platforms allows importing existing content from other sources. The migration of content poses several challenges, such as coping with different interaction styles, ensuring the quality of imported content, dealing with user reputation, and handling the lack of information about accepted answers and resolved questions. Still, migrating the whole history of question threads (plus identifying the accepted answers) from a legacy forum to a modern Q&A site such as Stack Overflow would benefit all developers who browse the web seeking solutions to their technical problems. Some initial work has been performed by Vasilescu et al. (Vasilescu 2014), who investigated how to match the identities of the R-help community members after the migration onto the Stack Exchange platform from the original mailing list. Here, instead, we demonstrate the technical feasibility of content migration. Building on our previous research on answer quality in Stack Overflow (Calefato 2015), we design and run an experiment in which a binary classifier is trained to identify the accepted answers in question threads from a legacy developer forum. Specifically, we use a training dataset obtained from Stack Overflow, consisting of answers from question threads both closed (i.e., resolved, with one accepted answer in the thread) and still open (i.e., unresolved, with no answer in the thread marked as accepted). The classifier is then evaluated using a test dataset obtained from the legacy support forum of Docusign, an electronic signature API for securing digital transactions, whose community has recently abandoned the forum and moved to Stack Overflow.
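The cross-corpus setup, i.e., fitting on one source and evaluating on the other, can be sketched as follows. For illustration only, a hypothetical single-feature threshold rule stands in for the actual classifier, and both datasets are toy examples; the real features and model are described later in the paper.

```python
def train_threshold(train):
    """Pick the decision threshold on one numeric feature that maximizes
    accuracy on the (Stack Overflow) training pairs of (value, is_accepted)."""
    best_t, best_acc = None, -1.0
    for t in sorted({value for value, _ in train}):
        acc = sum((value >= t) == label for value, label in train) / len(train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def accuracy(threshold, test):
    """Apply the learned threshold to a held-out (Docusign) test set."""
    return sum((value >= threshold) == label for value, label in test) / len(test)

# Toy data: (feature value, is_accepted) pairs -- purely illustrative.
so_train = [(9, True), (7, True), (3, False), (2, False), (5, False)]
docusign_test = [(8, True), (4, False), (1, False)]

t = train_threshold(so_train)
acc = accuracy(t, docusign_test)
```

The point of the sketch is the protocol, not the model: no parameter is ever fit on the Docusign data, which is only used for evaluation.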

The contribution of the paper is twofold. First, the results show that we are able to identify best answers with good performance (accuracy ~90%, F=.86, AUC=.71), relying only on shallow, easy-to-compute text features, such as the number of words and sentences, whose values are then ranked within each thread (Gkotsis 2014, Gkotsis 2015). Our classification model does not rely on any user-related features (e.g., user reputation, badges, number of accepted answers) because they are generally not available in old support forums; besides, even when available, user-related features are very dynamic and need to be constantly recomputed. Second, we use two corpora from different data sources, thus strengthening the generalizability and robustness of our best-answer classifier. The remainder of the paper is structured as follows. In Section \ref{sec:datasets}, we present the two datasets used to train and evaluate our classifier. In Section \ref{sec:features} and Section \ref{sec:evaluation}, respectively, we describe the features included in our model and report the results from the experiment on best-answer prediction. The findings and their limitations are first discussed in Section \ref{sec:discussion} and then compared to previous work in Section \ref{sec:related_work}. Finally, we conclude in Section \ref{sec:conclusions}.
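A minimal sketch of such shallow features, assuming plain-text answer bodies, a naive regex tokenization, and a simple within-thread ranking (the exact tokenization and ranking scheme of Gkotsis 2014 may differ):

```python
import re

def shallow_features(answer_body):
    """Compute shallow linguistic features from an answer's plain text."""
    words = re.findall(r"\b\w+\b", answer_body)
    sentences = [s for s in re.split(r"[.!?]+", answer_body) if s.strip()]
    return {"length": len(words), "sentences": len(sentences)}

def rank_within_thread(answers, key):
    """Replace comparisons on raw values with each answer's rank among the
    answers of its own thread (1 = highest value of the feature)."""
    for rank, a in enumerate(
        sorted(answers, key=lambda a: a[key], reverse=True), start=1
    ):
        a[key + "_rank"] = rank
    return answers

# Hypothetical two-answer thread, for illustration.
thread = [shallow_features(body) for body in (
    "Use the REST API. It retries automatically.",
    "Try this.",
)]
rank_within_thread(thread, "length")
```

Ranking within the thread makes the feature relative, so the classifier compares answers to their siblings rather than to a corpus-wide absolute scale.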




Two datasets are used to run the experiment reported in this paper. The first dataset comes from a dump of the legacy forum used by the developer community of Docusign. Since June 2013, the original forum has been read-only, inviting community members to use Stack Overflow instead for development-focused requests, under the custom tag docusignapi. To obtain a dump of the content from the now-abandoned Docusign developer support forum, we built a custom scraper using the Python library Scrapy. The scraper downloaded all the question threads from the forum and stored them in a local database for convenience. The Docusign API is available for several programming languages, namely Java, .NET, Ruby, and PHP. Consequently, the forum content was organized by programming language, plus a miscellaneous category containing questions related to the API (e.g., SOAP errors) but not tied to one specific language. As shown in Table \ref{tab:docusign}, overall the dataset contains 4,750 answers to 1,572 question threads. Besides, we note that the dataset is skewed towards unaccepted answers, since only ~10% of the answers are marked as accepted solutions.
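As a sanity check, the skew reported in Table \ref{tab:docusign} can be recomputed from the raw counts, assuming one accepted answer per resolved thread:

```python
# Raw counts from Table 1: (question threads, resolved threads, answers).
# Assumption: each resolved thread has exactly one accepted answer.
categories = {
    "Java": (103, 43, 383),
    ".Net": (490, 183, 1660),
    "Ruby": (156, 39, 553),
    "PHP": (144, 53, 616),
    "Misc": (679, 155, 1538),
}
total_answers = sum(answers for _, _, answers in categories.values())
total_accepted = sum(resolved for _, resolved, _ in categories.values())
skew = 100 * total_accepted / total_answers  # percentage of accepted answers
print(f"{total_accepted}/{total_answers} = {skew:.2f}% accepted")
# -> 473/4750 = 9.96% accepted
```

Per-category rates computed the same way match the table as well (e.g., 43/383 ≈ 11.23% for Java), confirming that roughly nine out of ten answers in the dump are negative examples.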

The Docusign dataset is intended to be used as the test set for the binary classifier discussed next (see Section \ref{sec:features}), which is trained using another dataset retrieved from Stack Overflow. We opportunistically selected Docusign as the test set because its forum allowed the question asker to select one answer in the thread as the accepted solution. Although this is typical of Stack Overflow and other modern Q&A sites, such a feature is hard to find in legacy web forums. Because the dataset is already annotated with accepted answers, it avoids the need for manually creating a gold standard, i.e., asking experts to identify, among many answers, the accepted solution (if any) for each thread in the dump. Such a procedure is obviously more error-prone than relying on a dataset like Docusign, in which an accepted answer is always marked by the original asker, i.e., the one user who had a problem and sought a solution from others. Therefore, albeit not common in other legacy forums, Docusign represents an optimal choice for validating our approach.

Regarding the training set, it consists of over 232,00