GITCoP: A Machine Learning Based Approach to Predicting Merge Conflicts from Repository Metadata \cite{ziegler2017}

This MSc thesis aims to predict merge conflicts using machine learning techniques. It uses three datasets: the jdime-dataset and two self-mined datasets obtained by crawling GitHub (in C and Java). The features of each branch and the conflict features are evaluated separately and in combination, and the combination is found to be more effective. Decision Trees, Support Vector Machines, Naive Bayes, Logistic Regression, and Random Forest are employed as classifiers, and AdaBoost is used to improve classification performance. The validation process is sound, since Accuracy, Precision, Recall, and F1-score are reported together. Reporting all of these measures is especially important for this problem because the classes are imbalanced. However, the feature selection and extraction could be better. First, only a small number of hand-picked features are employed. Second, code features are ignored. Finally, the features are used without any preprocessing or extraction step. As a suggestion, Principal Component Analysis could be applied to reduce noise and make the features more discriminative. From the classification point of view, the employed classifiers are basic, and using state-of-the-art models may increase performance.
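
To illustrate why accuracy alone is misleading on imbalanced data, the following is a minimal sketch in plain Python. The toy labels (95 clean merges, 5 conflicting merges), the two hypothetical predictors, and the helper name `classification_metrics` are assumptions made for illustration only; they are not taken from the thesis or its datasets.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy data: 95 clean merges (0) and 5 conflicting merges (1).
y_true = [0] * 95 + [1] * 5

# Predictor A: a trivial baseline that always predicts "no conflict".
always_clean = [0] * 100

# Predictor B: raises 6 false alarms but catches 4 of the 5 conflicts.
detector = [1] * 6 + [0] * 89 + [1] * 4 + [0] * 1

baseline = classification_metrics(y_true, always_clean)
useful = classification_metrics(y_true, detector)

for name, scores in (("always_clean", baseline), ("detector", useful)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

In this toy setting the always-clean baseline has the higher accuracy (0.95 versus 0.93) even though it never detects a conflict, whereas Precision, Recall, and F1-score are all zero for it and nonzero for the second predictor.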