\documentclass[11pt]{article}
\usepackage{fullpage}
\usepackage{setspace}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage[section]{placeins}
\usepackage{xcolor}
\usepackage{breakcites}
\usepackage{lineno}
\usepackage{hyphenat}
\usepackage{times}
\PassOptionsToPackage{hyphens}{url}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage{etoolbox}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\providecommand{\tightlist}{\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}%
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[ngerman,english]{babel}
\newcommand{\msun}{\,\mathrm{M}_\odot}
\begin{document}
\title{Ranking Significant Features for Increasing Engagement on Social Media
via Regression Analysis}
\author[1]{Thiago R. C. de Lima}%
\affil[1]{Affiliation not available}%
\vspace{-1em}
\date{\today}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\selectlanguage{english}
\begin{abstract}
Social media comprises platforms that have surpassed their initial goal of
connecting people just for the sake of socializing and currently provide
powerful tools for businesses to reach millions of viewers worldwide,
increasing their chances of gaining new customers. This short paper
uses the Buzz in Social Media data set, available at the UCI Machine
Learning Repository, to identify the attributes of social media
content that correlate most strongly with the amount of repercussion
it gained. To achieve this, several linear regression models are
constructed, then ranked by their respective model fit measure
(R-squared) and by their accuracy when tested against unseen data.%
\end{abstract}%
\sloppy
\section {Introduction}
\label{intro}
During the past two decades, the World Wide Web has seen a great shift in how users interact with the internet. The startup era has been a playground for entrepreneurs to try out innovative ways to captivate potential customers and retain users on their platforms and services. While many startups fail or go bankrupt \cite{fail,genome2019}, others are sold for millions of dollars \cite{million}, and a few find their own success and remain strong. For social media platforms to survive, an active user base is imperative. With the growing number of registered accounts and the ability to find out each person's likes and dislikes, such platforms have caught the interest of businesses wanting to invest in the marketing campaigns with the highest return \cite{saravanakumar2012social}. The more users interact with each other, the more data can be gathered and the better each user can be profiled, allowing targeted ads to be increasingly tailored to each potential customer.
Each social media website has its own algorithm for measuring engagement with the content its users publish. Typically, content that attracts engagement is rewarded with increased reach \cite{milan2015algorithms}. For instance, Twitter maintains a list of trending topics, and YouTube may feature a video on its home page. Naturally, the more this content is exposed to users who have not seen it yet, the more engagement it is likely to get.
In this research, I analyze data gathered from the Twitter platform. My goal is to find out whether any attribute of a tweet correlates strongly with the amount of discussion it gathers. Related work in this field is reported in \cite{kawala:hal-00881395}.
\section {Methodology}
\label{method}
\subsection {Data}
\label{data}
The data set used for this research was provided by François Kawala, Ahlame Douzal, Eric Gaussier, and Eustache Diemert (from Université Joseph Fourier and BestofMedia Group) and is currently available at the UCI Machine Learning Repository \cite{sets}, hosted by the University of California, Irvine \cite{re3dataorg}. The data set contains a total of 40000 rows and up to 96 columns, gathered from Twitter and TomsHardware \cite{team}. For this research, however, I use only the Twitter data, which contains 77 attributes for each of its 38393 samples.
Attributes are presented in temporal fashion, with one value per observation date. Each row contains 7 values for each of the following 11 categories: Number of Created Discussions, Author Increase, Attention Level, Burstiness Level, Number of Atomic Containers, Attention Level (measured with number of contributions), Contribution Sparseness, Author Interaction, Number of Authors, Average Discussions Length, Number of Active Discussion. Finally, each row holds a single value for Mean Number of Active Discussion, which I use as the target attribute, that is, the one I am trying to predict.
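To make this layout concrete, the following is a minimal sketch of how one row could be split into its eleven groups of seven observations plus the target. It assumes that a row is a comma-separated line with the 77 predictors first, grouped by category in the order listed above, and the target last. The function name \texttt{split\_row} and every short label other than NCD, NAC, NA and NAD are illustrative assumptions, not taken from the data set documentation.

```python
# Short labels for the eleven categories, in the order the text lists them.
# NCD, NAC, NA and NAD appear in the paper; the other labels are assumed here.
CATEGORIES = ["NCD", "AI", "AS_NA", "BL", "NAC", "AS_NAC",
              "CS", "AT", "NA", "ADL", "NAD"]
OBSERVATIONS = 7  # values per category in each row


def split_row(line):
    """Split one comma-separated row into per-category values and the target."""
    values = [float(v) for v in line.strip().split(",")]
    expected = len(CATEGORIES) * OBSERVATIONS + 1  # 77 predictors + 1 target
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    groups = {
        name: values[i * OBSERVATIONS:(i + 1) * OBSERVATIONS]
        for i, name in enumerate(CATEGORIES)
    }
    return groups, values[-1]


if __name__ == "__main__":
    # Synthetic row: predictors 0..76 followed by a made-up target of 99.
    row = ",".join(str(v) for v in range(77)) + ",99"
    groups, target = split_row(row)
    print(groups["NCD"], groups["NAD"], target)
```

The length check is there so that a row with a different column count (for instance one taken from the TomsHardware part of the data set) fails loudly instead of being silently misaligned.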
\subsection {Tools}
\label{tools}
I developed a script using the Python language and the SciPy package. Python is a general-purpose programming language \cite{van1995python,programming} that has gained popularity in the data science field \cite{kdnuggets,results}, and SciPy is a tool set for data analysis, manipulation and visualization \cite{blanco2013learning}. For this research, I used SciPy only to compute the linear regression itself, which returns the slope, the intercept, the correlation coefficient (R value), the P value and the standard error of the estimated slope \cite{guide}. The source code can be found in my repository on GitHub \cite{thiagorcdlsocial_media_buzz}.
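For reference, the main quantities of a one-predictor least-squares fit can be written as follows, where $x_i$ is the predictor value and $y_i$ the target value of sample $i$, $n$ is the number of samples, and $\bar{x}$ and $\bar{y}$ are the sample means. These are the standard textbook formulas, given here as a reminder and not as a description of SciPy's internals; the symbols $\hat{\beta}_0$ and $\hat{\beta}_1$ are notation introduced for this note.
\begin{align*}
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, &
\hat{\beta}_0 &= \bar{y}-\hat{\beta}_1\bar{x}, \\
r &= \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}.
\end{align*}
Here $\hat{\beta}_1$ is the slope, $\hat{\beta}_0$ the intercept and $r$ the correlation coefficient. For a fit with a single predictor, the coefficient of determination on the fitted data equals $r^2$.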
The script loads the data set, converts the values from strings to floating-point numbers, and partitions the data set into five folds for cross-validation. Then, for each predictor attribute, it fits a linear regression model and tests that model against the testing portion of the partitioned data. Finally, the script ranks the predictors by correlation and by accuracy and picks the ten with the highest scores for each of the two metrics.
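The steps above can be sketched as follows. This is a simplified illustration written in plain Python and is not the script from the repository: the function names, the dictionary layout of \texttt{columns}, the shuffling seed, and the use of the mean absolute error on the held-out fold are all assumptions made for this sketch. The fit itself is an ordinary least-squares line, which is what a one-predictor linear regression computes.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0  # constant predictor: fall back to a flat line
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the line on the points (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0


def mean_abs_error(xs, ys, slope, intercept):
    """Average absolute difference between known and predicted values."""
    return sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)) / len(ys)


def rank_features(columns, target, k=5, top=10, seed=0):
    """Return up to `top` tuples (name, mean R-squared, mean held-out error).

    `columns` maps a feature name to its list of values and `target` is the
    list of values to predict (both layouts are assumptions of this sketch).
    R-squared is measured on the training part of each split, the error on
    the held-out fold; both are averaged over the k folds.
    """
    indices = list(range(len(target)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k partitions of near-equal size
    results = []
    for name, xs in columns.items():
        r2_sum = err_sum = 0.0
        for held_out in folds:
            held = set(held_out)
            train = [i for i in indices if i not in held]
            train_x = [xs[i] for i in train]
            train_y = [target[i] for i in train]
            test_x = [xs[i] for i in held_out]
            test_y = [target[i] for i in held_out]
            slope, intercept = fit_line(train_x, train_y)
            r2_sum += r_squared(train_x, train_y, slope, intercept)
            err_sum += mean_abs_error(test_x, test_y, slope, intercept)
        results.append((name, r2_sum / k, err_sum / k))
    results.sort(key=lambda item: item[1], reverse=True)
    return results[:top]


if __name__ == "__main__":
    rng = random.Random(1)
    target = [float(i) for i in range(50)]
    columns = {
        "signal": [2.0 * y + 1.0 for y in target],  # exact linear relation
        "noise": [rng.random() for _ in target],    # unrelated values
    }
    for name, r2, err in rank_features(columns, target):
        print(name, round(r2, 3), round(err, 3))
```

In the synthetic example at the bottom, the feature that is an exact linear function of the target is ranked first, while the unrelated feature receives a much lower score.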
To evaluate how effectively an attribute predicts the target feature, I employed two distinct metrics: first, the R-squared, also known as the coefficient of determination \cite{zhang2017coefficient}, and second, the accuracy of the model when its output is compared against the known value for the training sample. The accuracy is calculated as the inverted error. The error is the difference between the expected value (the known value of the target feature in the training data) and the value the model produces when fed the training data as input.
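Written out, with $y_i$ the known target value of sample $i$, $\hat{y}_i$ the value predicted by the model, $\bar{y}$ the mean of the known values and $n$ the number of samples, the coefficient of determination is
\[
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}.
\]
The exact way in which the error is inverted is not spelled out here. One possible reading, stated purely as an assumption and with $e$ and $a$ as notation introduced for this note, is
\[
e = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i-\hat{y}_i\rvert, \qquad a = \frac{1}{1+e},
\]
which maps a zero error to $a = 1$ and increasingly large errors to values approaching 0.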
\section {Results}
\label{results}
In summary, I analyzed a comprehensive data set from Twitter to find attributes that could serve as predictors of the amount of engagement in the comment sections. I fitted a linear regression model for each feature, cross-validated the results across 5 partitions of the data, averaged them, and picked the ten most significant features according to two different metrics. The resulting rankings can be found in Table~\ref{448570} and Table~\ref{987099}.
According to the ranking based on the average R-squared values, the feature with the strongest correlation is the sixth observation of \textit{Number of Created Discussions} (\textit{NCD\_6}), followed by the sixth observation of \textit{Number of Active Discussion} (\textit{NAD\_6}) and of \textit{Number of Atomic Containers} (\textit{NAC\_6}). A coefficient of determination close to 1, as is the case for \textit{NCD\_6} with approximately 0.913, indicates a strong linear relationship between the predictor and the target attribute.
When the models are applied to the testing data, the three highest-ranked features are exactly the same as in the R-squared ranking. From fourth place onward, the two rankings differ somewhat. Still, of the ten features in the R-squared ranking, only one (\textit{NCD\_4}) is absent from the accuracy ranking, and vice versa (\textit{NA\_5}).\selectlanguage{english}
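The overlap between the two rankings can be checked directly against the attribute names reported in the two tables of this section. The short sketch below copies those names and compares them as sets; the variable names are illustrative.

```python
# Attribute names copied from the two rankings reported in this section.
r2_ranking = ["NCD_6", "NAD_6", "NAC_6", "NCD_5", "NAD_5",
              "NAC_5", "NA_6", "NCD_1", "NAD_1", "NCD_4"]
accuracy_ranking = ["NCD_6", "NAD_6", "NAC_6", "NA_6", "NAD_5",
                    "NCD_5", "NAC_5", "NA_5", "NAD_1", "NCD_1"]

# Features that appear in one ranking but not in the other.
only_in_r2 = sorted(set(r2_ranking) - set(accuracy_ranking))
only_in_accuracy = sorted(set(accuracy_ranking) - set(r2_ranking))

# Whether the first three places agree in both rankings.
same_top_three = r2_ranking[:3] == accuracy_ranking[:3]

print(only_in_r2)        # ['NCD_4']
print(only_in_accuracy)  # ['NA_5']
print(same_top_three)    # True
```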
\begin{table}[h!]
\centering
\normalsize\begin{tabulary}{1.0\textwidth}{CCC}
Rank & Attribute & Average $R^2$ \\
1 & NCD\_6 & 0.91 \\
2 & NAD\_6 & 0.91 \\
3 & NAC\_6 & 0.91 \\
4 & NCD\_5 & 0.85 \\
5 & NAD\_5 & 0.85 \\
6 & NAC\_5 & 0.84 \\
7 & NA\_6 & 0.82 \\
8 & NCD\_1 & 0.79 \\
9 & NAD\_1 & 0.79 \\
10 & NCD\_4 & 0.79 \\
\end{tabulary}
\caption{{Ranking of the ten features with the highest coefficient of determination
{\label{448570}}%
}}
\end{table}\selectlanguage{english}
\begin{table}[h!]
\centering
\normalsize\begin{tabulary}{1.0\textwidth}{CCC}
Rank & Attribute & Average Accuracy \\
1 & NCD\_6 & 0.02 \\
2 & NAD\_6 & 0.02 \\
3 & NAC\_6 & 0.02 \\
4 & NA\_6 & 0.02 \\
5 & NAD\_5 & 0.02 \\
6 & NCD\_5 & 0.02 \\
7 & NAC\_5 & 0.02 \\
8 & NA\_5 & 0.01 \\
9 & NAD\_1 & 0.01 \\
10 & NCD\_1 & 0.01 \\
\end{tabulary}
\caption{{Ranking of the ten features which produced models with highest accuracy
for unseen data
{\label{987099}}%
}}
\end{table}
By plotting the linear regression model for the most significant feature, \textit{Number of Created Discussions}, over the testing data for each fold, it is possible to observe that the predictions hold up reasonably well even on unseen data. These charts can be found in Figures~\ref{793133}, \ref{898146}, \ref{346971}, \ref{478810} and \ref{673896}.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/r2-NCD-6-00/r2-NCD-6-00}
\caption{{NCD\_6 plotted over Fold 01
{\label{793133}}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/r2-NCD-6-04/r2-NCD-6-04}
\caption{{NCD\_6 plotted over Fold 02
{\label{898146}}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/r2-NCD-6-03/r2-NCD-6-03}
\caption{{NCD\_6 plotted over Fold 03
{\label{346971}}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/r2-NCD-6-02/r2-NCD-6-02}
\caption{{NCD\_6 plotted over Fold 04
{\label{478810}}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/r2-NCD-6-01/r2-NCD-6-01}
\caption{{NCD\_6 plotted over Fold 05
{\label{673896}}%
}}
\end{center}
\end{figure}
\subsection {Observations}
\label{obs}
The chosen metrics produced promising results: using the features with high R-squared values to predict the target feature on unseen data resulted in similar performance, even though the accuracy values were far from the highest possible value, 1, which would mean a perfect fit.
The ability of a model to perform on unseen data suggests that the coefficient of determination is a good metric for identifying significant attributes in a data set when using linear regression. Therefore, for the purpose of predicting the level of engagement on Twitter, the Number of Created Discussions is a good attribute to look at.
For future work, an option would be the analysis of precision when using multiple predictors at once.
\selectlanguage{english}
\FloatBarrier
\nocite{*}
\bibliographystyle{unsrt}
\bibliography{bibliography/converted_to_latex.bib%
}
\end{document}