\documentclass{article}
\usepackage{fullpage}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage{xcolor}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage[natbibapa]{apacite}
\usepackage{eso-pic}
\AddToShipoutPictureBG{\AtPageLowerLeft{\includegraphics[scale=0.7]{powered-by-Authorea-watermark.png}}}
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{subfigure}
\usepackage{nicefrac}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{placeins} % provides \FloatBarrier, used before the bibliography
\newcommand{\calG}{\mathcal{G}}
\newcommand{\calN}{\mathcal{N}}
\newcommand{\calE}{\mathcal{E}}
\newcommand{\calL}{\mathcal{L}}
\begin{document}
\title{About Bayes}
%Unveiling Chagas with Big Data}
\author[ ]{Juan de Monasterio}
\affil[ ]{}
\vspace{-1em}
\date{}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\section{MacKay - Information, Compression and Probability}
Probabilities can be viewed as frequencies of outcomes of an event, or they can be used to describe degrees of belief in outcomes.
By the Cox axioms of consistency, degrees of belief can be mapped onto probabilities if they satisfy the following conditions:\\
Degrees of belief can be ordered and the ordering is transitive, i.e. if $B(x) \geq B(y)$ and $B(y) \geq B(z)$ then $B(x) \geq B(z)$. \\
The degree of belief in $x$ and the degree of belief in its negation $\overline{x}$ are related, i.e. there exists a function $f$ such that $B(x) = f(B(\overline{x}))$. \\
The degree of belief in the conjunction $x, y$ ($x$ and $y$) is related to the degree of belief in the conditional proposition $x | y$ and to $B(y)$, i.e. there exists a function $g$ such that $B(x, y) = g(B(x | y), B(y))$. \\
Here $x$ is a proposition with a true/false outcome, $B(x)$ is the degree of belief in that proposition, the negation of $x$ is $\overline{x}$, and the degree of belief in $x$ given that $y$ is true is $B(x | y)$.
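As a brief illustration of what these conditions buy us: the standard result attributed to Cox is that beliefs satisfying them can be rescaled into a function $P$ that obeys the usual rules of probability. Writing $P$ for the rescaled belief (a symbol used here for illustration), the functions $f$ and $g$ take the familiar forms
\[
P(\overline{x}) = 1 - P(x), \qquad P(x, y) = P(x \mid y)\, P(y),
\]
that is, the sum rule for a proposition and its negation, and the product rule for a conjunction.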
The Bayesian view is subjective in the sense that probabilities depend on assumptions: one cannot make inferences without making assumptions.
In this way probabilities can be used to describe different assumptions and to make inferences given those assumptions.
\textbf{Forward probability problems}: these involve a generative model, i.e. a description of the process assumed to have generated the data at hand. For example, drawing white and black balls from urns. The task is to compute the probability distribution of the data, or some quantity derived from it (such as its expectation or variance).
\textbf{Inverse probability problems}: these also involve a generative model, but instead of computing the probability distribution of the quantity produced by the process, one computes the conditional probability of one or more unobserved variables in the process, given the observed variables, i.e. the data.
\textbf{Prior probability}: the probability assigned to a belief before the ``evidence'' is taken into account, i.e. the probability distribution of the parameters. It is the marginal probability of that proposition.
\textbf{Likelihood function (of the parameters)}: $P(x|\theta)$ is the conditional probability of the data $x$ given the parameters $\theta$, but it is always regarded as a function of $\theta$ with $x$ held fixed. Observe that, as a function of $\theta$, it is not a probability distribution, since it need not sum or integrate to 1. If instead we fix $\theta$, then $P(x|\theta)$ is indeed a probability distribution over $x$.
Never say ``the likelihood of the data''; say the likelihood of the parameters.
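A small worked example may help here. Assume, purely for illustration, a single coin flip $x \in \{0, 1\}$ with $P(x = 1 | \theta) = \theta$ and $\theta \in [0, 1]$. For fixed $\theta$, summing over the data gives a proper probability distribution, whereas for the fixed observation $x = 1$, integrating over the parameter does not give 1:
\[
\sum_{x \in \{0,1\}} P(x | \theta) = \theta + (1 - \theta) = 1,
\qquad
\int_0^1 P(x = 1 | \theta)\, d\theta = \int_0^1 \theta \, d\theta = \frac{1}{2}.
\]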
\textbf{Posterior probability}: the probability of the parameters given the data, $P(\theta|x)$.
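The prior, the likelihood and the posterior are tied together by Bayes' theorem, written below for a discrete parameter. The urn example that follows is a hypothetical illustration with made-up numbers: urn $A$ holds 3 black balls and 1 white ball, urn $B$ holds 1 black and 3 white, each urn is chosen with prior probability $1/2$, and a single black ball is drawn. The parameter $\theta$ takes values in $\{A, B\}$ and the data $x$ is the colour drawn.
\[
P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{\sum_{\theta'} P(x \mid \theta')\, P(\theta')}
\]
\[
P(A \mid \text{black}) = \frac{\frac{3}{4} \cdot \frac{1}{2}}{\frac{3}{4} \cdot \frac{1}{2} + \frac{1}{4} \cdot \frac{1}{2}} = \frac{3}{4}
\]
Observing a black ball therefore raises the belief in urn $A$ from the prior value $1/2$ to $3/4$.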
\textbf{Hypothesis}: each alternative value of the parameters is treated as a hypothesis.
\textbf{Difference from the classical view}: in the classical view one hypothesizes a value of the model's parameters and then tests that hypothesis (or several of them) to assess its plausibility, whereas in the Bayesian view the different hypotheses are all `marginalized over'.
\textbf{Subjective priors}: in general, we need to make assumptions about the prior probabilities of the parameters. Their values are unknown (or simply fixed from some data), and a model needs to be assigned in order to test the hypotheses. The same goes for the likelihood: it also rests on subjective modelling assumptions, and changing them changes the likelihood function.
\textbf{The likelihood principle}: given a generative model for data $d$ with parameters $\theta$, the likelihood is defined as $P(d | \theta)$. Having observed a particular outcome $d_1$, all inferences and predictions should depend only on the function $P(d_1 | \theta)$, i.e. only on the data at hand, on what actually happened.
\textbf{Shannon information content of an outcome}: let $x$ be an event/outcome; then $h(x)$ is defined to be
$h(x) = \log_2\left(\frac{1}{P(x)}\right)$.
Note that it is measured in bits and that less probable events carry more ``information''.
This number is a measure of the information content of the outcome $x$.
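A few values, computed directly from the definition, show how the information content grows as an outcome becomes less probable (the probabilities are chosen only for illustration):
\begin{align*}
P(x) = 1 &\;\Rightarrow\; h(x) = \log_2 1 = 0 \text{ bits}, \\
P(x) = \tfrac{1}{2} &\;\Rightarrow\; h(x) = \log_2 2 = 1 \text{ bit}, \\
P(x) = \tfrac{1}{8} &\;\Rightarrow\; h(x) = \log_2 8 = 3 \text{ bits}.
\end{align*}
For two independent outcomes, $P(x, y) = P(x) P(y)$, so the information contents add: $h(x, y) = h(x) + h(y)$.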
\textbf{Entropy}: defined as $H(X) = E\left[\log_2\left(\frac{1}{P(x)}\right)\right]$, where $X$ is a random variable and by convention $0 \cdot \log_2\left(\frac{1}{0}\right) = 0$.
It is clear from the definition that $H(X) \geq 0$, with equality only if some outcome $x$ has $P(x) = 1$.
Entropy is maximized when $P$ is uniform, so for any $X$ we have $H(X) \leq \log_2(|A(X)|)$, where $A(X)$ denotes the set of possible outcomes of $X$.
Entropy is additive for two independent random variables, i.e. $H(X,Y) = H(X) + H(Y)$.
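As a worked example, consider a binary variable with $P(x = 1) = p$ (a setting assumed here only for illustration). A fair coin attains the maximum of one bit, while a biased coin carries less:
\begin{align*}
p = \tfrac{1}{2}: \quad & H(X) = \tfrac{1}{2} \log_2 2 + \tfrac{1}{2} \log_2 2 = 1 \text{ bit}, \\
p = \tfrac{1}{10}: \quad & H(X) = \tfrac{1}{10} \log_2 10 + \tfrac{9}{10} \log_2 \tfrac{10}{9} \approx 0.47 \text{ bits}.
\end{align*}
The additivity property follows in a few steps from the definition, using $P(x, y) = P(x) P(y)$ for independent variables:
\begin{align*}
H(X, Y) &= \sum_{x, y} P(x) P(y) \log_2 \frac{1}{P(x) P(y)} \\
&= \sum_{x, y} P(x) P(y) \left[ \log_2 \frac{1}{P(x)} + \log_2 \frac{1}{P(y)} \right] \\
&= \sum_{x} P(x) \log_2 \frac{1}{P(x)} + \sum_{y} P(y) \log_2 \frac{1}{P(y)} = H(X) + H(Y).
\end{align*}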
\textbf{Decomposability of entropy}: if $\{p_1, p_2, \ldots, p_n\}$
\section{Introduction}
Chagas disease is a tropical parasitic epidemic of global reach, spread mostly across 17 Latin American countries. The World Health Organization (WHO) estimates more than six million infected people worldwide~\cite{who2016}. The disease is caused by the \textit{Trypanosoma cruzi} parasite. Most transmissions occur in the endemic regions of the Americas, where \textit{T. cruzi} is spread to humans by insects of the \textit{Triatomine} family (also called ``kissing bugs'', and known by many local names such as ``vinchuca'' in Argentina, Bolivia, Chile and Paraguay, and ``chinche'' in Central America). In recent years, due to globalization and migration, the disease has become a health issue on other continents, particularly in countries that receive Latin American immigrants such as Spain and the United States~\cite{schmunis2010chagas}, making it a global health problem.
A crucial characteristic of the infection is that it may last 10 to 30 years in an individual without being detected~\cite{rassi2012american}, which greatly complicates effective detection and treatment. In effect, about 70\% of individuals with chronic Chagas disease will never develop symptoms, whereas the remaining 30\% will develop life-threatening heart and/or digestive disorders.
Long-term human mobility, particularly seasonal and permanent rural-urban migration, thus plays a key role in the spread of the epidemic~\cite{briceno2009chagas}. Relevant routes of transmission also include blood transfusion and congenital transmission, with an estimated 14,000 newborns infected each year in the Americas~\cite{OPS2006chagas}.
The spatial dissemination of a congenitally transmitted disease sidesteps the available measures to control risk groups, and shows that individuals who have not been exposed to the disease vector should also be included in detection campaigns.
In this work we discuss the use of Call Detail Records (CDRs) for the analysis of mobility patterns and the detection of possible risk zones of Chagas disease in two Latin American countries. This project was performed in collaboration with the \textit{Mundo Sano} Foundation, which provided key health expertise on the subject. We generate predictions of population movements between different regions, providing a proxy for the epidemic spread. Our objective is to show that geolocalized call records are rich in social and individual information, which can be used to determine whether an individual has lived in an epidemic area. We present two case studies, in Argentina and in Mexico, using data provided by mobile phone companies from each country. A discussion of how mobile data was processed is included.
Mobile phone records contain information about the movements of large subsets of the population of a country, which makes them very useful for understanding the spreading dynamics of infectious diseases. They have been used to understand the diffusion of malaria in Kenya~\cite{wesolowski2012quantifying} and in Ivory Coast~\cite{enns2013human}, including the refining of infection models~\cite{chunara2013large}.
The referenced works on Ivory Coast were performed using the D4D (Data for Development) challenge datasets released in 2013. Additional studies based on the Ivory Coast dataset are reviewed in \cite{naboulsi2015mobile}.
However, to the best of our knowledge, this is the first work that leverages mobile phone data to better understand the diffusion of Chagas disease.
\section{Chagas disease in Argentina}
In Argentina, vector control campaigns have been ongoing for more than 50 years as the main epidemic counter-measure. The \textit{Gran Chaco}, situated in the northern part of the country, is home to most of the infected triatomines~\cite{OPS2014mapa}. The ecoregion's low socio-demographic conditions further support the parasite's lifecycle, where domestic interactions between humans, triatomines and animals foster the appearance of new infection cases, particularly in rural and poor areas. The ecoregion is currently hyperendemic for the disease.
The dynamic interaction between triatomine-infested areas and human mobility patterns makes it difficult to track down individuals or spots with a high prevalence of infection or a high transmission risk. Available methods for surveying the state of Chagas disease in Argentina are currently limited to screenings of individuals. The work described here is the first attempt to use mobile phone data to relate migrations and cellphone usage to the spatial structure of the Chagas epidemic.
Recent national estimates indicate that there exist between 1.5 and 2 million individuals carrying the parasite, with more than seven million exposed.
National health systems face many difficulties in treating the disease effectively. Worldwide, fewer than 1\% of infected people are diagnosed and treated (in Argentina, on average, about two thousand people are treated yearly).
Even though governmental programs have been ongoing for years now~\cite{plan_nacional_chagas}, data on the issue is scarce or hardly accessible. This presents a real obstacle to ongoing research and coordination efforts to tackle the disease in the region.
This analysis allowed us to specifically detect outlying communities in the focused regions. Some of these can be seen directly from the previous heatmaps, where the towns of Avellaneda, San Isidro and Parque Patricios have been pinpointed.
\section{Machine Learning Algorithms}
Finally, the results stand as a proof of concept which can be extended to other countries or to diseases with similar characteristics.
\subsection{Random Forests}
Classification algorithms for this first iteration are based on the most common techniques found in the literature for this task. Random Forests, Gradient Boosting and Logistic Regression are standard choices for this kind of task. For fast benchmarking, Multinomial Naive Bayes is also tested, since it is very fast to train.
Where possible, feature importance methods will be used to quantify the contribution of the feature or the interaction of features to the mobility of the users.
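To illustrate why the Multinomial Naive Bayes baseline mentioned above is cheap to evaluate, its textbook decision rule can be sketched as follows. The notation is assumed here and is not taken from the experiments: $c$ is a class label, $x_i$ is the count of feature $i$ for a user, and $\theta_{ci}$ is the estimated probability of feature $i$ under class $c$.
\[
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i} x_i \log \theta_{ci} \right]
\]
This is the posterior over classes up to a normalizing constant, so prediction reduces to one weighted sum per class.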
%\subsection{Maps for Mexico}
\selectlanguage{english}
\FloatBarrier
\bibliographystyle{apacite}
\bibliography{bibliography/converted_to_latex.bib%
}
\end{document}