ROUGH DRAFT authorea.com/118923

# MacKay - Information, Compression and Probability

Probabilities can be viewed as frequencies of outcomes of an event, or they can be used to describe degrees of belief in propositions.

Degrees of belief can be mapped onto probabilities if they satisfy the following Cox axioms of consistency:

1. Degrees of belief are transitive, i.e. if $$B(x) \geq B(y)$$ and $$B(y) \geq B(z)$$ then $$B(x) \geq B(z)$$.
2. The degree of belief in a proposition $$x$$ and in its negation $$\bar{x}$$ are related, i.e. there exists a function $$f$$ such that $$B(x) = f(B(\bar{x}))$$.
3. The degree of belief in a conjunction $$x, y$$ is related to the degree of belief in the conditional proposition $$x | y$$ and to $$B(y)$$, i.e. there exists a function $$g$$ such that $$B(x, y) = g(B(x | y), B(y))$$.

Here $$x$$ is a proposition with a true/false outcome, $$B(x)$$ is the degree of belief in that proposition, $$\bar{x}$$ is the negation of $$x$$, and $$B(x | y)$$ is the degree of belief in $$x$$ given that $$y$$ is true.

The Bayesian view is subjective in the sense that probabilities depend on assumptions, and no inference can be made without assumptions. Probabilities can therefore be used to describe different assumptions and to make inferences conditional on them.

Forward probability problems: these involve a generative model that describes the process assumed to have generated the data at hand. For example, drawing white and black balls from urns. The task is to compute the probability distribution of a quantity produced by the process, or certain of its moments (such as the expectation or the variance).
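As a minimal sketch of a forward problem, assume a hypothetical urn in which a fraction `frac_black` of the balls is black, and draw with replacement. The urn contents, the number of draws and the function name are illustrative assumptions, not taken from the notes. With the generative model fixed, the distribution of the number of black balls drawn and its moments follow directly:

```python
from math import comb

def prob_black_count(n_draws, n_black, frac_black):
    """P(n_black | frac_black, n_draws) for draws with replacement (binomial)."""
    return (comb(n_draws, n_black)
            * frac_black ** n_black
            * (1 - frac_black) ** (n_draws - n_black))

# Assumed setup: 2 black balls out of 10 in the urn, 5 draws with replacement.
frac_black, n_draws = 0.2, 5

# Forward problem: distribution of the number of black balls drawn.
dist = [prob_black_count(n_draws, k, frac_black) for k in range(n_draws + 1)]

# Moments of that distribution.
mean = sum(k * p for k, p in enumerate(dist))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(dist))

print(dist)
print(mean, variance)
```

The mean and variance computed this way should agree with the closed forms $$Nf$$ and $$Nf(1-f)$$ for a binomial distribution.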

Inverse probability problems: these also involve a generative model, but instead of computing the probability distribution of a quantity produced by the process, we compute the conditional probability of one or more unobserved variables in the process, given the observed variables, i.e. the data.

Prior probability: the probability assigned to a hypothesis before the “evidence” is taken into account, i.e. the probability distribution of the parameters. It is the marginal probability of the parameters.

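Likelihood function (of the parameters): $$P(x|\theta)$$ is the conditional probability of the data given the parameters, but as a likelihood it is always taken as a function of $$\theta$$ (the parameters), with the data $$x$$ held fixed. Observe that, as a function of $$\theta$$, it is not a probability distribution, since it does not have to “add up” to 1. If we fix $$\theta$$ instead, then $$P(x|\theta)$$ is indeed a probability distribution over $$x$$. Don’t say “the likelihood of the data”; say “the likelihood of the parameters”!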
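The distinction can be checked numerically. The sketch below assumes a binomial model for the number of black balls in three draws and a hypothetical grid of eleven parameter values. Both are illustrative choices, not part of the notes. Summing over the data at fixed $$\theta$$ gives 1, while summing the likelihood over the grid of $$\theta$$ values at fixed data does not:

```python
from math import comb

def binomial(n_black, n_draws, theta):
    """P(n_black | theta) for n_draws draws with replacement."""
    return (comb(n_draws, n_black)
            * theta ** n_black
            * (1 - theta) ** (n_draws - n_black))

n_draws = 3
thetas = [i / 10 for i in range(11)]  # assumed grid: 0.0, 0.1, ..., 1.0

# Fix theta and vary the data: a probability distribution over outcomes.
sum_over_data = sum(binomial(k, n_draws, 0.5) for k in range(n_draws + 1))

# Fix the data (1 black ball in 3 draws) and vary theta: the likelihood function.
likelihood = [binomial(1, n_draws, t) for t in thetas]
sum_over_theta = sum(likelihood)

print(sum_over_data)   # normalized over the data
print(sum_over_theta)  # not normalized over the parameters
```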

Posterior probability: the probability of the parameters given the data, $$P(\theta|x)$$. By Bayes’ theorem, $$P(\theta|x) = \frac{P(x|\theta)P(\theta)}{P(x)}$$.
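A small inverse problem shows how prior, likelihood and posterior fit together. The setup is an assumption chosen for illustration: eleven urns, where urn $$u$$ holds $$u$$ black balls out of ten, one urn is picked uniformly at random, and 3 black balls are observed in 10 draws with replacement. The posterior over the unobserved urn index is likelihood times prior, normalized by the evidence:

```python
from math import comb

# Assumed setup: urn u (u = 0..10) contains u black balls out of 10.
n_urns = 11
n_draws, n_black = 10, 3  # observed data: 3 black balls in 10 draws

# Prior: every urn is equally probable before seeing the data.
prior = [1 / n_urns] * n_urns

# Likelihood of each urn given the observed data (binomial model).
likelihood = [
    comb(n_draws, n_black) * (u / 10) ** n_black * (1 - u / 10) ** (n_draws - n_black)
    for u in range(n_urns)
]

# Evidence P(data): the normalizing constant in Bayes' theorem.
evidence = sum(l * p for l, p in zip(likelihood, prior))

# Posterior P(u | data) = likelihood * prior / evidence.
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]

most_probable_urn = max(range(n_urns), key=lambda u: posterior[u])
print(posterior)
print(most_probable_urn)
```

Urns 0 and 10 get zero posterior probability, since neither could have produced a mix of black and white balls.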

Hypothesis: each alternative value of the parameters is treated as a hypothesis.

Difference from the classical view: in the classical view one hypothesizes a value for the model’s parameters and then tests that hypothesis (or several of them) to assess its plausibility, whereas in the Bayesian view the different hypotheses are all ‘marginalized over’.

Subjective priors: in general, we need to make assumptions about the prior probabilities of the parameters. Their values are unknown (or simply fixed from some data), so a prior distribution has to be assigned before the hypotheses can be evaluated. The same goes for the likelihood: the choice of generative model is itself a subjective assumption, and a different choice changes the likelihood function.

The likelihood principle: given a generative model $$P(d | \theta)$$ for data $$d$$ given parameters $$\theta$$, and having observed a particular outcome $$d_1$$, all inferences and predictions should depend only on the function $$P(d_1 | \theta)$$, i.e. they depend only on the data at hand, on what actually happened.

Shannon information content of an outcome: let $$x$$ be an outcome; then $$h(x)$$ is defined as $$h(x) = \log_2 \frac{1}{P(x)}$$. Note that it is measured in bits and that less probable outcomes carry more “information”. This number is a measure of the information content of the outcome $$x$$.
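The definition translates directly into a short function. This is only a sketch, and the function name is an illustrative choice:

```python
from math import log2

def information_content(p):
    """Shannon information content, in bits, of an outcome with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    return log2(1 / p)

# Less probable outcomes carry more information.
print(information_content(1.0))    # a certain outcome
print(information_content(0.5))    # a fair coin flip
print(information_content(1 / 8))  # one of eight equally likely outcomes
```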

Entropy: defined as $$H(X) = E\left[\log_2 \frac{1}{P(x)}\right] = \sum_x P(x) \log_2 \frac{1}{P(x)}$$, where $$X$$ is a random variable and by convention $$0 \cdot \log_2 \frac{1}{0} = 0$$. It is clear from the definition that $$H(X) \geq 0$$, with equality if and only if $$P(x) = 1$$ for some outcome $$x$$.

Entropy is maximized when $$P(x)$$ is uniform, so for any $$X$$ we have $$H(X) \leq \log_2 |A_X|$$, where $$A_X$$ is the set of possible outcomes of $$X$$.
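These properties can be checked on small examples. The three distributions below are assumptions chosen for illustration, each over four outcomes:

```python
from math import log2

def entropy(probs):
    """Entropy in bits; terms with p = 0 contribute 0 by convention."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
certain = [1.0, 0.0, 0.0, 0.0]

upper_bound = log2(len(uniform))  # log2 of the number of outcomes

print(entropy(uniform), upper_bound)  # the uniform distribution attains the bound
print(entropy(skewed))                # strictly below the bound
print(entropy(certain))               # zero: the outcome is known in advance
```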

Entropy is additive for two independent random variables, i.e. $$H(X,Y) = H(X) + H(Y)$$ when $$P(x,y) = P(x)P(y)$$.
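A quick numerical check of additivity, using two hypothetical binary variables. For contrast, it also includes a dependent pair in which $$Y$$ is a copy of $$X$$, where additivity fails:

```python
from math import log2

def entropy(probs):
    """Entropy in bits; terms with p = 0 contribute 0 by convention."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

# Assumed marginals of two independent binary variables.
px = [0.5, 0.5]
py = [0.9, 0.1]

# Independence: the joint distribution is the product of the marginals.
joint_independent = [a * b for a in px for b in py]

# Dependent case: Y is a copy of a fair coin X, so both marginals are
# [0.5, 0.5] but only the outcomes (0, 0) and (1, 1) can occur.
joint_copy = [0.5, 0.0, 0.0, 0.5]

print(entropy(joint_independent), entropy(px) + entropy(py))
print(entropy(joint_copy), 2 * entropy(px))
```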

Decomposability of entropy: if $$\{p_1, p_2, \ldots, p_n\}$$

# Introduction

Chagas disease is a tropical parasitic epidemic of global reach, spread mostly across 17 Latin American countries. The World Health Organization (WHO) estimates more than six million infected people worldwide (WHO 2016). The disease is caused by the Trypanosoma cruzi parasite. Most transmissions occur in the endemic regions of the Americas, where T. cruzi is spread to humans by insects of the Triatomine family (also called “kissing bugs”, and known by many local names such as “vinchuca” in Argentina, Bolivia, Chile and Paraguay, and “chinche” in Central America). In recent years, due to globalization and migration, the disease has become a health issue on other continents, particularly in countries that receive Latin American immigrants such as Spain and the United States (Schmunis 2010), making it a global health problem.

A crucial characteristic of the infection is that it may last 10 to 30 years in an individual without being detected (Rassi 2012), which greatly complicates effective detection and treatment. In effect, about 70% of individuals with chronic Chagas disease will never develop symptoms, whereas the remaining 30% will develop life-threatening heart and/or digestive disorders. Long-term human mobility, particularly seasonal and permanent rural-urban migration, thus plays a key role in the spread of the epidemic (Briceño-León 2009). Relevant routes of transmission also include blood transfusion and congenital transmission, with an estimated 14,000 newborns infected each year in the Americas (OPS 2006). The spatial dissemination of a congenitally transmitted disease sidesteps the available measures to control risk groups, and shows that individuals who have not been exposed to the disease vector should also be included in detection campaigns.

In this work we discuss the use of Call Detail Records (CDRs) for the analysis of mobility patterns and the detection of possible risk zones of Chagas disease in two Latin American countries. This project was performed in collaboration with the Mundo Sano Foundation, which provided key health expertise on the subject. We generate predictions of population movements between different regions, providing a proxy for the epidemic spread. Our objective is to show that geolocalized call records are rich in social and individual information, which can be used to determine whether an individual has lived in an epidemic area. We present two case studies, in Argentina and in Mexico, using data provided by mobile phone companies from each country. A discussion of how the mobile data was processed is included.

Mobile phone records contain information about the movements of large subsets of a country’s population, which makes them very useful for understanding the spreading dynamics of infectious diseases. They have been used to understand the diffusion of malaria in Kenya (Wesolowski 2012) and in Ivory Coast (Enns 2013), including the refining of infection models (Chunara 2013). The referenced works on Ivory Coast were performed using the D4D (Data for Development) challenge datasets released in 2013. Additional studies based on the Ivory Coast dataset are reviewed in (Naboulsi 2015). However, to the best of our knowledge, this is the first work that leverages mobile phone data to better understand the diffusion of Chagas disease.

# Chagas disease in Argentina

In Argentina, vector control campaigns have been ongoing for more than 50 years as the main epidemic counter-measure. The Gran Chaco, situated in the northern part of the country, is home to most of the infected triatomines (OPS 2014). The ecoregion’s poor socio-demographic conditions further support the parasite’s lifecycle: domestic interactions between humans, triatomines and animals foster the appearance of new infection cases, particularly in rural and poor areas. As of today, the ecoregion is hyperendemic for the disease.