Carlos Sarraute edited section_Data_sources_Our_data__.tex  almost 8 years ago

Commit id: 06bc8abba34afe30911f6a7681a4e0505e3ca404

\section{Data sources}

Our data source is anonymized traffic information from two mobile operators, one in Argentina and one in Mexico.

For our purposes, each record is represented as a tuple $\left< i, j, t, d, l \right>$, where user $i$ is the caller, user $j$ is the callee, $t$ is the date and time of the call, $d$ is the direction of the call (incoming or outgoing, with respect to the mobile operator client), and $l$ is the location of the tower that routed the communication.

The dataset does not include personal information about the users, such as name or phone number. The users' privacy is protected by identifying users only by their hashed ID, with encryption keys managed exclusively by the telephone company.

As data preprocessing, to exclude outlying users such as call centers or dead phones, we automatically filtered out the users whose monthly cellphone use did not surpass a minimum number of calls $\mu$ or exceeded a maximum number $M$. In both datasets, we used $\mu = 5$ and $M = 400$.

We then aggregate the call records for a five-month period into an edge list $(n_i, n_j, w_{i,j})$, where nodes $n_i$ and $n_j$ represent users $i$ and $j$ respectively, and $w_{i,j}$ is a boolean value indicating whether these two users have communicated at least once within the five-month period. This edge list represents our mobile graph $\calG = \left< \calN, \calE \right>$, where $\calN$ denotes the set of nodes (users) and $\calE$ the set of communication links. We note that only a subset $\calN_C$ of the nodes in $\calN$ are clients of the mobile operator; the remaining nodes $\calN \setminus \calN_C$ are users that communicated with users in $\calN_C$ but are not themselves clients of the mobile operator.

Since geolocation information is available only for users in $\calN_C$, in the analysis we considered the graph $\calG_C = \left< \calN_C, \calE_C \right>$ of communications between clients of the operator.
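
The preprocessing filter and the graph construction described above can be sketched in formulas. The notation $c_i$ (the monthly number of calls of user $i$) is introduced here only for illustration. We also assume that ``did not surpass $\mu$'' means $c_i \le \mu$, that ``exceeded $M$'' means $c_i > M$, and that the link between two users is symmetric; the text does not state these boundary conventions explicitly. Under these assumptions, a user $i$ is retained when
\begin{equation*}
\mu < c_i \le M ,
\end{equation*}
the edge weights are
\begin{equation*}
w_{i,j} = \left\{
\begin{array}{ll}
1 & \mbox{if some retained record has caller $i$ and callee $j$, or caller $j$ and callee $i$,} \\
0 & \mbox{otherwise,}
\end{array}
\right.
\end{equation*}
and the links of the client graph are those with both endpoints among the operator's clients:
\begin{equation*}
\calE_C = \left\{ (n_i, n_j) \in \calE \; : \; n_i \in \calN_C , \; n_j \in \calN_C \right\} .
\end{equation*}
With $\mu = 5$ and $M = 400$, for example, a user with $5$ calls in a month would be filtered under this reading, while a user with $6$ or with $400$ calls would be kept.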
\paragraph{Argentina.} We used data from a mobile operator in Argentina, collected over a period of 5 months. The raw data logs contain around 50 million calls per day.

\paragraph{Mexico.} The Mexican data source is an anonymized dataset from a national mobile phone operator. Data is available for every call made within a period of 19 months from January 2014 to September 2015. The raw logs contain about 12 million calls per day for more than 8 million users who accessed the telecommunication company's (TelCo) network to place a call. This means that users from other companies are logged, as long as one of the users registering the call is a client of the operator. In practice, we only considered call detail records (CDRs) between TelCo users, since geolocation was only possible for this group.
%
Information logged for each call included the duration and timestamp of the call, the users participating in the call, and the ID of the antenna that transmitted the call to the TelCo client.
% (Example of a raw data table, or simpleformat?)