Paf Paris edited untitled.tex  about 8 years ago

Commit id: 7dd3ba1185a6cc71e495eb9236ba7b8453223ffe


\title{Extracts}
\abstract{Extracts from various articles, appearing in a rather random order. Will tidy up later...}

\section{Introduction}

From Michael Stonebraker's \textit{Red Book} \cite{red-book}:

The problem was initially referred to as Extract--Transform--Load (ETL). The basic methodology was to:

\begin{itemize}
\item Extract the data from the operational source systems.
\item Transform it into a common representation.
\item Load it into a data warehouse.
\end{itemize}

According to the author, the real problem is an end-to-end system that has to be tested on real-world enterprise data, to make sure that it solves a real problem and not just a ``point problem''.

\section{Ingest}

Ingesting boils down to parsing data structures. Such connectors are generally expensive to construct, but various ones are available (haven't found any yet!). An interesting challenge would be to semi-automatically generate such connectors.

A common trend and area of active research is the extraction of data from the Web, either from web tables or web forms (Google WebTables).

\section{Data Transformation}

See \textit{Potter's Wheel: An Interactive Data Cleaning System} and \textit{DataXFormer: Leveraging the Web for Semantic Transformations}.

\section{Data Cleaning}

See: \textit{Trends in Cleaning Relational Data}; \textit{Tamer: Data Curation at Scale}; \textit{Holistic Data Cleaning: Putting Violations into Context}.

Formal rules are used to define how data should ``look''. These rules vary in expressiveness and can be distinguished into:

\begin{itemize}
\item Data currency: timeliness.
\end{itemize}

\section{Schema Matching}

\section{Entity Consolidation}

\section{Privacy Preserving Record Linkage}

Karakasidis (and others) tackles the special case where the data to be integrated is shared among parties and privacy-preservation issues arise. \textit{Privacy Preserving Record Linkage} is the problem where data from two or more heterogeneous data sources are integrated in such a way that, after the integration process, the only extra knowledge each source gains concerns the records which are common to the participating sources.

Related to the above is \textit{differential privacy}: a methodology that lets us concretely reason about privacy-budgeted data analysis (for nice examples justifying this need, refer to \cite{social-genome-2014-chang-kum}). An algorithm $\mathcal{A}$ satisfies $\epsilon$-differential privacy if, for any two datasets $D_1$ and $D_2$ that differ in one row (they are \textit{close}) and any set of outputs $S$, the likelihood of the algorithm producing the same output from either dataset differs by a factor of at most $e^\epsilon$:
\[
\Pr[\mathcal{A}(D_1) \in S] \le e^{\epsilon} \cdot \Pr[\mathcal{A}(D_2) \in S].
\]

\section{Extras}

MacroBase \cite{macrobase-2015} (under review) proposes an end-to-end monitoring system for IoT devices. Its main idea is to identify outliers in a stream of input from sensors belonging to the same family, in order to pinpoint devices that may have failed, or other interesting events: identify and highlight unusual and surprising data (analytic monitoring). MacroBase consists of a customizable pipeline of outlier detection, summarization and ranking operators. To increase efficiency and accuracy it implements several cross-layer optimizations across \textit{robust estimation}, \textit{pattern mining} and \textit{sketching procedures}.
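As a rough illustration of the robust-estimation flavor of outlier detection (a minimal sketch, not MacroBase's actual code; the function name, threshold, and sample data are illustrative assumptions), the following detector scores readings against the median and the median absolute deviation (MAD), so that a few extreme sensor values cannot hide by inflating the scale estimate the way they inflate a mean/stddev z-score:

```python
def mad_outliers(readings, threshold=3.0):
    """Flag readings whose robust z-score exceeds `threshold`.

    Uses median and MAD instead of mean and standard deviation:
    both estimators tolerate up to 50% arbitrarily corrupted points.
    """
    n = len(readings)
    xs = sorted(readings)
    median = xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2

    devs = sorted(abs(x - median) for x in readings)
    mad = devs[n // 2] if n % 2 else (devs[n // 2 - 1] + devs[n // 2]) / 2

    if mad == 0:  # degenerate case: more than half the points are identical
        return [x for x in readings if x != median]

    # 1.4826 * MAD is a consistent estimate of the stddev for normal data
    return [x for x in readings
            if abs(x - median) / (1.4826 * mad) > threshold]

# One failing temperature sensor among a family of healthy ones:
temps = [21.0, 21.3, 20.9, 21.1, 21.2, 48.5, 21.0]
print(mad_outliers(temps))  # only the 48.5 reading is flagged
```

With a plain z-score the 48.5 reading would drag the mean and standard deviation toward itself, shrinking its own score; the median/MAD pair is immune to that, which is presumably why robust estimation pays off in this setting.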
The design choices arise from the observation that IoT data has some distinct properties:

\begin{enumerate}
\item Data produced by IoT applications often exhibits regular structure (comes from an ordinary distribution).