Data Integration - Preliminaries
Appearing in a rather random order. Will tidy up later...
From Michael Stonebraker’s Red Book (citation not found: red-book): The problem was initially referred to as Extract - Transform - Load (ETL). The basic methodology was to:
construct a global schema
write a connector to do the extraction
write transformations to cleanup the data
load the data in the data warehouse
Experience (citation not found: red-book) shows that the previous methodology scales poorly. Three big issues cause this:
A global schema is needed upfront
Too much manual labor
Data integration and cleaning is fundamentally difficult. Data is dirty - about 10% of your data is incorrect. Deduplication is hard.
Data Integration consists of the following steps:
Ingest: Locate and capture data source. Parse whatever data structure is used for storage. Read and store in a common place.
Clean: Find and rectify data errors (may have to ask an expert).
Transform: Euros to dollars, airport code to city name, date of birth to age.
Schema Integration: matching wages to salary, likes to hobbies, person to employee.
Data consolidation (data deduplication): Find and merge duplicates.
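The five steps above can be strung together as a toy pipeline. Everything below is a hypothetical sketch: the sample records, the field mapping (wages to salary, likes to hobbies) and the exchange rate are made-up illustrations, not anything from the cited work.

```python
# A toy end-to-end sketch of the five integration steps.
# Sample data, field mappings and the rate are hypothetical.

EUR_TO_USD = 1.08  # assumed exchange rate for the transform step

def ingest(raw_csv_lines):
    """Ingest: parse a source-specific format into common dict records."""
    header = raw_csv_lines[0].split(",")
    return [dict(zip(header, line.split(","))) for line in raw_csv_lines[1:]]

def clean(records):
    """Clean: drop records with obviously invalid values."""
    return [r for r in records if r.get("wages", "").replace(".", "", 1).isdigit()]

def transform(records):
    """Transform: e.g. euros to dollars."""
    for r in records:
        r["wages"] = round(float(r["wages"]) * EUR_TO_USD, 2)
    return records

def integrate_schema(records):
    """Schema integration: map source attributes onto the global schema."""
    mapping = {"wages": "salary", "likes": "hobbies"}
    return [{mapping.get(k, k): v for k, v in r.items()} for r in records]

def deduplicate(records):
    """Consolidation: merge duplicates, here keyed naively on name."""
    seen = {}
    for r in records:
        seen.setdefault(r["name"], r)
    return list(seen.values())

raw = ["name,wages,likes", "ada,50000,chess", "ada,50000,chess", "bob,oops,golf"]
result = deduplicate(integrate_schema(transform(clean(ingest(raw)))))
print(result)  # one clean, deduplicated record for "ada" in the global schema
```

In a real system each stage is of course far harder (the Clean step alone is a research area, see below); the point is only the shape of the end-to-end flow.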
According to the author, the real problem is building an end-to-end system, tested on real-world enterprise data, to make sure that it solves a real problem and not just a “point problem”.
Ingesting boils down to parsing data structures. Such connectors are generally expensive to construct, but several are available (haven’t found any yet!). An interesting challenge would be to semi-automatically generate such connectors. A common trend and area of active research is the extraction of data from the Web, either from web tables or web forms (Google web tables).
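To make the semi-automatic connector idea concrete, here is a toy sketch: given a text sample of a source, guess the delimiter and infer per-column types, then emit a parse function for that format. The delimiter heuristic and the int/float/str fallback are hypothetical simplifications, not any published technique.

```python
def generate_connector(sample_text):
    """Toy stand-in for semi-automatic connector generation:
    guess the delimiter and infer per-column types from a sample,
    then return a parse function for that source format."""
    lines = sample_text.splitlines()
    delim = max(",;|\t", key=lines[0].count)     # crude delimiter guess
    rows = [line.split(delim) for line in lines]
    header, body = rows[0], rows[1:]

    def infer(values):                            # try int, then float, else str
        for cast in (int, float):
            try:
                for v in values:
                    cast(v)
                return cast
            except ValueError:
                pass
        return str

    types = [infer(col) for col in zip(*body)]

    def parse(text):
        return [{h: t(v) for h, t, v in zip(header, types, line.split(delim))}
                for line in text.splitlines()]
    return parse

parse = generate_connector("city;population\nAthens;664046\nPatras;213984")
records = parse("Volos;144449")
print(records)  # [{'city': 'Volos', 'population': 144449}]
```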
see Potter’s Wheel: An Interactive Data Cleaning System, DataXFormer: Leveraging the Web for Semantic Transformations.
see: Trends in Cleaning Relational Data, Tamer: Data Curation at Scale, Holistic Data Cleaning: Putting Violations into Context. These works use formal rules to define how the data should look. The rules vary in expressiveness and include formalisms such as:
Conditional Functional Dependencies.
These declarative forms can represent relations and constraints among rows and columns in the dataset.
Based on the above-mentioned formalisms, identify any violations of the specified rules.
Devise a plan to rectify them. The plan is usually based on a metric that minimizes the number of repair operations to be applied.
Apply the repairs and re-iterate, making sure the repairs haven’t introduced new conflicts in the data.
Stop the cleaning procedure when the dataset conforms to the specified rules, or when a minimum error-tolerance threshold is reached.
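The detect-repair-iterate loop above can be sketched with a single hypothetical conditional functional dependency (within one zip code, city must be consistent: [zip] -> [city]) and a majority-vote repair, which changes the fewest cells. The rule, data and repair policy are illustrative only.

```python
from collections import Counter

# Hypothetical CFD: [zip] -> [city]. Repair by majority vote,
# which minimizes the number of changed cells.

def violations(records):
    """Indices of records whose city disagrees with the majority
    city for their zip code, plus the majority table."""
    by_zip = {}
    for r in records:
        by_zip.setdefault(r["zip"], []).append(r["city"])
    majority = {z: Counter(cs).most_common(1)[0][0] for z, cs in by_zip.items()}
    bad = [i for i, r in enumerate(records) if r["city"] != majority[r["zip"]]]
    return bad, majority

def clean(records, max_iters=5):
    for _ in range(max_iters):              # re-iterate after each repair pass
        bad, majority = violations(records)
        if not bad:                         # dataset conforms to the rule: stop
            break
        for i in bad:                       # apply the minimal repairs
            records[i]["city"] = majority[records[i]["zip"]]
    return records

data = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},        # violates the CFD
]
clean(data)
print(data[2]["city"])  # "New York"
```

The second pass of the loop is what checks that the repairs themselves did not introduce new violations.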
From Trends in Cleaning Relational Data
Data consistency: Validity and integrity of data.
Data accuracy: How accurate are data values in the database with respect to the true values.
Data completeness: Is all the data needed to meet the information needs available?
Data currency: Timeliness.
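Some of these dimensions can be approximated with simple per-dataset scores. A sketch for completeness and currency; the required fields and the 365-day freshness window are hypothetical choices, not from the cited survey.

```python
from datetime import date

# Toy quality metrics; REQUIRED fields and the freshness window
# are hypothetical.
REQUIRED = ("name", "salary", "updated")

def completeness(records):
    """Fraction of required cells that are non-null."""
    filled = sum(1 for r in records for f in REQUIRED if r.get(f) is not None)
    return filled / (len(records) * len(REQUIRED))

def currency(records, today, max_age_days=365):
    """Fraction of records updated within the freshness window."""
    fresh = sum(1 for r in records
                if r["updated"] and (today - r["updated"]).days <= max_age_days)
    return fresh / len(records)

data = [
    {"name": "ada", "salary": 54000, "updated": date(2016, 1, 10)},
    {"name": "bob", "salary": None,  "updated": date(2014, 3, 2)},
]
today = date(2016, 2, 1)
print(completeness(data))     # 5 of 6 required cells filled
print(currency(data, today))  # 1 of 2 records is fresh
```

Consistency and accuracy are harder to score mechanically, since they need rules (see the CFD discussion above) or ground truth.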
Karakasidis (and others) tackle the special case where the data to be integrated is shared among parties and privacy-preservation issues arise. Privacy Preserving Record Linkage is the problem of integrating data from two or more heterogeneous sources in such a way that, after the integration process, the only extra knowledge each source gains relates to the records that are common to the participating sources. Related to the above is differential privacy, a methodology that lets us reason concretely about privacy-budgeted data analysis (for nice examples justifying this need, refer to (citation not found: social-genome-2014-chang-kum)). An algorithm satisfies differential privacy if, for any two datasets D1 and D2 that differ in one row (they are close), the ratio of the probabilities of the algorithm producing the same output on D1 and on D2 is bounded by \(e^\epsilon\).
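The textbook way to satisfy this definition for a numeric query is the Laplace mechanism: add noise with scale equal to the query’s sensitivity divided by \(\epsilon\). A sketch for a counting query, whose sensitivity is 1 because neighboring datasets differ in one row; the data and the salary predicate are made up.

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) as the difference of two Exp(1) draws."""
    e1 = -math.log(1.0 - random.random())   # argument lies in (0, 1]
    e2 = -math.log(1.0 - random.random())
    return scale * (e1 - e2)

def private_count(records, predicate, epsilon):
    """Counting query with sensitivity 1: adding or removing one row
    changes the true count by at most 1, so Laplace noise of scale
    1/epsilon yields epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical usage: how many records have salary > 50000?
data = [{"salary": s} for s in (40000, 52000, 61000, 48000)]
noisy = private_count(data, lambda r: r["salary"] > 50000, epsilon=0.5)
print(noisy)  # the true count is 2, plus Laplace noise of scale 2
```

Each such query spends \(\epsilon\) from the privacy budget, which is what makes “privacy-budgeted data analysis” a concrete accounting exercise.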
MacroBase (citation not found: macrobase-2015) (under review) proposes an end-to-end monitoring system for IoT devices. Its main idea is to identify outliers in a stream of input from sensors belonging to the same family, in order to pinpoint devices that may have failed, or other interesting events: identify and highlight unusual and surprising data (analytic monitoring). MacroBase consists of a customizable pipeline of outlier detection, summarization and ranking operators. To increase efficiency and accuracy it implements several cross-layer optimizations across robust estimation, pattern mining and sketching procedures. The design choices arise from the observation that IoT data has some distinct properties:
Data produced by IoT applications often exhibits regular structure (comes from an ordinary distribution).
Interesting behaviors are often systemic (many data points will exhibit the trend).
Despite being systemic, interesting behaviors are often relatively rare (interesting events are uncommon).
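A minimal sketch of the robust-estimation flavour of outlier detection: score each reading by its distance from the median in units of the median absolute deviation (MAD). Because median and MAD are robust statistics, the regular bulk of the data sets the scale and the rare systemic points cannot inflate it the way they inflate a mean and standard deviation. The threshold and the sensor stream are hypothetical, not MacroBase’s actual operators.

```python
from statistics import median

def mad_outliers(readings, threshold=3.0):
    """Flag readings whose |x - median| / MAD exceeds the threshold.
    Median and MAD are robust: rare outliers cannot drag the
    estimates the way they drag mean and standard deviation."""
    med = median(readings)
    mad = median(abs(x - med) for x in readings)
    return [x for x in readings if abs(x - med) / mad > threshold]

# Hypothetical sensor stream: regular structure plus one rare spike
stream = [20.1, 19.9, 20.3, 20.0, 19.9, 20.2, 35.7, 20.1, 20.0]
print(mad_outliers(stream))  # [35.7]
```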
Zachary G. Ives in (citation not found: cidr2015-Ives) says that a view at scale yields many benefits, and this is evident in (citation not found: ieee-3-googlers). “Follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail.”
Georgia Kapitsaki in (citation not found: kapitsaki-2015) proposes a context-extraction technique for existing datasets. The most popular definition of context is given by Dey and Abowd: “Context is any information that can be used to characterize the situation of an entity. An entity is a person, place, or object that is considered relevant to the interaction between a user and an application, including the user and applications themselves.” The use of context to provide relevant information and/or services to the user, where relevancy depends on the user’s task, is known as context-awareness.