What to Keep and How to Analyze It: Data Curation and Data Analysis with Multiple Phases


This open document is being used to describe and record the events at the Radcliffe Exploratory Seminar on Data Curation and Analysis, to be held at the Radcliffe Institute for Advanced Study, May 9-10, 2013.

This Google Drive Directory should be used to deposit all files contributed by participants before and during the meeting. (Click "Open in Drive" on your browser to make a new folder, e.g. with your name as its name.)

This Google Doc is used for collaborative real-time note-taking.

ABSTRACT: Rapid advances in technology have allowed us to collect vast amounts of data in myriad fields and forms, but our ability to manage and analyze these data has not kept pace. As a result, the amount of data collected far exceeds what can be analyzed and, often, what can be archived. These issues only become more pressing as data collection accelerates. Astronomers and astrophysicists, for example, collect terabytes of data per night; the phrase “drowning in a data tsunami” is increasingly used to describe this situation. The issues of what to keep and what to distribute are surprisingly complex, even when we put aside technological issues such as long-term storage and retrieval. A central challenge is the fundamental conflict between reducing the size of data and preserving information for future scientific inquiries and statistical analyses. Complicating matters further, the parties/teams involved in the entire data collection, curation, and analysis process often have only limited communication with each other owing to the sequential nature of this process. This seminar brings together a core group of leading experts and emerging scholars in information and natural sciences to discuss, debate, and design principles and strategies to address this grand challenge, which increasingly affects almost every aspect of science and society.

GOAL: By gathering experts from information and natural sciences, we aim to start building a set of principles and methods that will allow us to understand such problems and to provide better preprocessing, analyses, and data preservation, especially in the context of the natural sciences. The ultimate goals of this research include providing methods for assessing the validity of such collaborative analyses, guidance on statistically principled preprocessing, and a rich new theory of statistical learning and inference with multiple parties. We believe that this collaboration will simultaneously sow the seeds for innovative mathematical theory and shed light on directly usable guidelines for the construction and curation of scientific databases.

Draft Schedule of Events, May 9-10, 2013

Location: Room 112, Radcliffe Gymnasium, Radcliffe Yard, 18 Mason Street, Cambridge, MA (Red pin on this map marks the front door of the Radcliffe Gymnasium--zoom in!)

Day 1 (Thursday, May 9)

8:30 AM - 9:00 AM Continental Breakfast

9:00 AM Introductory remarks and welcome address

SESSION I 9:15 AM – 12:30 PM Quantitative and qualitative perspectives on multiphase science – Beginning a dialogue

9:15-11:45 Introductions: each of 16 participants will answer the following questions (5 min/person, including short discussions & coffee break, total of 2.5 hours.)

  1. What about your background gives you an interest in data curation?
  2. What do you think is the most important opportunity good data curation offers? (Please just one!)
  3. What do you think is the biggest danger facing scientific research today if we don't improve data curation? (Please just one!)

Coffee Break at an appropriate stopping point during the above, at roughly 10:30.

11:45-12:30 Introduction to solutions proposed in the literature (Part I) Presented by: Meng, Borgman, Crosas, Pepe et al. (TBD)

12:30 PM – 1:30 PM Lunch

1:30-2:00 Introduction to solutions proposed in the literature (Part II) Presented by: Meng, Borgman, Crosas, Pepe et al. (TBD)

SESSION II 2:00 PM – 5:00 PM Specific challenges in data curation, provenance, and multiphase analysis

2:00 PM--4:00 PM

Roughly 40 minutes for each of the topics below (as amended at the Workshop). Suggested discussion leaders are indicated, but changes can and will(!) be made to respond to participant suggestions. Each workshop attendee will "sign up" (at lunchtime) for 3 discussions total, to be held within groups of roughly 5 or 6 people each. Multiple rooms will be available, and a schedule of which discussions will take p