Appendix 1: Original Workshop Justification

With dramatic increases in the size, diversity, and complexity of data available for scientific discovery, medical advances, education reform, and evidence-based policy making, the entire enterprise of quantitative scientific inquiry faces unprecedented challenges and opportunities. In particular, the vast majority of current quantitative inquiries are not carried out by a single individual or even a single team. The final scientific inference and, more generally, quantitative learning is the result of a multi-party effort, with teams entering the process sequentially over several phases (e.g., data collection, processing, curation, and analysis). Due to practical constraints such as resource limitations and confidentiality, the team involved in a given phase may not have full knowledge of the assumptions made by, and resources available to, those coming before or after it. This fact compels all of us involved in the production and preservation of scientific data to rethink the traditional paradigms of statistical analysis and data preservation, which have been built around two ideas: (1) the academic paper as the primary repository of scientific knowledge and information, and (2) the analysis of data beginning (and ending) with a single team that has essentially full knowledge of the data’s origins and all assumptions made in its genesis.

Shifts in the scientific landscape call for revision of both of these ideas. Projects in astronomy, biology, ecology, and social sciences (to name a small sampling) are increasingly focused on building databases for future analyses as a primary objective. These projects must decide what levels of preprocessing to apply to their data and what additional information to provide to their users. Clearly, providing all of the original data allows the most flexibility in subsequent analyses. In practice, the journey from raw data to a complete analysis is typically too intricate and problematic for the majority of users, who instead choose to use preprocessed output. Unfortunately, decisions made at this stage can be quite treacherous from a statistical perspective because of the potential for serious information loss and/or information distortion.

Scientific data released to end-users almost always undergo editing, imputation, and other forms of preprocessing before they are analyzed. When such steps are taken, the data analysis becomes a collaborative endeavor among all parties involved in data collection, preprocessing, and analysis. Such settings are rife with subtleties and pitfalls. Teams handling the data downstream do not, and often cannot, have a perfect understanding of the entire phenomenon at hand; the final results will inevitably embody some combination of all parties’ judgments, and some preprocessing can irreversibly destroy information in the raw data. By gathering experts from the information and natural sciences, we aim to begin building a set of principles and methods that will allow us to understand such problems and to provide better preprocessing, analyses, and data preservation, especially in the context of the natural sciences. The ultimate goals of this research include methods for assessing the validity of such collaborative analyses, guidance on statistically principled preprocessing, and a rich new theory of statistical learning and inference with multiple parties. We believe that this collaboration will simultaneously sow the seeds for innovative mathematical theory and yield directly usable guidelines for the construction and curation of scientific databases.

Defects incurred by earlier parties may cause more damage than those introduced in subsequent analyses, just as problems in the data collection stage are usually harder to address than problems in the analysis stage. This is especially true when some of the earlier steps are “irreversible”. An example of great current interest in astronomy and astrophysics concerns the use of data from the Chandra X-ray Observatory. As described in the Chandra documentation (http://cxc.harvard.edu/ciao/dictionary/sdp.html), “Chandra data” come at different levels of processing, from Level 0 “raw data”, which are not recommended for analysis, to Level 3 “higher level information” available to the public. The Level 2 processing is considered irreversible in the documented sense that “information that has been lost cannot be regained from the L2 products alone.” Evidently, judgments have been made regarding what to retain and what to discard, and assessing their impact on subsequent analyses is of great importance for the so-called V&V (Verification and Validation) process. Indeed, the question of “what to keep” has been much debated and discussed in the rapidly growing literature on data curation, yet currently there is little collaboration among fields with overlapping interests in this area. For example, statisticians have been largely absent from such discussions and debates.
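To make the notion of irreversibility concrete, the following minimal sketch (written in Python, with entirely hypothetical numbers that bear no relation to the actual Chandra pipeline) illustrates how a seemingly innocuous preprocessing step, coarse binning, discards information that no downstream analysis can recover:

import numpy as np

# Hypothetical illustration only: coarsening raw measurements into bins is
# irreversible, and a downstream estimate computed from the binned values can
# differ from the raw-data answer.
rng = np.random.default_rng(0)
raw = rng.exponential(scale=2.0, size=10_000)     # stand-in for raw measurements
edges = np.arange(0.0, 20.0, 1.0)                 # coarse preprocessing grid
idx = np.clip(np.digitize(raw, edges) - 1, 0, len(edges) - 1)
binned = edges[idx]                               # only the bin lower edge is kept

print(raw.std(), binned.std())                    # dispersion estimates disagree, and the raw
                                                  # values cannot be reconstructed from binned alone

The point is not the particular numbers but the structure of the problem: once only the binned values are preserved, every subsequent user inherits the same distortion with no way to undo it.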

Such collaborations would appear quite natural given the complementary strengths of the participants. The data curation literature has largely focused on describing how scientists use data, their motivations for data sharing, and the organizational and cultural issues involved in implementing better data curation practices. Meanwhile, computer scientists are developing technical solutions to enable tracking of data provenance and easier access to scientific resources, to name only a few directions. Statisticians, for their part, are interested in developing principled statistical methods for these settings. These lines of research are distinct, but they are necessary complements to one another and could benefit immensely from greater communication and collaboration.

As a specific example of the fundamental restructuring needed to address the aforementioned grand challenge, consider the current paradigm for conducting and evaluating statistical inferences. Statisticians are trained to regard their mathematical models as approximations to a true underlying reality. Consequently, these models are typically not designed to capture the journey from data collection to data analysis. This is very problematic because such journeys necessarily involve judgments and preprocessing by other teams. If the assumptions made and procedures used in the preprocessing phase are incompatible with those used in the final analysis (so-called “uncongeniality” in the statistical literature), then the current statistical framework is ineffective or, at worst, entirely inapplicable. In particular, standard notions such as estimation consistency and unbiasedness become misguided mathematical idealizations. They are misguided because they do not take into account the fact that even if every team in the sequence has reached the best possible answer given its available information and resources, the lack of mutual knowledge can still make the final output significantly inferior to what would be possible using all the information available across teams. Yet it is clear that we still can and should have a theoretical foundation for comparing different methods in such environments. In mathematical terms, we need to reformulate our criteria to take into account additional practical constraints and then seek the most effective methods, rather than comparing methods against a criterion that none can ever satisfy. A general statistical framework for this purpose is now being built. This development can greatly benefit from the input and perspectives of the data curation community, which has a much better understanding of the practical constraints and goals involved in these collaborative research settings.
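As an illustration of how uncongeniality can invalidate standard guarantees, consider the following minimal sketch (again in Python, with all settings hypothetical): a preprocessing team fills in missing values by mean imputation, and the analysis team, unaware of that step, applies a standard complete-data interval. Each team acts reasonably given its own information, yet the combined procedure under-covers:

import numpy as np

# Hypothetical illustration only: mean imputation followed by a complete-data
# analysis yields a nominal 95% interval whose actual coverage falls below 95%.
rng = np.random.default_rng(1)
true_mean, n, n_sims, covered = 0.0, 200, 2000, 0

for _ in range(n_sims):
    x = rng.normal(true_mean, 1.0, n)
    missing = rng.random(n) < 0.5                 # half the values are missing at random
    completed = x.copy()
    completed[missing] = x[~missing].mean()       # preprocessing team: mean imputation
    m = completed.mean()                          # analysis team: standard z-interval,
    se = completed.std(ddof=1) / np.sqrt(n)       # treating the completed data as observed
    covered += abs(m - true_mean) < 1.96 * se

print(covered / n_sims)                           # noticeably below 0.95 in this setup

Neither team has erred by its own lights; the failure arises from the mismatch between their assumptions, which is precisely the phenomenon a reformulated framework must account for.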

Conversely, approaches to data curation would benefit greatly from the involvement of statisticians. Scientists and librarians alike often base decisions about what to select and what to keep on general principles of anticipated future utility, rather than on analyses of actual trends in data use or on demonstrated utility. As a concrete example, at the Center for Embedded Networked Sensing, a five-university NSF Science and Technology Center based at UCLA, the involvement of a statistician (Mark Hansen, a suggested Seminar attendee) midway through the Center’s lifespan radically changed the course of its data collection and data curation: scientists revised their data collection, storage, and retrieval methods, and involved their information science partners in developing better data curation and management practices.

In a nutshell, a Radcliffe Exploratory Seminar provides an ideal forum for intense interdisciplinary exchange on emerging challenges that truly require collaboration across multiple disciplines in order to make meaningful headway. As far as we are aware, if funded, this would be the first workshop to bring leading computer scientists, information scientists, natural scientists, and statisticians under one roof to address some of the most intellectually stimulating and practically challenging problems of the information age.