Research Center as Distant Publisher: Developing Non-Consumptive Compliant Open Data Worksets to Support New Modes of Inquiry

Robert H. McDonald, Indiana University, Bloomington, IN, USA


The HathiTrust Research Center (HTRC), founded in 2010, is managed by Indiana University Bloomington and the University of Illinois at Urbana-Champaign under an agreement with the HathiTrust Board of Governors and the University of Michigan. The HTRC mission supports new knowledge creation through novel computational uses of the HathiTrust Digital Library (HTDL). By introducing the concept of distant publishing, this short paper discusses ideas for data and software publication that support the HTRC non-consumptive research methodologies and offer scholars new methods for research inquiry.


In the original Google Books Settlement Agreement in 2008 (Courant 2009), funds were to be set aside to create a research center that would enable researchers worldwide to perform data mining and analysis on texts in the public domain and under copyright in a manner that was secure and compliant with applicable U.S. copyright law. That center was never created under the settlement, because the court rejected the agreement in 2011. Nevertheless, in 2011 the HTDL announced that Indiana University Bloomington and the University of Illinois at Urbana-Champaign would run the HTRC under a cooperative funding agreement with the HathiTrust Board of Governors and the University of Michigan. Since 2014, the HTRC has offered, as an active production service, tools for analyzing a set of out-of-copyright content totaling around 4.4 million volumes. In 2016, the HTRC plans to enable analysis of the entire 14-million-volume corpus currently held by the HTDL, the largest digital academic library in North America.

HTRC and Non-Consumptive Research

The HTRC has developed a process to define and work within the concept of non-consumptive computational access to support fair use of the HTDL corpus as defined within the Google Books Settlement Agreement that was part of the Authors Guild et al. v. Google Inc. case.

Currently the HTRC defines the process for non-consumptive use of the HTDL corpus as:

Research in which computational analysis is performed on one or more books, but not research in which a researcher reads or displays.

Operationally, from the perspective of the HTRC research cyberinfrastructure, the HTRC defines non-consumptive research as:

That which requires that no action or set of actions on the part of users, either acting alone or in cooperation with other users over the duration of one or multiple sessions can result in sufficient information gathered from a collection of copyrighted works to reassemble pages from the collection.

This concept has been further refined in the course of the development of the HTRC Data Capsule (Zeng 2014) for secure data analysis and the development of the HTRC Workset Ontology (Jett 2016).
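
To make the operational definition concrete, the following minimal Python sketch reduces a page to unordered token counts. It is an illustrative assumption, not the HTRC's actual review procedure or tooling: the function name page_features and the sample sentence are hypothetical. Discarding word order is shown here only as one way a derived data set can move away from the readable page; it does not by itself guarantee non-consumptive compliance, for example on very short pages.

```python
import re
from collections import Counter


def page_features(page_text: str) -> dict:
    """Reduce a page to alphabetically sorted token counts.

    Hypothetical helper for illustration only: word order is discarded,
    so the counts describe the page without reproducing it.
    """
    tokens = re.findall(r"[a-z]+", page_text.lower())
    return dict(sorted(Counter(tokens).items()))


# Hypothetical sample text, not drawn from the HTDL corpus.
page = "The whale, the white whale, rose from the sea."
features = page_features(page)
print(features)
```

Two pages that contain the same words in a different order produce identical counts under this sketch, which is the property the example is meant to show.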

HTRC as Publisher

During the course of work with scholars using the HTRC tools and services to create derivative non-consumptive data sets, the Center has often taken on a set of the roles traditionally played by publishers. These data sets are reviewed by members of the HTRC staff for compliance with non-consumptive use standards prior to release to the authors.

As part of this work, the HTRC has offered as a service the capability to publish these non-consumptive, compliant data sets using a DOI scheme (Downie 2015). This service enables the creation of new derivatives (Downie 2015a) of published non-consumptive, compliant data sets.

A second benefit of opening access to these data sets is the ability to replicate current experiments that have been developed using the HTDL corpus and the HTRC tool set. From this standpoint the HTRC functions as a distant publisher of non-consumptive compliant data sets in support of new models of research inquiry.

Distant Publishing as Concept

Prior to defining the concept of distant publishing, it is first instructive to understand distant reading within the context of digital humanities. Distant reading was first codified in 2000 by noted humanist and scholar Franco Moretti:

Distant reading: where distance . . . is a condition of knowledge: it allows you to focus on units that are much smaller or much larger than the text: devices, themes, tropes – or genres and systems. And if, between the very small and the very large, the text itself disappears, well, it is one of those cases when one can justifiably say, less is more. (Moretti 2000)

Moretti later expanded the concept in his 2013 monograph of the same name (Moretti 2013).

Much like Moretti’s definition, which focuses on enabling a broader view of the text, the distant publisher enables a broader view of data sets by bringing to bear the current corpus of computational tools for large-scale textual data mining and analysis. HTRC as a distant publisher is removed by at least one degree from the creator, and remains distinct from any standardized concept of publisher. Yet data sets are published under the rubric of the HTRC, and these publications are freed from the constraints of copyright in this context due to their non-consumptive nature. Thus we define distant publishing as:

Publication of a non-consumptive data set outside of any standardized publishing construct, removed by \(x\) degrees from the original creator, openly available to the community of scholars for replication and available for re-use in support of the advancement of knowledge.

This definition is one that the HTRC aims to further refine in the coming years. We welcome broader thoughts on this concept from those working to preserve open research data and the software that makes that data accessible for use in scientific experimental replication and re-use for the long-term benefit of the scholarly community.