Research Center as Distant Publisher: Developing Non-Consumptive Compliant Open Data Worksets to Support New Modes of Inquiry

Robert H. McDonald, Indiana University, Bloomington, IN, USA


The HathiTrust Research Center (HTRC), founded in 2010, is managed by Indiana University Bloomington and the University of Illinois at Urbana-Champaign under an agreement with the HathiTrust Board of Governors and the University of Michigan. The HTRC's mission is to support new knowledge creation through novel computational uses of the HathiTrust Digital Library (HTDL). By introducing the concept of distant publishing, this short paper discusses ideas for data and software publication that support the HTRC's non-consumptive research methodologies and offer scholars new methods of research inquiry.


In the original Google Books Settlement Agreement in 2008 (Courant 2009), funds were to be set aside to create a research center that would enable researchers worldwide to carry out data mining and analysis on texts, both in the public domain and under copyright, in a manner that was secure and compliant with applicable U.S. copyright law. This did not happen, because the court rejected the agreement in 2011. Nevertheless, the HTDL announced in 2011 that Indiana University Bloomington and the University of Illinois at Urbana-Champaign would run the HTRC under a cooperative funding agreement with the HathiTrust Board of Governors and the University of Michigan. Since 2014, the HTRC has offered, as an active production service, tools for analyzing a set of out-of-copyright content totaling around 4.4 million volumes. In 2016, the HTRC plans to enable analysis of the entire 14-million-volume corpus currently held by the HTDL, the largest digital academic library in North America.

HTRC and Non-Consumptive Research

The HTRC has developed a process to define and work within the concept of non-consumptive computational access to support fair use of the HTDL corpus, as defined within the Google Books Settlement Agreement that was part of the Authors Guild et al. v. Google Inc. case.

Currently the HTRC defines the process for non-consumptive use of the HTDL corpus as:

Research in which computational analysis is performed on one or more books, but not research in which a researcher reads or displays substantial portions of a book to understand the intellectual content presented within the book.

Operationally, from the perspective of the HTRC research cyberinfrastructure, the HTRC defines non-consumptive research as:

That which requires that no action or set of actions on the part of users, either acting alone or in cooperation with other users over the duration of one or multiple sessions, can result in sufficient information gathered from a collection of copyrighted works to reassemble pages from the collection.

This concept has been further refined in the course of the development of the HTRC Data Capsule (Zeng 2014) for secure data analysis and the development of the HTRC Workset Ontology (Jett 2016).
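To make the operational definition more concrete, the following is a minimal sketch of one common kind of derived data that discards the information needed to reassemble a page: unordered per-page token counts. It is an illustration only and is not taken from the HTRC's own tools; the function name page_token_counts and the sample sentence are assumptions introduced here, and whether a given data set meets non-consumptive standards remains a matter for review.

```python
import re
from collections import Counter


def page_token_counts(page_text: str) -> dict:
    """Hypothetical helper: reduce one page to unordered token counts.

    Word order, punctuation, and capitalization are discarded, so the
    output records how often each word occurs but not where it occurs.
    """
    tokens = re.findall(r"[a-z]+", page_text.lower())
    # Sort alphabetically so the stored order reveals nothing about
    # the order in which the words appeared on the page.
    return dict(sorted(Counter(tokens).items()))


if __name__ == "__main__":
    sample_page = "The whale, the white whale, was seen."
    print(page_token_counts(sample_page))
```

Two pages that contain the same words in a different order produce identical output under this sketch, which is the property the definition above is concerned with: the counts alone do not determine the original sequence of words.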

HTRC as Publisher

During the course of work with scholars using the HTRC tools and services to create derivative non-consumptive data sets, the Center has often taken on a set of the roles traditionally played by publishers. These data sets are reviewed by members of the HTRC staff for compliance with non-consumptive use standards prior to release to the authors.

As part of this work, the HTRC has offered as a service the capability to publish these non-cons