This is a review of manuscript CiSESI-2018-02-0016 submitted to Computing in Science & Engineering: "CoRR — The Cloud of Reproducible Records" (Congo, Traoré, Hill and Wheeler, 2018).

Overview

The paper presents CoRR (Cloud of Reproducible Records), a Web platform for storing and managing records from different tools that create snapshots of computational environments for reproducibility purposes. The authors refer to these tools as CVC (Computation Version Control) tools. In a nutshell, these tools capture the state of the environment in which a computation is run (e.g., OS and hardware information, library dependencies, system variables), in addition to the code and the data. Examples of CVC tools include Sumatra, ReproZip, and CDE. The authors argue that CVC tools face major adoption issues, and that one of the main reasons is the lack of a Web interface for sharing and managing CVC records (similar to what GitHub or BitBucket do for SVC tools). CoRR was designed to fill this gap and to facilitate integration among these tools by providing a common management platform.

A common platform for these tools is indeed interesting and useful for reproducibility. However, the contributions of the manuscript are still not clear. More details are provided in the next section, but here is a summary of the main issues:

1. The differences between CoRR and existing data repositories need to be made clearer.
2. The name CVC is misleading.
3. The way the metadata is stored in the platform is not clear.
4. Diffs in the platform are manual rather than automatic, and there is no discussion of the challenges related to these diffs.
5. Related work about provenance, workflows, and repositories is missing.

My recommendation is "Author Should Prepare A Major Revision For A Second Review".

Detailed Review

I should note that I tried to get access to the platform, but I was not able to (no confirmation email has been sent as of yet). I also tried to use the search feature on the main Website, but it keeps loading after pressing the return key and no results are returned.

1. The differences between CoRR and existing data repositories need to be made clearer.

CoRR is a repository of computation records. But what makes it different from other data repositories? This is still not clear to me.

It seems that one of the main benefits of CoRR is the ability to expose the metadata that the CVC tools capture and to make that metadata queryable. For instance, in a regular data repository, if I want to search for projects that used scikit-learn, I can only find such information if it happens to appear in the description of the artifacts. In CoRR, on the other hand, this information could be made automatically available for querying, since tools like Sumatra or ReproZip capture such dependencies.

Are these metadata indeed queryable in CoRR? If yes, this is a major benefit and should be made more explicit in the paper. In general, the paper would benefit from a section where the authors explicitly discuss the main differences between CoRR and existing data repositories when it comes to CVC tools, i.e., why would someone choose CoRR rather than any of the existing repositories?
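To make the question concrete, this is the kind of query I have in mind. The sketch below is written against MongoDB (which the paper already uses as its data store); the collection layout and field names are my own invention, not CoRR's actual schema:

    from pymongo import MongoClient

    # Purely illustrative: the collection layout and field names below are my
    # own guesses, not CoRR's actual schema.
    client = MongoClient("mongodb://localhost:27017")
    records = client["corr"]["records"]

    # Find every record whose captured environment lists scikit-learn as a
    # dependency, regardless of which CVC tool produced the record.
    for rec in records.find({"dependencies.name": "scikit-learn"}):
        print(rec.get("project"), rec.get("tool"), rec.get("created"))

If something along these lines is already possible through CoRR's API or Web interface, showing it would make the benefit over generic data repositories much more tangible.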
2. The name CVC is misleading.

CVC stands for Computation Version Control, but neither ReproZip nor CDE does version control: they create a snapshot of the computation, but they do not have a mechanism for version control. The authors seem to be referring to tools that capture provenance related to the computational environment, but that do not necessarily provide version control, so the nomenclature should be changed.

3. The way the metadata is stored in the platform is not clear.

The section "Adaptive and Open Database Model" was not clear enough to me. How are different metadata (from different tools) stored in a single data store? The authors do present the MongoDB models, but there are no details on how metadata from different tools are integrated into a single model. And why not use a representation like PROV (https://www.w3.org/TR/prov-primer/) for integrating the models?
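To illustrate the suggestion, a captured record could be mapped onto PROV with a few lines of the Python prov package. Everything below (the namespace, identifiers, and attributes) is purely illustrative and not a claim about how CoRR's records actually look:

    from prov.model import ProvDocument

    doc = ProvDocument()
    # Hypothetical namespace; identifiers and attributes are only illustrative.
    doc.add_namespace("corr", "http://example.org/corr#")

    run = doc.activity("corr:run-0042")            # one captured execution
    code = doc.entity("corr:script.py", {"corr:sha1": "ab12cd"})
    env = doc.entity("corr:environment", {"corr:os": "Ubuntu 16.04",
                                          "corr:python": "3.5.2"})
    output = doc.entity("corr:results.csv")

    doc.used(run, code)              # the run used the code...
    doc.used(run, env)               # ...inside this captured environment
    doc.wasGeneratedBy(output, run)

    print(doc.serialize(indent=2))   # PROV-JSON by default

A shared representation of this kind would also make the cross-tool comparisons discussed in the next point easier to define.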
4. Diffs in the platform are manual rather than automatic, and there is no discussion of the challenges related to these diffs.

At the end of the paper, the authors discuss the concept of a diff as a way to tell whether a computation X is a replicate, a repeat, or a reproduction of a computation Y. This is a really cool feature, but it looks like users of CoRR need to define such diffs manually, which is certainly not scalable when dealing with hundreds of computations. Automatically figuring out whether two computations are similar in terms of reproducibility is challenging, in particular if they were captured by different tools, but it is certainly a very useful feature for a repository such as CoRR. I was expecting at least a more detailed discussion of this.

5. Related work about provenance, workflows, and repositories is missing.

Capturing provenance from computations is certainly not a novel topic, and some references are missing. The authors should acknowledge scientific workflow management systems (e.g., Taverna \citep{Missier_2010}, Kepler \citep{Ludascher:2006:SWM:1148437.1148454}, and VisTrails \citep{vo2011}), since they are known for capturing provenance from experiments \citep{Davidson_2008} \citep{Freire_2008}. In terms of representing provenance information, the authors should take a look at PROV (https://www.w3.org/TR/prov-primer/). There are also other tools that capture provenance from computational environments; although some might not be widely adopted, they are worth mentioning: PTU \citep{pham}, CARE \citep{Janin_2014}, Arnold \citep{186206}, and noWorkflow \citep{Murta_2015}. I also recommend looking at the related work sections of these papers for additional relevant references.

Finally, since CoRR is a repository, it is important to acknowledge existing collaborative data repositories, e.g., Dataverse, figshare, OSF, etc. Again, as mentioned above, it is important to provide a detailed comparison against these repositories.

Additional Comments

The authors mention in the Introduction that the lack of a platform for storing and managing computation records is probably one of the main reasons for the slow adoption of CVC tools. However, no evidence is given for this claim. Is there a reference or further study that the authors can provide to back it up? The motivation is not clear.

There are other tools (including noWorkflow) supported by CoRR (Figure 3). Why aren't these tools mentioned in the paper?

Using forked repositories of the existing tools might not be ideal. It might be hard to keep these forks up to date as the original repositories keep changing. For instance, a user might need the latest features of Sumatra, but the CoRR-related Sumatra repository might be a few commits behind. Are there any thoughts on that? Why not just have standalone CoRR software that reads project information from Sumatra / ReproZip / CDE and uploads the data to the platform?

Disclaimer