Cloud-Native Repositories for Big Scientific Data

Ryan Abernathey; Tom Augspurger; Anderson Banihirwe; Charles C Blackmon-Luca; Timothy J Crone; Chelle L Gentemann; Joseph J Hamman; Naomi Henderson; Chiara Lepore; Theo A Mccaie; Niall H Robinson; Richard P Signell

doi:10.22541/au.160443768.88917719/v1

Abstract

Scientific data has traditionally been distributed via the "download model," in which scientists bring datasets to their personal computers for analysis. But this way of working suffers from major limitations as scientific datasets grow towards the petabyte scale. This article discusses the potential of cloud computing to accelerate scientific research and defines the concept of a "cloud native data repository," distinct from a traditional data repository. We enumerate the objectives for such repositories (performance, reliability, cost-effectiveness, collaboration, reproducibility, creativity, downstream impacts, and access & inclusion) and use these objectives to define a set of best practices for cloud native data repositories, focusing on the importance of analysis-ready data, cloud-optimized formats, and loose coupling with data-proximate computing. We describe a prototype implementation of these principles by the Pangeo Project using open source scientific Python tools. We conclude by discussing some practical challenges for the future development of cloud native data repositories.

03 Nov 2020: Submitted to Computing in Science and Engineering

18 Jan 2021: Published in Computing in Science and Engineering

01 Mar 2021: Published in Computing in Science & Engineering, volume 23, issue 2, pages 26-35. doi:10.1109/MCSE.2021.3059437

Peer review status: Published