
Cloud-Native Repositories for Big Scientific Data
  • Ryan Abernathey,
  • Tom Augspurger,
  • Anderson Banihirwe,
  • Charles C Blackmon-Luca,
  • Timothy J Crone,
  • Chelle L Gentemann,
  • Joseph J Hamman,
  • Naomi Henderson,
  • Chiara Lepore,
  • Theo A Mccaie,
  • Niall H Robinson,
  • Richard P Signell


Scientific data has traditionally been distributed via downloads from data server to local computer. This way of working suffers from limitations as scientific datasets grow towards the petabyte scale. A "cloud-native data repository," as defined in this paper, offers several advantages over traditional data repositories: performance, reliability, cost-effectiveness, collaboration, reproducibility, creativity, downstream impacts, and access and inclusion. These objectives motivate a set of best practices for cloud-native data repositories: analysis-ready data, cloud-optimized (ARCO) formats, and loose coupling with data-proximate computing. The Pangeo Project has developed a prototype implementation of these principles by using open-source scientific Python tools. By providing an ARCO data catalog together with on-demand, scalable distributed computing, Pangeo enables users to process big data at rates exceeding 10 GB/s. Several challenges must be resolved in order to realize cloud computing's full potential for scientific research, such as organizing funding, training users, and enforcing data privacy requirements.

Peer review status: Published

03 Nov 2020: Submitted to Computing in Science and Engineering
18 Jan 2021: Published in Computing in Science and Engineering
01 Mar 2021: Published in Computing in Science & Engineering, volume 23, issue 2, pages 26-35. DOI: 10.1109/MCSE.2021.3059437