
Jacob Hummel edited Introduction.tex about 8 years ago

Commit id: cb4c1e966d260505468055cd48fc7b0c52e85efd


In the past decade, astrophysical simulations have increased dramatically in both size and sophistication, and the typical size of the datasets produced has grown accordingly. However, the software tools for analyzing such datasets have not kept pace, so that one of the primary barriers to exploratory investigation is simply manipulating the data. This problem is particularly acute for users of the popular smoothed particle hydrodynamics (SPH) code \textsc{gadget} \citep{SpringelYoshidaWhite2001,Springel2005}. \textsc{gadget} is widely used to investigate a range of astrophysical problems; unfortunately, this also leads to fragmentation of the data storage format, as each research group modifies the output to suit its needs. This state of affairs has historically forced significant duplication of effort, with individual research groups separately developing their own analysis scripts to perform similar operations.

Recently, the scientific \code{python} community has begun to converge on the \code{DataFrame} provided by the high-performance \code{pandas} data analysis library as a common data structure for the ecosystem. As a result, once data is loaded into a \code{DataFrame}, it becomes much easier to take advantage of the powerful analysis tools provided by the broader scientific computing ecosystem. With this in mind, we present \textsc{gadfly}, the \textsc{gadget} \code{DataFrame} library: a \code{pandas}-based framework for analyzing \textsc{gadget} simulation data. This project is in no way intended to replace the far more feature-complete \code{yt} or \code{pynbody} projects. Rather, we focus on implementing the minimum functionality necessary to interface between simulation data in the \textsc{gadget} HDF5 format and the \code{pandas} data analysis library.%Adoption of the platform-independent Hierarchical Data Format (HDF5) for data storage helps mitigate some of these issues, being able to load a dataset into memory is only the first step in performing useful, insight-generating analysis.