GADFLY: A pandas-based Framework for Analyzing GADGET Simulation Data




In the past decade, astrophysical simulations have increased dramatically in both scale and sophistication, and the typical size of the datasets produced has grown accordingly. However, the software tools for analyzing such datasets have not kept pace, and simply manipulating the data is now one of the primary barriers to exploratory investigation. This problem is particularly acute for users of the popular smoothed particle hydrodynamics (SPH) code gadget (Springel et al., 2001; Springel, 2005). Both gadget and gizmo (Hopkins, 2015), which uses the same data storage format, are widely used to investigate a range of astrophysical problems; unfortunately, this popularity has also fragmented the data storage format, as each research group modifies the output to suit its needs. This state of affairs has historically forced significant duplication of effort, with individual research groups separately developing their own analysis scripts to perform similar operations.

Fortunately, the issue of data management and analysis is not unique to astronomy, and the resulting overlap with the needs of the broader scientific community and the industrial community at large provides a large pool of scientific software developers to tackle these common problems. In recent years, this broader community has settled on python as its programming language of choice due to its efficacy as a ‘glue’ language and the rapid development it allows. This has led to a robust scientific software ecosystem, with packages for numerical data analysis like numpy (van der Walt et al., 2011), scipy (Jones et al., 2001), pandas (McKinney, 2010), and scikit-image; matplotlib (Hunter, 2007) and seaborn for plotting; scikit-learn for machine learning; and statistics and modeling packages like scikits-statsmodels, pymc, and emcee (Foreman-Mackey et al., 2013).

Python is quickly becoming the language of choice for astronomers as well, with the Astropy project (Robitaille et al., 2013) and its affiliated packages providing a coordinated set of tools implementing the core astronomy-specific functionality needed by researchers. Additionally, the development of flexible python packages like yt (Turk et al., 2011) and pynbody (Pontzen et al., 2013), capable of analyzing and visualizing astrophysical simulation data from several different simulation codes, has greatly improved the ability of computational researchers to perform useful, insight-generating analysis of their datasets.

Recently, the scientific python community has begun to converge on the DataFrame provided by the high-performance pandas data analysis library as a common data structure. As a result, once data is loaded into a DataFrame, it becomes much easier to take advantage of the powerful analysis tools provided by the broader scientific computing ecosystem. With this in mind, we present a pandas-based framework for analyzing gadget simulation data, gadfly: the GAdget DataFrame LibrarY. Rather than providing an alternative to the existing yt and pynbody projects, the aim of the gadfly project is to ease interoperability with the python ecosystem at large, lowering the barrier for access to the tools created by this broader community.

In this paper we present the first public release (v0.1) of gadfly, which is available at . The framework design and organizational structure are outlined in Section \ref{sec:framework}, followed by a description of the included SPH particle rendering in Section \ref{sec:vis}. Our plans for future development are outlined in Section \ref{sec:future}, and a summary is provided in Section \ref{sec:summary}.

Figure: Initializing a simulation, defining a PartType dataset, and loading data. \label{fig:usage_example}

A Framework built on pandas


There are several motivations for building an analysis framework around the pandas DataFrame. The guiding principle underlying the design of this framework is to enable exploratory investigation. This requires both intelligent memory management for handling out-of-core datasets and a robust indexing system to ensure that particle properties do not become misaligned in memory. Using the pandas DataFrame as the primary data container, rather than separate numpy arrays, makes it much easier to keep different particle properties indexed correctly while still affording the flexibility to load and remove data from memory at will. In addition, pandas itself is a thoroughly documented, open-source, BSD-licensed library providing high-performance, easy-to-use data structures and analysis tools, and has a strong community of developers working to improve it. More broadly, as pandas is becoming the de facto standard for data analysis in python, building on it simplifies interoperability with the rest of the available tools.
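The alignment argument can be illustrated with plain pandas. The following is a minimal sketch, not gadfly's own interface: the column names, particle IDs, and values are hypothetical, and only standard pandas calls are used. It shows that a property arriving in a different particle order is matched by particle ID rather than by position, and that a column can be dropped once it is no longer needed.

```python
import numpy as np
import pandas as pd

# Hypothetical gas particle table, indexed by particle ID.
# The IDs are deliberately unsorted, as they often are in a snapshot.
particle_ids = np.array([103, 101, 104, 102])
gas = pd.DataFrame(
    {"mass": [1.0, 2.0, 3.0, 4.0], "density": [10.0, 20.0, 30.0, 40.0]},
    index=pd.Index(particle_ids, name="particle_id"),
)

# A second property loaded later, stored in a different particle order.
temperature = pd.Series(
    [200.0, 100.0, 400.0, 300.0],
    index=pd.Index([102, 101, 104, 103], name="particle_id"),
    name="temperature",
)

# Column assignment aligns on the index (particle ID), not on position,
# so each temperature lands on the correct particle.
gas["temperature"] = temperature

# Drop a property that is no longer needed; the result no longer carries it.
gas = gas.drop(columns="density")

# Boolean selection keeps every remaining property aligned.
hot = gas[gas["temperature"] > 250.0]
print(hot)
```

With separate numpy arrays, the same assignment would silently pair each temperature with the wrong particle unless both arrays were sorted identically beforehand.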

Gadfly is designed for use with simulation data stored in the HDF5 format. While we otherwise aim to keep gadfly as general as possible, some assumptions about the storage format are necessary. Each particle type is expected to be contained in a different HDF5 group, labeled PartType0, PartType1, etc.; a Header group is also expected, containing metadata for the simulation snapshot as HDF5 attributes. All particles are expected to have the following fields defined: particle IDs, masses, coordinates, and velocities. SPH particles are additionally expected to have a smoothing length, density, and internal energy. Additional fields not included in the original gadget specification, such as chemical abundances, are also supported.
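The expected layout can be sketched with h5py and pandas. This is an illustration under stated assumptions rather than gadfly's loader: the dataset names (ParticleIDs, Masses, Coordinates, and so on) and Header attribute names follow the common gadget HDF5 convention but are not specified in the text above, the helper functions write_toy_snapshot and load_part_type are hypothetical, and the snapshot is a tiny in-memory file written only for the example.

```python
import h5py
import numpy as np
import pandas as pd


def write_toy_snapshot(f, n_gas=4):
    """Write a tiny snapshot in the assumed layout (hypothetical helper)."""
    header = f.create_group("Header")
    # Metadata is stored as HDF5 attributes on the Header group.
    header.attrs["NumPart_ThisFile"] = np.array([n_gas, 0, 0, 0, 0, 0])
    header.attrs["Time"] = 0.5

    # One group per particle type; PartType0 holds the SPH (gas) particles.
    gas = f.create_group("PartType0")
    gas.create_dataset("ParticleIDs", data=np.arange(1, n_gas + 1))
    gas.create_dataset("Masses", data=np.full(n_gas, 2.0))
    coords = np.arange(3 * n_gas, dtype=float).reshape(n_gas, 3)
    gas.create_dataset("Coordinates", data=coords)
    gas.create_dataset("Velocities", data=np.zeros((n_gas, 3)))
    gas.create_dataset("SmoothingLength", data=np.full(n_gas, 0.1))
    gas.create_dataset("Density", data=np.full(n_gas, 1.5))
    gas.create_dataset("InternalEnergy", data=np.full(n_gas, 3.0))


def load_part_type(f, part_type, fields):
    """Load selected fields of one particle type into a DataFrame."""
    group = f[part_type]
    ids = group["ParticleIDs"][()]
    columns = {}
    for name in fields:
        data = group[name][()]
        if data.ndim == 1:
            columns[name] = data
        else:
            # Split vector fields into one column per component.
            for axis, label in enumerate("xyz"):
                columns[f"{name}_{label}"] = data[:, axis]
    return pd.DataFrame(columns, index=pd.Index(ids, name="ParticleIDs"))


# The "core" driver with backing_store=False keeps the file in memory only.
with h5py.File("toy_snapshot.hdf5", "w", driver="core", backing_store=False) as f:
    write_toy_snapshot(f)
    header = dict(f["Header"].attrs)
    gas = load_part_type(f, "PartType0", ["Masses", "Coordinates", "Density"])

print(header["Time"])
print(gas)
```

Only the requested fields are read from the file, which is the property that makes selective loading of large snapshots practical.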

Here, we provide an overview of the design and capabilities of the gadfly framework, including the Simulation, Snapshot, and PartType DataFrame objects at the core of gadfly (Section \ref{sec:hierarchy}), the usage of which is demonstrated in Figure \ref{fig:usage_example}. Our approach to file access and intelligent memory management (Section \ref{sec:fileIO}), our handling of unit conversions (Section \ref{sec:units}) and coordinate transformations (Section \ref{sec:coords}), and the included utilities for parallel batch processing (Section \ref{sec:parallel}) are also described.