Astronomical Data Curation and Analysis: What has worked and what has not?


Abstract. Here we explore various aspects of long-term, high-volume data curation and analysis. What has worked in astronomy? What has not? What does it mean to say something has worked? What do astronomers think? What are the financial, social, technical, and practical issues that limit how data can be used in the future?

Successful Curation and Analysis

What does it mean to say that a project has been successful in data curation and analysis?

Institutions like HEASARC and IPAC play a critical role in maintaining data compatibility.

What kind of problems are faced by data curators?


Example: MWA (Ashish)

Standards drift

EXOSAT responses are no longer usable because the software has changed.

Documentation loss

Lots of things that are “obvious” are not written down, or are not written down clearly. How much effort will someone make to track down the right procedures?


How much effort will people and institutions make to bring their data products to be fully consistent with everything else?

Data volume

SDO/AIA is not keeping unprocessed data because it takes up too much space. What if the calibration improves? What happens to old data that were processed with the wrong calibration? Tough luck.


Software and platforms keep changing. Tools must be simple to use, comprehensive, and kept up to date.

Suggested Solutions

Solutions suggested by various people.

Focus on Bayesian Methodology

Keep the data in a form that makes it possible to generate full posterior densities at any time. But is this practical? How much original data can be kept? The closer the data get to the instrument, the more calibration products must be held, and the greater the chances of breaks in the chain.

How do you know ahead of time how the data will be used? For example, one could imagine storing the full posterior density of source intensity in a catalog, but this says nothing about undetected sources. Imagine that someone 100 years from now is looking through LSST images for a supernova progenitor, and that they have magical statistical tools to suppress false positives and push the detection threshold to currently unimaginable levels. What kind of input data would they need? We have no way of knowing.
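
To make the idea of "storing the full posterior density of source intensity" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a method used by any particular catalog: the counts in a source aperture are taken to be Poisson with mean s + b, the expected background b is assumed known and positive, and the prior on s is flat for s >= 0. The function name, the grid, and the input numbers are all hypothetical.

```python
import numpy as np
from math import lgamma


def source_intensity_posterior(n_counts, expected_background, s_grid):
    """Posterior density of the source intensity s (in counts) on a grid.

    Assumptions (illustrative only): n_counts ~ Poisson(s + b), with the
    expected background b known and > 0, and a flat prior on s >= 0.
    """
    s_grid = np.asarray(s_grid, dtype=float)
    mu = s_grid + expected_background
    # Poisson log-likelihood evaluated at every grid point.
    log_like = n_counts * np.log(mu) - mu - lgamma(n_counts + 1)
    # Subtract the maximum before exponentiating to avoid underflow.
    unnorm = np.exp(log_like - log_like.max())
    # Normalize with the trapezoid rule so the density integrates to 1
    # over the grid.
    area = np.sum(0.5 * (unnorm[1:] + unnorm[:-1]) * np.diff(s_grid))
    return unnorm / area


# Hypothetical catalog entry: 25 counts observed, 5 expected from background.
s_grid = np.linspace(0.0, 60.0, 6001)
post = source_intensity_posterior(n_counts=25, expected_background=5.0, s_grid=s_grid)

# With a flat prior the posterior mode sits near n - b.
mode = s_grid[np.argmax(post)]
```

A catalog built this way would hold one such array per detected source. The limitation raised above is visible in the inputs: the calculation needs the counts and the background for a chosen aperture, so positions where no source was detected have no entry to revisit unless the underlying images and calibration are also kept.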

Constant Maintenance

Necessary, but costs a lot.


The following survey was sent out CfA-wide:

For the “Radcliffe Exploratory Seminar on Data Curation and Analysis”, we have been asked to lead a discussion on the topic: “Astronomical Data Curation and Analysis: What has worked and what has not?”

So that we can better understand and present the views of the astronomy community, we have developed a short list of questions. If you have views on this issue, please answer by e-mail. Even if you have comments on just a single question, we would like to hear from you. Feel free to add comments on areas we have not mentioned. We will be able to take into account for the discussion any comments we receive by 5 PM Wednesday.

Please let us know if we can quote comments you make, or if you would prefer to remain anonymous.

With thanks,

Rosanne Di Stefano, Vinay Kashyap, and Ashish Mahabal


Background: When we talk about data curation, we have to define what we mean by “success”. One measure of success is that the data can be used correctly and effectively by researchers who were not involved in the original project, even years after the data collection was complete.

  1. What old dataset have you personally had to reanalyze for different purposes than originally intended, and how much time did you spend on it? Did you consider the time expenditure to be reasonable?

  2. In your view, what are examples of successful data curation? What was done to achieve this success?

  3. In your view, what are examples of unsuccessful data curation? What do you think would have been needed to make the data curation more successful?

  4. What current project do you believe has the right procedures in place to enable the data, calibration, and documentation to be useful and relevant for unforeseen analysis 15-20 years from now? What is being done that gives you this confidence?

  5. Are there any examples of current projects for which you are concerned about the quality of data curation?

  6. Do you have comments about any of the following? (Please feel free to add comments on other topics you think are relevant.)

    • VO/standards (NVO/VAO/IVOA)


    • optical versus radio versus X-ray data curation

    • software aligned with curation

    • ADS-like services

    • WWT/Google sky and their use