Strong Lens Time Delay Challenge: I. Experimental Design

Abstract

The time delays between point-like images in gravitational lens systems can be used to measure cosmological parameters as well as probe the dark matter (sub-)structure within the lens galaxy. The number of lenses with measured time delays is growing rapidly due to dedicated efforts. In the near future, the upcoming *Large Synoptic Survey Telescope* (LSST) will monitor \(\sim10^3\) lens systems consisting of a foreground elliptical galaxy producing multiple images of a background quasar. In an effort to assess the present capabilities of the community to accurately measure the time delays in strong gravitational lens systems, and to provide input to dedicated monitoring campaigns and future LSST cosmology feasibility studies, we pose a “Time Delay Challenge” (TDC). The challenge is organized as a set of “ladders,” each containing a group of simulated datasets to be analyzed blindly by participating independent analysis teams. Each rung on a ladder consists of a set of realistic mock observed lensed quasar light curves, with the rungs’ datasets increasing in complexity and realism to incorporate a variety of anticipated physical and experimental effects. The initial challenge described here has two ladders, TDC0 and TDC1. TDC0 has a small number of datasets, and is designed to be used as a practice set by the participating teams as they set up their analysis pipelines. The non-mandatory deadline for completion of TDC0 will be December 1, 2013. The teams that perform sufficiently well on TDC0 will then be able to participate in the much more demanding TDC1. TDC1 will consist of \(10^3\) lightcurves, a sample designed to provide the statistical power to make meaningful statements about the sub-percent accuracy that will be required to provide competitive Dark Energy constraints in the LSST era.
In this paper we describe the simulated datasets in general terms, lay out the structure of the challenge and define a minimal set of metrics that will be used to quantify the goodness-of-fit, efficiency, precision, and accuracy of the algorithms. The results for TDC1 from the participating teams will be presented in a companion paper to be submitted after the closing of TDC1, with all TDC1 participants as co-authors.

As light travels to us from a distant source, its path is deflected by the gravitational forces of intervening matter. The most dramatic manifestation of this effect occurs in strong lensing, when light rays from a single source can take several paths to reach the observer, causing the appearance of multiple images of the same source. These images will also be magnified in size and thus total brightness (because surface brightness is conserved in gravitational lensing). When the source is time varying, the images are observed to vary with delays between them due to the differing path lengths taken by the light and the gravitational potential that it passes through. A common example of such a source in lensing is a quasar, an extremely luminous active galactic nucleus at cosmological distance. From the observations of the image positions, magnifications, and the time delays between the multiple images we can measure the mass structure of the lens galaxy itself (on scales \(\geq M_{\odot}\)) as well as a characteristic distance between the source, lens, and observer. This “time delay distance” encodes the cosmic expansion rate, which in turn depends on the energy density of the various components in the universe, phrased collectively as the cosmological parameters.

The time delays themselves have been proposed as tools to study massive substructures within lens galaxies (Keeton et al., 2009), and for measuring cosmological parameters, primarily the Hubble constant, \(H_0\) (see, e.g., Suyu et al., 2013, for a recent example), a method first proposed by Refsdal (1964). In the future, we aspire to measure further cosmological parameters (e.g., dark energy) by combining large samples of measured time delay distances (citation not found: Linder2012). It is clearly of great interest to develop to maturity the powers of time delay lens analysis for probing the dark universe.

New wide area imaging surveys that repeatedly scan the sky to gather time-domain information on variable sources are coming online. Dedicated follow-up monitoring campaigns are obtaining tens of time delays (REF). This pursuit will reach a new height when the *Large Synoptic Survey Telescope* (LSST) enables the first long baseline multi-epoch observational campaign on \(\sim\)1000 lensed quasars (citation not found: LSSTSciBook). However, using the measured LSST lightcurves to extract time delays for accurate cosmology will require detailed understanding of how, and how well, time delays can be reconstructed from data with real-world properties of noise, gaps, and additional systematic variations. For example, to what accuracy can time delays between the multiple image intensity patterns be measured from individual doubly- or quadruply-imaged systems for which the sampling rate and campaign length are given by LSST? In order for time delay errors to be small compared to errors from the gravitational potential, we will need the precision of time delays on an individual system to be better than 3%, and those estimates will need to be robust to systematic error. Simple techniques such as the “dispersion” method (Pelt et al., 1994; Pelt et al., 1996) or spline interpolation through the sparsely sampled data (e.g., Tewes et al., 2013) yield time delays which *may* be insufficiently accurate for a Stage IV dark energy experiment. More complex algorithms such as Gaussian Process modeling may hold more promise. None of these methods has been tested on large-scale data sets.

At present, it is unclear whether the baseline “universal cadence” LSST sampling frequency of \(\sim10\) days in a given filter and \(\sim 4\) days on average across all filters (citation not found: LSSTSciBook) (citation not found: LSSTpaper) will enable sufficiently accurate time delay measurements, despite the long campaign length (\(\sim10\) years). While “follow up” monitoring observations to supplement the LSST lightcurves may not be feasible at the 1000-lens sample scale, it may be possible to design a survey strategy that optimizes cadence and monitoring at least for some fields. In order to maximize the capability of LSST to probe the universe through strong lensing time delays, we must understand the interaction between the time delay estimation algorithms and the anticipated data properties. While optimizing the accuracy of LSST time delays is our long term objective, improving the present-day algorithms will benefit the current and planned lens monitoring projects as well. Exploring the impact of cadences and campaign lengths spanning the range between today’s monitoring campaigns and that expected from a baseline LSST survey will allow us to simultaneously provide input to current projects looking to expand their sample sizes (and hence monitor more cheaply) as well as the LSST project, whose exact survey strategy is not yet decided.

The goal of this work, then, is to enable realistic estimates of feasible time delay measurement accuracy to be made with LSST. We will achieve this via a “Time Delay Challenge” (TDC) to the community. Independent, blind analysis of plausibly realistic LSST-like lightcurves will allow the accuracy of current time series analysis algorithms to be assessed, which will lead to simple cosmographic forecasts for the anticipated LSST dataset. This work can be seen as a first step towards a full understanding of all systematic uncertainties present in the LSST strong lens dataset and will also provide valuable insight into the survey strategy needs of both Stage III and Stage IV time delay lens cosmography programs. Blind analysis, where the true value of the quantity being reconstructed is not known by the researchers, is a key tool for robustly testing the analysis procedure without biasing the results by continuing to look for errors until the right answer is reached, and then stopping.

This paper is organized as follows. In Section \ref{sec:light_curves} we describe the simulated data that we have generated for the challenge, including some of the broad details of observational and physical effects that may make extracting accurate time delays difficult, without giving away information that will not be observationally known during or after the LSST survey. Then, in Section \ref{sec:structure} we describe the structure of the challenge, how interested groups can access the mock light curves, and a minimal set of approximate cosmographic accuracy criteria that we will use to assess their performance.

\label{sec:light_curves}

The intensity as a function of time for a variable source is referred to as its light curve. For lensed sources, the light curves of images follow the intrinsic variability of the quasar source, but with individual time delays that are different for each image. Only the relative time delays between the images are measurable, since the unlensed quasar itself cannot be observed. Of course we do not actually measure a light curve, but rather discrete values of the intensity at different epochs. This sampling of the light curves, as well as the noise in the photometric measurement and external effects causing additional variations in the intensity, are all complications in estimating the time delays.

\label{sec:basics}

The history of the measurement of time delays in lens systems can be broadly split into three phases. In the first, the majority of the efforts were aimed at the first known lens system, Q0957+561 (Walsh et al., 1979). This system presented a particularly difficult situation for time-delay measurements, because the variability was smooth and relatively modest in amplitude, and because the time delay was long. This latter point meant that the annual season gaps when the source could not be observed at optical wavelengths complicated the analysis much more than they would have for systems with time delays of significantly less than one year. The value of the time delay remained controversial, with adherents of the “long” and “short” delays (e.g., Press et al., 1992; Press et al., 1992a; Pelt et al., 1996) in disagreement until a sharp event in the light curves resolved the issue (Kundic et al., 1995; Kundic et al., 1997). The second phase of time delay measurements began in the mid-1990s, by which time tens of lens systems were known, and small-scale but dedicated lens monitoring programs were conducted. With the larger number of systems, there were a number of lenses for which the time delays were more conducive to a focused monitoring program, i.e., systems with time delays on the order of 10–150 days. Furthermore, advances in image processing techniques, notably the image deconvolution method developed by Magain et al. (1998), allowed optical monitoring of systems in which the image separation was small compared to the seeing. The monitoring programs, conducted at both optical and radio wavelengths, produced robust time delay measurements (e.g., Lovell et al., 1998; Biggs et al., 1999; Fassnacht et al., 1999; Fassnacht et al., 2002; Burud et al., 2002; Burud et al., 2002a), even using fairly simple analysis methods such as cross-correlation, maximum likelihood, or the “dispersion” method introduced by Pelt et al. (1994); Pelt et al. (1996). 
The third and current phase, which began roughly in the mid-2000s, has involved large and systematic monitoring programs that have taken advantage of the increasing amount of time available on 1–2 m class telescopes. Examples include the SMARTS program (e.g., Kochanek et al., 2006), the Liverpool Telescope robotic monitoring program (e.g., Goicoechea et al., 2008), and the COSMOGRAIL program (e.g., Eigenbrod et al., 2005). These programs have shown that it is possible to take an industrial-scale approach to lens monitoring and produce good time delays (e.g., Tewes et al., 2013; Eulaers et al., 2013; Rathna Kumar et al., 2013). The next phase, which has already begun, will be lens monitoring from new large-scale surveys that include time-domain information such as the Dark Energy Survey, PanSTARRS, and LSST.

Measured time delays constrain the time delay distance \[D_{\Delta t} = (1+z_l)\,\frac{d_l d_s}{d_{ls}},\] where \(z_l\) is the lens redshift, \(d_l\) is the angular diameter distance between observer and lens, \(d_s\) that between observer and source, and \(d_{ls}\) that between lens and source. Note that because of spacetime curvature the lens-source distance is not the difference between the other two. The time delay distance is inversely proportional to the Hubble constant \(H_0\), the current cosmic expansion rate that sets the scale of the universe, but the distances also involve the matter and dark energy densities, and the dark energy equation of state.
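As a rough illustration of how the time delay distance scales with \(H_0\), the sketch below evaluates it numerically. It assumes a flat \(\Lambda\)CDM cosmology, uses the conventional definition that includes the factor \((1+z_l)\), and the function names, the default \(H_0 = 70\) km/s/Mpc and \(\Omega_m = 0.3\) are illustrative choices rather than anything used in the challenge.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s


def comoving_distance(z1, z2, h0, omega_m, steps=1000):
    """Line-of-sight comoving distance in Mpc between z1 and z2.

    Assumes flat LCDM (matter + cosmological constant); trapezoid rule.
    """
    def inv_e(z):
        return 1.0 / math.sqrt(omega_m * (1.0 + z) ** 3 + (1.0 - omega_m))

    dz = (z2 - z1) / steps
    total = 0.5 * (inv_e(z1) + inv_e(z2))
    for i in range(1, steps):
        total += inv_e(z1 + i * dz)
    return (C_KM_S / h0) * total * dz


def time_delay_distance(z_l, z_s, h0=70.0, omega_m=0.3):
    """Time delay distance in Mpc from angular diameter distances (flat LCDM)."""
    d_l = comoving_distance(0.0, z_l, h0, omega_m) / (1.0 + z_l)
    d_s = comoving_distance(0.0, z_s, h0, omega_m) / (1.0 + z_s)
    # In a flat universe the lens-source angular diameter distance is the
    # comoving separation divided by (1 + z_s); it is not d_s - d_l.
    d_ls = comoving_distance(z_l, z_s, h0, omega_m) / (1.0 + z_s)
    return (1.0 + z_l) * d_l * d_s / d_ls


# Hypothetical lens at z_l = 0.5 with a source at z_s = 2.0.
ratio = time_delay_distance(0.5, 2.0, h0=70.0) / time_delay_distance(0.5, 2.0, h0=140.0)
```

Doubling \(H_0\) at fixed density parameters halves every distance, so `ratio` comes out as 2; changing `omega_m` instead alters the distances in a redshift-dependent way.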

The accuracy of \(D_{\Delta t}\) derived from the data for a given lens system depends on both the mass model for that system and the precision with which the lensing observables are measured. Typically, positions and fluxes (and occasionally shapes, if the source is resolved) of the images can be obtained to sub-percent accuracy (citation not found: COSMOGRAIL), but time delay accuracies are usually on the order of days, or a few percent, for typical systems (see, e.g., Tewes et al., 2013a). Measuring time delays requires continuous monitoring over months to years. However, wide area surveys only return to a given patch of sky every few nights, sources are only visible from a given point on the Earth for certain months of the year, and bad weather can lead to data gaps as well.

\label{sec:simulate}

Simulating the LSST observation of a multiply-imaged quasar involves four conceptual steps:

1. The quasar’s intrinsic light curve in a given optical band is generated at the accretion disk of the black hole in an active galactic nucleus (AGN).

2. The foreground lens galaxy causes multiple imaging, leading to 2 or 4 lensed light curves that are offset from the intrinsic light curve (and each other) in both amplitude (due to magnification) and time.

3. Time-dependent amplitude fluctuations due to microlensing by stars in the lens galaxy are generated *on top of* (and *independently* for) each light curve.

4. The delayed and microlensed light curves are sparsely, but simultaneously, “sampled” at the observational epochs, with the measurements adding noise.

In the next sections we describe the simulation of each of these steps in some detail during the generation of the challenge mock LSST light curve catalog.
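The four steps above can be sketched schematically as follows. This is only an illustration of how the pieces combine, not the challenge's simulation code: `intrinsic_mag` is a placeholder sinusoid standing in for a quasar model, and `observe_image` and all of its parameters are hypothetical names.

```python
import math
import random


def intrinsic_mag(t):
    """Placeholder intrinsic light curve in magnitudes (a sinusoid, NOT a quasar model)."""
    return 20.0 + 0.3 * math.sin(2.0 * math.pi * t / 200.0)


def observe_image(epochs, delay, magnification, noise_sigma=0.0, microlens=None, rng=None):
    """Mock magnitudes of one lensed image at the given epochs (days)."""
    rng = rng or random.Random()
    mags = []
    for t in epochs:
        m = intrinsic_mag(t - delay)                # steps 1-2: delayed copy of the source
        m -= 2.5 * math.log10(abs(magnification))   # step 2: magnification as a magnitude offset
        if microlens is not None:
            m += microlens(t)                       # step 3: independent per-image microlensing term
        m += rng.gauss(0.0, noise_sigma)            # step 4: photometric noise at each epoch
        mags.append(m)
    return mags


# Both images are sampled at the same epochs; image B lags image A by 30 days.
epochs = [0.0, 4.0, 9.0, 13.0]
image_a = observe_image(epochs, delay=0.0, magnification=3.0, noise_sigma=0.02)
image_b = observe_image(epochs, delay=30.0, magnification=1.5, noise_sigma=0.02)
```

Because both images share the observational epochs, an estimator never sees the same part of the intrinsic curve in both images at the same time, which is one reason sparse sampling complicates the measurement.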

\label{sec:car}

The optical light curves of quasars are generated by fluctuations in the brightness of the accretion disk with structure in the time series on the order of days [REFS]. Since these fluctuations are coherent, the implication is that the size of the accretion disk is roughly \(R_{\rm src} \sim 10^{16}\) cm (which will be important for the microlensing calculation in §\ref{sec:microlensing}). These fluctuations have been shown to be well described by a Continuous Auto Regressive (CAR) process. First described by [REFS], the CAR process is a damped random walk and is equivalent to a Gaussian Process in which the covariance between two points on the light curve decreases exponentially with their temporal separation. Using data from the MACHO survey [REF], [REF] fit a CAR process to **(Greg: how many?)** r-band MACHO quasar light curves. The CAR process is given by [see Appendix in REFS], \[M(t) = e^{-t/\tau} M(0) + \bar{M}(1-e^{-t/\tau}) + \sigma\int_{0}^{t} e^{-(t-s)/\tau} dB(s),\] where \(M\) is the magnitude of an image, \(\tau\) is a characteristic timescale in days, \(\bar{M}\) is the mean magnitude of the light curve in the absence of fluctuations, and \(\sigma\) is the characteristic amplitude of the fluctuations in mag/day\(^{1/2}\). In this model, fluctuations are generated by the integral term, where \(dB(s)\) is a normally distributed value with mean zero and variance \(dt\). By fitting the above model to the data, [REF] generated a distribution of \(\tau\) and \(\sigma\) for the MACHO quasars; we show typical examples of the CAR process with reasonable values for those parameters in Figure \ref{fig:example_lcs}.
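For readers who want to experiment, the CAR equation above can be stepped forward exactly between arbitrary epochs, since over an interval \(\Delta t\) the integral term is Gaussian with variance \(\sigma^2 \tau (1 - e^{-2\Delta t/\tau})/2\). The following is a minimal sketch under that observation; the function name and the parameter values in the usage lines are illustrative assumptions, not the values drawn for the challenge.

```python
import math
import random


def simulate_car(times, tau, sigma, mean_mag, m0=None, rng=None):
    """Sample a CAR (damped random walk) light curve at the given times.

    times    : increasing, non-empty sequence of epochs in days
    tau      : characteristic timescale in days
    sigma    : fluctuation amplitude in mag / day**0.5
    mean_mag : mean magnitude of the light curve
    m0       : magnitude at times[0] (defaults to mean_mag)
    """
    rng = rng or random.Random()
    mags = [mean_mag if m0 is None else m0]
    for t_prev, t_next in zip(times, times[1:]):
        dt = t_next - t_prev
        decay = math.exp(-dt / tau)
        # Standard deviation of the stochastic integral over one step.
        step_std = sigma * math.sqrt(0.5 * tau * (1.0 - decay ** 2))
        mags.append(decay * mags[-1] + mean_mag * (1.0 - decay) + rng.gauss(0.0, step_std))
    return mags


# Illustrative (assumed) parameters: daily sampling for 1000 days.
times = [float(day) for day in range(1000)]
light_curve = simulate_car(times, tau=200.0, sigma=0.02, mean_mag=20.0, rng=random.Random(1))
```

Because each step is exact, the same function can be used directly on irregular survey epochs without first generating a finely sampled curve.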

**(Greg: can you fill in the REFS please?)**

While the damped random walk process provides a good description of the data obtained so far, it is not yet clear whether it will remain so for longer baseline, higher cadence, or multi-filter light curves. The different emission regions of an AGN (different parts of the accretion disk, broad and narrow line clouds, etc.) are likely to vary in different ways, suggesting that sums of stochastic processes could provide more accurate descriptions (citation not found: KellyMultipleDRWs). These subcomponents would likely need parameters drawn from different distributions to the one above, and the correlations between the processes may need to be taken into account as well. Nevertheless, the success of the CAR model to date makes it a reasonable place to begin when simulating LSST-like AGN light curves.

\label{sec:time_delay_dist}

For a given lens system, the time delays between images can range from as short as \(\sim\)1 day for close pairs of images to as long as \(\sim\)100s of days for images on opposite sides of the lensing galaxy. The magnitude of these time delays (as well as the other observables) depends on the redshifts of both the lens galaxy, \(z_l\), and the source, \(z_s\), and therefore it is important to understand the expected distribution of those parameters in the LSST sample. (citation not found: OM10) generated a mock catalog of LSST lensed AGN based on plausible models for the source quasars and lens galaxies, and simple assumptions for the detectability of lensed quasars, including published 10\(\sigma\) limiting magnitude estimates, and the assumption that lenses will be detected if the third (second) brightest image for a given quad (double) is above this limit. This catalog provides a distribution of time delays that will be present in the LSST data, which we can use to guide generation of mock light curves.

Figure \ref{fig:OM10dt} shows the \(\log_{10} \Delta t\) distributions for the OM10 double and quad sample. The distributions are roughly log-normal with means \(\sim\)10s of days and tails extending below 1 day for the quads, and above 100 days for the doubles. Lenses in both of these tails will have time delays that are difficult to measure, either because the cadence isn’t high enough, or because the observing seasons are not long enough. We expect some fraction of time delay measurements to fail catastrophically in these cases, but we also expect the catastrophe rate (and the robustness with which failure is reported) to vary with measurement algorithm.

\label{sec:microlensing}

**Greg: can you please fill in the references in this section please? Thanks!**

As noted in §\ref{sec:car}, the physical size of a quasar accretion disk is \(R_{\rm src} \sim 10^{15}\)-\(10^{16}\) cm, which, at cosmological distances, represents an angular size of \(\sim1\) \(\mu\)arcsecond (\(\mu\)as). In addition, the Einstein radius for a 1 \(M_{\odot}\) point mass at these distances is also \(\sim1\) \(\mu\)as, indicating that the stars in the lens galaxy will typically have an order unity (or more) effect on the brightnesses of the individual images. Given the relevant angular scales, this phenomenon is termed “microlensing”.
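The two angular scales quoted above can be checked with a few lines of arithmetic. The sketch below assumes round-number angular diameter distances of order a gigaparsec (1.0, 1.7 and 1.2 Gpc for lens, source and lens-source); these values are illustrative assumptions, not taken from the OM10 catalog.

```python
import math

G = 6.674e-11       # gravitational constant, m^3 kg^-1 s^-2
C = 2.998e8         # speed of light, m/s
M_SUN = 1.989e30    # solar mass, kg
GPC = 3.086e25      # one gigaparsec, m
RAD_TO_MUAS = 180.0 / math.pi * 3600.0 * 1.0e6  # radians to micro-arcseconds


def einstein_radius_muas(mass_kg, d_l, d_s, d_ls):
    """Einstein radius of a point mass, in micro-arcseconds (distances in m)."""
    theta = math.sqrt(4.0 * G * mass_kg / C ** 2 * d_ls / (d_l * d_s))
    return theta * RAD_TO_MUAS


def angular_size_muas(size_m, distance_m):
    """Small-angle angular size in micro-arcseconds."""
    return size_m / distance_m * RAD_TO_MUAS


# ASSUMED angular diameter distances, chosen only to be cosmological in scale.
D_L, D_S, D_LS = 1.0 * GPC, 1.7 * GPC, 1.2 * GPC

theta_e = einstein_radius_muas(M_SUN, D_L, D_S, D_LS)
theta_src = angular_size_muas(1.0e14, D_S)  # R_src = 1e16 cm = 1e14 m
```

With these inputs both `theta_e` and `theta_src` are of order a micro-arcsecond, which is why stars in the lens galaxy can change the image brightnesses by a factor of order unity.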

Microlensing has long been acknowledged as a significant source of potential error when estimating time delays from optical monitoring data (see e.g. Tewes et al., 2013, and references therein) due to the fact that the relative velocity between the source and lens leads to time dependent fluctuations that are independent between the images. **(GGD: put a microlensing figure here before we reference caustics.)** For caustic crossing events the relevant time scales are months to years, with smoother variations occurring over roughly decade timescales. As expected, the microlensing fluctuations are larger at bluer wavelengths, which correspond to smaller source sizes. The solution to measuring time delays in the presence of these fluctuations (which are uncorrelated between the quasar images) is to model the microlensing in each image individually at the same time as inferring the time delay (e.g. Kochanek, 2004; Tewes et al., 2013a).

We create mock microlensing signals in each quasar image light curve by calculating the magnification as the source moves behind a static stellar field. The parameters involved are the local convergence \(\kappa\) and shear \(\gamma\), the fraction of surface density in stars \(f_{\star}\), the source size \(R_{\rm src}\), and the relative velocity between the quasar and the lens galaxy \(v_{\rm rel}\). We also include a Salpeter mass function for the stars though the amplitude of the fluctuations depends predominantly on the mean mass (which we take to be 1 \(M_{\odot}\)).^{1}

For each lens in the OM10 catalog we assign an \(f_{\star}\) at each image position as follows. The OM10 catalog provides the velocity dispersion for a given lens, which we use to estimate the i-band luminosity and effective radius of the galaxy by drawing from the Fundamental Plane [REF]. Assuming a standard (citation not found: deVaucoleurs1948) profile for the brightness distribution centered on the lens, an isothermal ellipsoid for the total mass distribution, and a mass-to-light ratio of **GGD: what???**, \(f_{\star}\) is the ratio of stellar mass density to total mass density at each image position.

Given \(\kappa\) and \(\gamma\) from the OM10 catalog and estimating \(f_{\star}\) as above, we generate magnification maps like the one shown in **(GGD: again, a figure would be nice)**, which represent the magnification of a point source as a function of position in the source plane. To use this map to generate temporal microlensing fluctuations we first smooth it by a Gaussian source profile with a size of **(GGD: Kai, what are you using???)** and then trace a linear path along a random direction in the map. This path is converted from source plane position to time units via \(v_{\rm rel}\) [REF]. The effect of having a finite source is to smooth out and reduce the amplitude of the microlensing fluctuations.

^{1} The microlensing code used in this work, MULES, is freely available at [URL].

\label{sec:sampling}

The current state-of-the-art lens monitoring campaign, COSMOGRAIL, typically visits each of its targets every few nights during each of several observing seasons, each lasting many months. For example, Tewes et al. (2013) present 9 seasons of monitoring for the lensed quasar RXJ1131\(-\)1231 where the mean season length was 7.7 months (\(\pm 2\) weeks) and the median cadence was 3 days. These observations were taken in the same R-band filter, with considerable attention paid to photometric calibration and PSF estimation based on the surrounding star field. These data allowed Tewes et al. (2013) to measure a time delay of 91 days to 1.5% precision.

While this quality of measurement is possible for small samples (a few tens) of lenses, the larger sample of lensed quasars lying in the LSST survey footprint will all be monitored over the course of its ten year campaign, but at lower cadence and with shorter seasons. In the simplest possible “universal cadence” observing strategy, we would expect the mean cadence to be around 4 days between visits, in any filter, and with some variation with time as the scheduler responds to the needs of the various science programs and the changing conditions; the gaps between observations in the same filter will tend to be longer (citation not found: LSSTpaper) (citation not found: LSSTSciBook). The season length in this strategy is likely to be approximately 4 months (with variation among filters), in order to keep the telescope pointing at low airmass (see example in Figure \ref{fig:sample}). The primary impact of the shorter season length will be to make it hard to measure time delays of more than 100 days; the LSST universal cadence time delay lens sample would be biased towards delays shorter than this.
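A toy version of such a sampling pattern can be generated as below. This is a sketch only: the default ten-year campaign, 4-month (120-day) seasons and 4-day mean cadence follow the numbers quoted above, while the uniform jitter model, the half-day minimum gap and the function name are assumptions made for illustration and do not represent the LSST scheduler.

```python
import random


def sample_epochs(n_years=10, season_days=120.0, cadence_days=4.0, jitter_days=1.0, rng=None):
    """Return increasing observation epochs (days) with one season per year."""
    rng = rng or random.Random()
    epochs = []
    for year in range(n_years):
        season_start = 365.25 * year
        t = season_start
        while t < season_start + season_days:
            epochs.append(t)
            # Next visit: nominal cadence plus uniform jitter, never less than half a day.
            t += max(0.5, cadence_days + rng.uniform(-jitter_days, jitter_days))
    return epochs


lsst_like = sample_epochs(rng=random.Random(42))
# A denser, longer-season pattern, loosely in the spirit of a dedicated campaign.
dedicated = sample_epochs(n_years=9, season_days=230.0, cadence_days=3.0, rng=random.Random(42))
```

Gaps between seasons are left empty, so any delay comparable to or longer than `season_days` leaves little overlap between the shifted light curves.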

The universal cadence strategy may not turn out to be optimal, and we can explore various LSST observing strategies by simulating light curves with a range of cadences and season lengths. The shorter cadences and longer seasons are closer to those obtained by COSMOGRAIL and blind analysis of those datasets will provide understanding of the accuracy available to that program as its lens sample increases. We note that only if all filters’ light curves can be fitted simultaneously with a model for the multi-filter variability would the maximum, any-filter cadence be fully exploited – but that even if this is not possible, the dithered nature of the different filters’ light curves should still allow a time resolution *approaching* that of the any-filter cadence.

The remaining variables in the mock lightcurve generation pertain to the photometric uncertainties applied to the observed fluxes. Tewes et al. (2013a) provide a summary of possible sources of uncertainty and error in the photometric measurements, and we follow this in generating lightcurves with realistic uncertainties, including in the accuracy of the error reporting. The OM10 mock lens sample contains a variety of quasar image brightnesses, allowing us to investigate time delay accuracy as a function of signal to noise, or for LSST, source magnitude.

\label{sec:structure}

This section outlines the two initial steps of the challenge, gives the instructions for participation and timeline, and defines the goal of the challenge and the criteria for evaluation.

\label{ssec:steps}

The initial challenge consists of two steps, hereafter time-delay challenge 0 and 1 (TDC0 and TDC1). Each time delay challenge is organized as a ladder with a number of simulated light curves at each rung. The rungs are intended to represent increasing levels of difficulty and realism within each challenge. The simulated light curves were created by the “evil team” (authors GD, CDF, PJM, TT, NR, and KL). All the details about the light curves, including input parameters, noise properties, etc., will only be revealed to the teams participating in the challenge (hereafter “good teams”) after the closing of the challenge.

TDC0 consists of a small number of simulated light curves with fairly basic properties in terms of noise, sampling season, cadence. It is intended to serve as a validation tool before embarking on TDC1. The evil team expects that state of the art algorithms should be able to process TDC0 with minimal computing time and recover the input time delays within the estimated uncertainties. TDC0 also provides a means to perform basic debugging and test input and output formats for the challenge. Good teams are required to successfully meet the TDC0 criteria before embarking on TDC1. The outcome of TDC0 will be a pass/fail response granting access to TDC1.

TDC1 is the actual challenge. It consists of thousands of sets of simulated light curves, also arranged in rungs of increasing difficulty and realism. The large data volume is chosen to simulate the demands of an LSST-like experiment, but also to be able to detect biases in the algorithms at the sub-percent level. The evil team expects that processing the TDC1 dataset will be challenging with current algorithms in terms of computing resources. TDC1 thus represents a test of the accuracy of the algorithms but also of their efficiency. Incomplete submissions will be accepted, although the number of processed light curves is one of the metrics by which algorithms are evaluated, as described below.

The mock data generated for the highest rungs of the initial challenge ladders TDC0 and TDC1 are as realistic as our current simulation technology allows, but lower rungs are somewhat simplified. This design is based on the successful weak lensing STEP (citation not found: STEP1) (citation not found: STEP2) and GREAT (citation not found: GREAT08) (citation not found: GREAT10Stars) (citation not found: GREAT10Galaxies) shape estimation challenges, where the former tried to be as realistic as possible, while the latter focused on specific aspects of the problem. Still, following a successful outcome of TDC0 and TDC1 we anticipate in the future further increasing the complexity of the simulations so as to stimulate gradual improvements in the algorithms over the remainder of this decade. Of course our approach of testing on simulated data is very complementary to tests on real data. The former allow one to test blindly for accuracy but they are valid only insofar as the simulations are realistic, while the latter provide a valuable test of consistency on actual data, including all the unknown unknowns.

\label{ssec:instruction}

Instructions for how to access the simulated light curves in the time delay challenge are given at this website http://darkenergysciencecollaboration.github.io/SLTimeDelayChallenge/. In short, participation in the challenge requires the following steps.

Every prospective good team is invited to download the TDC0 light curves and analyze them. Upon completion of the analysis, they will submit their time delay estimates, together with their estimated 68% uncertainties, to the challenge organisers for analysis. The simulation team will calculate a minimum of four standard metrics given this set of estimated time delays \(\tilde{\Delta t}\) and uncertainties \(\sigma\). The first one is efficiency, quantified as the fraction of light curves \(f\) for which an estimate is obtained. Of course, this is not a sufficient requirement for success, as the estimate should also be accurate and have correct uncertainties. There might be cases when the data are ambiguous (for example in case the time delay falls into season gaps) and for those some methods will indicate failure while others will estimate very large uncertainties.

We therefore need a second metric to evaluate how realistic the error estimates are: the goodness of fit of the estimates, quantified by the standard reduced \(\chi^2\), \[\chi^2=\frac{1}{fN}\sum_i \left(\frac{\tilde{\Delta t}_i - \Delta t_i}{\sigma_i}\right)^2,\] where \(N\) is the total number of light curves (so that \(fN\) is the number for which estimates were submitted) and \(\Delta t_i\) is the true time delay.

The third metric is the precision of the estimator, quantified by the average relative uncertainty per lens: \[P=\frac{1}{fN}\sum_i \left(\frac{\sigma_i}{|\Delta t_i|}\right).\]

The fourth is the accuracy of the estimator, quantified by the average fractional residual per lens: \[A=\frac{1}{fN} \sum_i \frac{\tilde{\Delta t}_i - \Delta t_i}{|\Delta t_i|}.\]

**PJM: I don’t think we can take the absolute value of the residual, otherwise we won’t average down the statistical (non-systematic) fluctuations in \(\Delta t_i\). Do you agree with my corrected version?**

The final metric of our minimal set is the fraction of systems for which a cosmologically useful estimate is obtained. This fraction will depend not just on the algorithms but also on the actual time delays and the quality of the simulated data. We define \(g\) as the fraction of objects that satisfy the individual time delay precision condition \(\sigma_i/|\tilde{\Delta t}_i|<0.05\).
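The metrics defined above are simple to evaluate once the true delays are revealed. The sketch below is one possible implementation, not the organisers' evaluation code: the function name and input layout are hypothetical, a missing submission is represented by `None`, and \(g\) is taken as a fraction of all \(N\) systems, which is an assumption since the text does not fix the denominator.

```python
def tdc_metrics(true_delays, estimates):
    """Compute (f, chi2, P, A, g) for one submission.

    true_delays : list of true time delays in days
    estimates   : list of (estimated_delay, sigma) tuples, or None where
                  no estimate was submitted for that system
    """
    n_total = len(true_delays)
    pairs = [(dt, est) for dt, est in zip(true_delays, estimates) if est is not None]
    n_fit = len(pairs)
    if n_fit == 0:
        raise ValueError("no estimates submitted")

    f = n_fit / n_total                                                   # efficiency
    chi2 = sum(((est - dt) / sig) ** 2 for dt, (est, sig) in pairs) / n_fit  # goodness of fit
    precision = sum(sig / abs(dt) for dt, (est, sig) in pairs) / n_fit       # P
    accuracy = sum((est - dt) / abs(dt) for dt, (est, sig) in pairs) / n_fit  # A (signed)
    # ASSUMPTION: g is normalised by all N systems, not only the fitted ones.
    g = sum(1 for dt, (est, sig) in pairs if sig / abs(est) < 0.05) / n_total
    return f, chi2, precision, accuracy, g


# Made-up numbers purely to exercise the function.
truth = [10.0, 20.0, 40.0, 80.0]
submitted = [(11.0, 1.0), (19.0, 1.0), None, (82.0, 2.0)]
f, chi2, precision, accuracy, g = tdc_metrics(truth, submitted)
```

Keeping \(A\) signed means that residuals of opposite sign cancel, so it measures bias rather than scatter.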

The initial function of these metrics is to define a minimal performance threshold that must be passed, in order to guarantee meaningful results in TDC1. To pass TDC0, an analysis team’s results must satisfy the following criteria.

\(f>0.3\)

\(0.5<\chi^2<2\)

\(P<15\%\)

\(A<15\%\)
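The four thresholds above translate directly into a pass/fail check. The helper below is a hypothetical sketch; in particular it interprets the accuracy criterion as a bound on \(|A|\), which is an assumption, since \(A\) as defined can be negative.

```python
def check_tdc0(f, chi2, precision, accuracy):
    """Return per-criterion results and the overall TDC0 pass/fail outcome."""
    checks = {
        "efficiency": f > 0.3,
        "goodness_of_fit": 0.5 < chi2 < 2.0,
        "precision": precision < 0.15,
        # ASSUMPTION: the accuracy threshold is applied to the absolute value of A.
        "accuracy": abs(accuracy) < 0.15,
    }
    return checks, all(checks.values())


checks, passed = check_tdc0(f=0.75, chi2=1.0, precision=0.06, accuracy=0.025)
```

Reporting each criterion separately makes it clear which aspect of a submission needs attention when the overall result is a fail.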

**[EL: Why the lower bound on \(\chi^2\)? If Good Team fits extremely accurately, but puts an extra “systematic” error in to account for uncertainties, why penalize? This actually happens with our DRW fits where we sometimes get errors of 0.04 days but we never believe this accuracy and might inflate it to 0.4 days. This should be fine, especially seeing my note below about only counting in \(f\) those systems with apparent precision within 5%.]** [**TT: I think that the lower bound on \(\chi^2\) is needed because overestimating errors is not good either. If we think errors are too large we might overlook some valuable system.**]

A failure rate of 70% is roughly the borderline of acceptability for LSST (given the total number of lenses expected), and so can be used to define the efficiency threshold. The TDC0 lenses will be selected to span the range of possible time delays, rather than being sampled from the OM10 distribution, so we expect a higher rate of catastrophic failure at this stage than in TDC1: 30% successes is a minimal bar to clear.

**[EL: see my previous remarks about not wanting \(f=1\) but rather that \(f\) should take the value of the fraction of systems that could legitimately be fit given season coverage. One should penalize \(f\) greater than this value. Also, Alireza and I use ratings (gold, silver, brass) to indicate a degree of confidence; this is useful since systems will need spectroscopic follow-up and we shouldn’t waste telescope time on brass systems. So a low \(f\) is not automatically bad. One could allow Good Teams to submit one entry for their gold+silver systems, say, and one entry for all their systems, and not penalize the former due to low \(f\) as long as \(fN>100\) when \(N\ge1000\), say, if that’s what we think is realistic for followup.]** [**TT: that’s a good point and a matter of philosophy to some extent. In the scenario you describe one could imagine that failure means a very large uncertainty, so that your brass systems would have very large uncertainties and not be used. I am fine lowering the threshold considering that some systems might indeed not be measurable if there are too many gaps. So I lowered it to \(f>0.3\)**].

The factor-of-two bounds on the reduced \(\chi^2\) correspond approximately to the range containing 95% of the \(\chi^2\) probability distribution when \(N=8\), i.e. the number of pairs in every rung of TDC0: fits outside this range likely have problems with the time delay estimates, the estimation of their uncertainties, or both.

**[EL: I didn’t follow this. If fits are \(2\sigma\) away then each contributes \(\chi^2=4\) not 2.] TT: it’s 2-\(\sigma\) on the distribution of \(\chi^2\) given that you are summing over 8 estimates per ladder. I hope this version is clearer.**
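To make the dependence on sample size concrete, the central interval of the reduced \(\chi^2\) distribution can be estimated with the Wilson-Hilferty approximation. This is a swapped-in approximation chosen here because it needs only the Python standard library; it is not necessarily how the thresholds in the text were derived. The function name is hypothetical, and the number of degrees of freedom is assumed to equal the number of contributing estimates, \(fN\).

```python
from statistics import NormalDist
from typing import Tuple


def reduced_chi2_interval(n_dof: int, coverage: float = 0.95) -> Tuple[float, float]:
    """Approximate central interval of chi^2 / n_dof (Wilson-Hilferty approximation)."""
    if n_dof < 1:
        raise ValueError("n_dof must be a positive integer")
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must lie strictly between 0 and 1")
    # Two-sided Gaussian quantile, e.g. about 1.96 for 95% coverage.
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    c = 2.0 / (9.0 * n_dof)
    # Wilson-Hilferty: (chi^2 / k)^(1/3) is roughly Gaussian with mean 1 - c and variance c.
    lower_base = max(0.0, 1.0 - c - z * c ** 0.5)
    upper_base = 1.0 - c + z * c ** 0.5
    return lower_base ** 3, upper_base ** 3


for n in (8, 1000):
    low, high = reduced_chi2_interval(n)
    print(f"N = {n}: {low:.2f} < reduced chi^2 < {high:.2f}")
```

Under this approximation the 95% interval is roughly 0.27 to 2.19 for 8 estimates, and it narrows to roughly 0.91 to 1.09 for 1000 estimates.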

Requiring an average precision and accuracy of better than 15% is a further minimal bar to clear; in Section \ref{structure} we will describe the targets for TDC1.

Repeat submissions will be accepted as teams iterate their analyses on the lower rungs of TDC0. The final rung will remain blinded until after the nominal deadline of December 1 2013, when initial qualifiers for TDC1 will be announced and the TDC1 data released. Late submissions will be accepted, but the teams will then have less time to carry out TDC1.

Good teams that successfully pass TDC0 will be given access to the full TDC1. As in TDC0, the good teams will estimate time delays and uncertainties and provide the answers to the evil team via a suitable web interface (to be found at the challenge website). The evil team will compute the metrics described above. The results will not be revealed until the end of the challenge in order to maintain blindness.

The deadline for TDC1 is 1st July 2014, i.e. six months after TDC0. Multiple submissions are accepted from each team in order to allow for correction of bugs, and for different algorithms. However, only the most recent submission for each algorithm will be considered in order to avoid favoring teams with multiple submissions. Late submissions will be accepted and included in the final publication if received in time but will be flagged as such.

Initially this first paper will only be posted on the arXiv as a means to open the challenge. After the deadline, the full details of TDC0 and TDC1 will be revealed by adding an appendix to this paper. At the same time, the results of TDC1 will be described in the second paper of this series, including as co-authors all the members of the good teams who participated in the challenge. The two papers will be submitted concurrently so as to allow the referee to evaluate the entire process.

The overall goal of TDC0 and TDC1 is to carry out a blind test of current state-of-the-art time-delay estimation algorithms in order to quantify the available accuracy. Criteria for success depend on the time horizon. At present, time-delay cosmology is limited by the number of lenses with measured light curves and by the modeling uncertainties, which are of order 5% per system. Furthermore, distance measurements are currently accurate to about 3%. Therefore, any method that can currently provide time delays with realistic uncertainties (\(\chi^2<1.5\)) for the majority (\(f>0.5\)) of light curves with accuracy \(A\) and precision \(P\) better than 3% can be considered a competitive method.

In the longer run, with LSST in mind, a desirable goal is to maintain an average precision of \(P<3\%\) per lens, but to improve the average accuracy to \(A < 0.2\%\) per lens in order for the sub-percent precision cosmological parameter estimates not to be limited by time-delay measurement systematics. For \(N=1000\), the 95% goodness of fit requirement becomes \(\chi^2 < 1.09\) for the reduced statistic (equivalently, an unnormalized sum below \(1.09 fN\)), while keeping \(f>0.5\). Testing for such extreme accuracy requires a large sample of lenses: TDC1 will contain several thousand simulated systems to enable such tests. **[EL: I didn’t follow either argument. The random component of the accuracy should be of the same order as the precision; only the systematic component (which may not be addressed by these TDCs) should be less. We certainly wouldn’t count strong lens distances a failure if it achieved 1% distance accuracy averaged over all systems in a redshift bin. Systematics are likely to be dominated by lens or line of sight mass modeling rather than time delay estimation.] TT: The point is to make sure that the methods are not biased, thus the requirement on A.**
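A rough way to see why a large sample is needed, assuming unbiased, independent Gaussian errors with a typical relative uncertainty \(\sigma_i/|\Delta t_i| \approx P\) (an idealization introduced here, not a property of the simulated data): the statistical scatter of the mean fractional residual is \[\sigma_A \approx \frac{1}{fN}\sqrt{\sum_i \frac{\sigma_i^2}{\Delta t_i^2}} \approx \frac{P}{\sqrt{fN}}.\] For \(P=3\%\) and \(fN=500\) (i.e. \(f=0.5\) and \(N=1000\)) this gives \(\sigma_A \approx 0.13\%\), so a bias at the \(0.2\%\) level only begins to stand out from the random scatter for samples of order \(10^3\) lenses.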
