\subsection{Steps of the Challenge}
\label{ssec:steps}

The initial challenge consists of two steps, hereafter time-delay challenge 0 and 1 (TDC0 and TDC1). Each time-delay challenge is organized as a ladder with a number of simulated light curves at each rung. The rungs are intended to represent increasing levels of difficulty and realism within each challenge. The simulated light curves were created by the ``evil team'' (GD, CDF, PJM, TT). All the details about the light curves, including input parameters, noise properties, etc., will be revealed to the teams participating in the challenge (hereafter ``good teams'') only after the closing of the challenge. One good team (Linder and Hojjati) beta-tested the first rung of TDC0 as a courtesy, but they were not made aware of any of the input parameters of the simulated light curves except for the input time delays.

TDC0 consists of a small number of simulated light curves with fairly basic properties in terms of noise, sampling, season coverage, and cadence. It is intended to serve as a validation tool before embarking on TDC1. The evil team expects that state-of-the-art algorithms should be able to process TDC0 with minimal computing time and recover the input time delays within the estimated uncertainties. TDC0 also provides a means to perform basic debugging and to test the input and output formats of the challenge. Good teams are required to successfully meet the TDC0 criteria before embarking on TDC1. The outcome of TDC0 will be a pass/fail response granting access to TDC1.

TDC1 is the actual challenge. It consists of thousands of sets of simulated light curves, also arranged in rungs of increasing difficulty and realism. The large data volume is chosen to simulate the demands of an LSST-like experiment, but also to allow biases in the algorithms to be detected at the sub-percent level. The evil team expects that processing the TDC1 dataset will be challenging for current algorithms in terms of computing resources. TDC1 thus represents a test of the accuracy of the algorithms, but also of their efficiency. Incomplete submissions will be accepted, although the number of processed light curves is one of the metrics by which algorithms are evaluated, as described below.

As in other fields of astronomy (cite STEP, GREAT08, GREAT10, etc.), the initial challenges TDC0 and TDC1 are relatively idealized. After the successful outcome of this first challenge we expect to increase the complexity of the simulations in the future, so as to stimulate gradual improvements in the algorithms over the remainder of this decade. {\bf [EL: Yet some teams have, and others could, test on real data. This leads to consistency, not accuracy, and is not blind, but should we mention this?]}

\subsection{Instructions for participation, timeline, and ranking criteria}
\label{ssec:instruction}

\subsubsection{TDC0}

Every prospective good team is invited to download the $N$ TDC0 pairs of light curves and analyze them. Upon completion of the analysis, the time-delay estimates, together with their estimated 68\% uncertainties, will be uploaded to a designated web site. The simulation team will calculate four standard metrics given a set of estimated time delays $\tilde{\Delta t}$ and uncertainties $\sigma$. The first one is robustness, quantified as the fraction of light curves $f$ for which an estimate is obtained. {\bf [EL: Merely having an estimate is not the key point; it has to be both a good and an appropriate estimate. As Alireza pointed out, for some cases (where the time delay falls into season gaps), the success of the method requires it to fail - giving an estimate, even if it turns out to be close, should be regarded as incorrect.]} The second one is the goodness of fit of the estimates, quantified by the standard $\chi^2$
\begin{equation}
\chi^2=\sum_i \left(\frac{\tilde{\Delta t}_i - \Delta t_i}{\sigma_i}\right)^2
\end{equation}
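As an illustrative sketch (not the official evaluation code) of how these first two metrics could be computed from a submission, suppose the estimates, uncertainties, and true delays are stored as arrays, with a missing estimate flagged as NaN:
\begin{verbatim}
# Illustrative sketch only: robustness f and goodness of fit chi^2.
# Assumes numpy arrays in which NaN marks a light curve with no estimate.
import numpy as np

def robustness_and_chi2(dt_est, sigma, dt_true):
    attempted = np.isfinite(dt_est) & np.isfinite(sigma)
    f = attempted.mean()                            # fraction with an estimate
    chi2 = np.sum(((dt_est[attempted] - dt_true[attempted])
                   / sigma[attempted]) ** 2)        # standard chi^2
    return f, chi2
\end{verbatim}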

The third metric is the precision of the estimator, quantified by the average relative uncertainty
\begin{equation}
P=\frac{1}{fN}\sum_i \left(\frac{\sigma_i}{|\Delta t_i|}\right)
\end{equation}
{\bf [EL: I put in an absolute value - if you always define $\Delta t>0$ this gives away information that should be blinded.]}
The fourth is the accuracy of the estimator, quantified by the average fractional residual
\begin{equation}
A=\frac{1}{fN} \sum_i \left|\frac{\tilde{\Delta t}_i - \Delta t_i}{\Delta t_i}\right|
\end{equation}
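Continuing the same hypothetical array conventions as the sketch above, the precision and accuracy follow directly from these definitions:
\begin{verbatim}
# Illustrative sketch only: precision P and accuracy A, averaged over the
# f*N light curves with a submitted estimate (NaN convention as above).
import numpy as np

def precision_and_accuracy(dt_est, sigma, dt_true):
    attempted = np.isfinite(dt_est) & np.isfinite(sigma)
    fN = attempted.sum()
    P = np.sum(sigma[attempted] / np.abs(dt_true[attempted])) / fN
    A = np.sum(np.abs((dt_est[attempted] - dt_true[attempted])
                      / dt_true[attempted])) / fN
    return P, A
\end{verbatim}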

The initial function of these metrics is to define a minimal performance threshold that must be passed in order to guarantee meaningful results in TDC1. To pass TDC0, an analysis team's results must satisfy the following criteria:
\begin{enumerate}
\item $f>0.5$
\item $0.5<\chi^2/(fN)<2$
\item $P<0.15$
\item $A<0.15$
\end{enumerate}
{\bf [EL: Why the lower bound on $\chi^2$? If a Good Team fits extremely accurately, but puts in an extra ``systematic'' error to account for uncertainties, why penalize it? This actually happens with our DRW fits, where we sometimes get errors of 0.04 days but we never believe this accuracy and might inflate it to 0.4 days. This should be fine, especially seeing my note below about only counting in $f$ those systems with apparent precision within 5\%.]}

A failure rate of 50\% is something like the borderline of acceptability for LSST, and so can be used to define the robustness threshold. The TDC0 lenses will be selected to span the range of possible time delays, rather than being sampled from the OM10 distribution, and we therefore expect a higher rate of catastrophic failure at this stage than in TDC1: 50\% is a minimal bar to clear. {\bf [EL: See my previous remarks about not wanting $f=1$; rather, $f$ should take the value of the fraction of systems that could legitimately be fit given season coverage. One should penalize $f$ greater than this value. Also, Alireza and I use ratings (gold, silver, brass) to indicate a degree of confidence; this is useful since systems will need spectroscopic followup and we shouldn't waste telescope time on brass systems. So a low $f$ is not automatically bad. One could allow Good Teams to submit one entry for their gold+silver systems, say, and one entry for all their systems, and not penalize the former due to low $f$ as long as $fN>100$ when $N\ge1000$, say, if that's what we think is realistic for followup.]}

The factor of two in reduced chi-squared corresponds approximately to fits that are two-sigma away from being acceptable when $N=8$ (see the sketch below): such fits likely have problems with the time-delay estimates, or the estimation of their uncertainties, or both. {\bf [EL: I didn't follow this. If fits are $2\sigma$ away then each contributes $\chi^2=4$, not 2.]} Requiring precision and accuracy of better than 15\% is a further minimal bar to clear; in \S~\ref{structure} we will describe the targets for TDC1. {\bf [EL: We actually care much more about the ``apparently precise'' systems than about all the systems. For time delays of 1-5 days, it will be almost impossible with LSST cadence to get 5\% precision. The cosmological leverage will then all come from long time delays of 30-100 days. So maybe we should specifically redefine $f$ as the fraction of systems fit to apparent precision $\sigma_i/\tilde{\Delta t}_i<0.05$ (note the Good Team measures both numerator and denominator, so it stays blind). In this case $f$ will generally be much less than 1, but should roughly represent the fraction of systems with time delays between 30 days and 120 days (or the season length). If you want, you could put in some trick systems where the time delay takes it to the next season, i.e.\ greater than 240 days.]}
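The factor of two can be sketched as follows (our reading, assuming the criterion is applied to the ensemble $\chi^2$ over the estimated delays rather than to each system individually): for $\nu$ degrees of freedom the $\chi^2$ distribution has mean $\nu$ and variance $2\nu$, so a two-sigma upward fluctuation of the reduced chi-squared corresponds to
\begin{equation}
\frac{\chi^2}{\nu} \lesssim 1 + 2\sqrt{\frac{2}{\nu}},
\end{equation}
which evaluates to 2 for $\nu=8$ and to 1.09 for $\nu=1000$, the value quoted below for TDC1. The excess is a property of the ensemble, not of each individual fit contributing $\chi^2=4$.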

Repeat submissions will be accepted as teams iterate their analyses on the lower rungs of TDC0. The final rung will remain blinded until after the nominal deadline of 1 November 2013, when initial qualifiers for TDC1 will be announced and the TDC1 data released. Late submissions will be accepted, but the teams will then have less time to carry out TDC1.

\subsubsection{TDC1}

For TDC1, the four metrics defined above will also be combined into a single summary statistic,
\begin{equation}
C=\left|\frac{\chi^2}{fN}-1\right|\frac{AP}{f}.
\end{equation}
{\bf [EL: Not sure where this combination comes from. A standard statistical measure is the mean squared error or risk: $R=\sum_i\sqrt{\sigma_i^2+(\tilde{\Delta t}_i-\Delta t_i)^2}=\sum_i \sigma_i\sqrt{1+\chi^2_i}$.]}
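For concreteness, a sketch of this combined statistic, together with the alternative risk statistic suggested in the note above (array conventions as in the earlier sketches; not the official evaluation code):
\begin{verbatim}
# Illustrative sketch only: combined statistic C and the alternative risk R
# proposed in the editorial note; NaN convention as in the earlier sketches.
import numpy as np

def combined_statistic(f, N, chi2, A, P):
    # C = |chi^2/(fN) - 1| * A * P / f
    return abs(chi2 / (f * N) - 1.0) * A * P / f

def risk(dt_est, sigma, dt_true):
    # R = sum_i sqrt(sigma_i^2 + (dt_est_i - dt_true_i)^2)
    attempted = np.isfinite(dt_est) & np.isfinite(sigma)
    return np.sum(np.sqrt(sigma[attempted] ** 2
                          + (dt_est[attempted] - dt_true[attempted]) ** 2))
\end{verbatim}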

The results will not be revealed until the end of the challenge, in order to maintain blindness. The deadline for TDC1 is 1 May 2014, i.e.\ six months after TDC0. Multiple submissions are accepted from each team, in order to allow for the correction of bugs and for different algorithms. However, only the most recent submission for each algorithm will be considered, in order to avoid favoring teams with multiple submissions. Late submissions will be accepted and included in the final publication if received in time, but will be flagged as such.

The overall goal of TDC0 and TDC1 is to carry out a blind test of current state-of-the-art time-delay estimation algorithms in order to quantify the available accuracy. Criteria for success depend on the time horizon. As discussed in the appendix, at present, time-delay cosmology is limited by the number of lenses with measured light curves and by the modeling uncertainties, which are of order 5\% per system. Furthermore, distance measurements currently reach accuracies of around 3\%. Therefore, any method that can provide time delays with realistic uncertainties ($\chi^2<1.5fN$) for the majority ($f>0.5$) of light curves, with accuracy $A$ and precision $P$ better than 3\%, can be considered a viable method (a sketch of this check is given below).

In the longer run, with LSST in mind, a desirable goal is to maintain $P<3\%$, but to improve the accuracy to $A<0.2\%$ in order for the cosmological parameter estimates not to be limited by time-delay measurement systematics. For $N=1000$, the 2-sigma goodness-of-fit requirement becomes $\chi^2 < 1.09 fN$, while keeping $f>0.5$. Testing for such extreme accuracy requires a large sample of lenses: TDC1 will contain several thousand simulated systems to enable such tests. {\bf [EL: I didn't follow either argument. The random component of the accuracy should be of the same order as the precision; only the systematic component (which may not be addressed by these TDCs) should be less. We certainly wouldn't count strong-lens distances a failure if they achieved 1\% distance accuracy averaged over all systems in a redshift bin. Systematics are likely to be dominated by lens or line-of-sight mass modeling rather than time-delay estimation.]}
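The present-day viability criterion quoted above amounts to a simple set of cuts; the following is a minimal sketch (the thresholds are the values stated in the text; the function and argument names are illustrative only):
\begin{verbatim}
# Illustrative sketch only: present-day viability check using the thresholds
# stated in the text.
def is_viable(f, N, chi2, A, P):
    return (f > 0.5                 # majority of light curves measured
            and chi2 < 1.5 * f * N  # realistic uncertainties
            and A < 0.03            # accuracy better than 3%
            and P < 0.03)           # precision better than 3%
\end{verbatim}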