TDC1 is the actual challenge. It consists of thousands of sets of simulated light curves, also arranged in rungs of increasing difficulty and realism. The large data volume is chosen to simulate the demands of an LSST-like experiment, but also to make it possible to detect biases in the algorithms at the sub-percent level. The evil team expects that processing the TDC1 dataset will be challenging for current algorithms in terms of computing resources. TDC1 thus represents a test of the accuracy of the algorithms, but also of their efficiency. Incomplete submissions will be accepted, although the number of processed light curves is one of the metrics by which algorithms are evaluated, as described below. As in other fields of astronomy (cite STEP, GREAT08, GREAT10, etc.), the initial challenges TDC0 and TDC1 are relatively idealized. After the successful outcome of this first challenge, we expect in the future to increase the complexity of the simulations so as to stimulate gradual improvements in the algorithms over the remainder of this decade. Of course, our approach of testing on simulated data is very complementary to tests on real data. The former allow one to test blindly for accuracy, but they are valid only insofar as the simulations are realistic, while the latter provide a valuable test of consistency on actual data, including all the unknown unknowns.

\subsection{Instructions for participation, timeline, and ranking criteria}
\label{ssec:instruction}

\subsubsection{TDC0}

Every prospective good team is invited to download the $N$ TDC0 pairs of light curves and analyze them. Upon completion of the analysis, the time delay estimates, together with the estimated 68\% uncertainties, will be uploaded to a designated web site. The simulation team will calculate four standard metrics given a set of estimated time delays $\tilde{\Delta t}$ and uncertainties $\sigma$. The first one is efficiency, quantified as the fraction of light curves $f$ for which an estimate is obtained. Of course, this is not a sufficient requirement, as the estimate should also be accurate and have correct uncertainties. There might be instances when the data are ambiguous (for example, when the time delay falls into season gaps), and for those some methods will indicate failure while others will estimate very large uncertainties. Therefore we need to introduce a second metric to evaluate how realistic the error estimates are. This is achieved with the second metric, the goodness of fit of the estimates, quantified by the standard $\chi^2$
\begin{equation}
\chi^2=\sum_i \left(\frac{\tilde{\Delta t}_i - \Delta t_i}{\sigma_i}\right)^2
\end{equation}
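To make the first two metrics concrete, here is a minimal Python sketch of how a submission could be scored against the (blinded) truth. The function name, the array-based interface, and the use of NaN to flag systems without an estimate are illustrative assumptions, not part of the challenge specification.
\begin{verbatim}
import numpy as np

def efficiency_and_chi2(dt_est, sigma, dt_true):
    """First two TDC0 metrics for one submission.

    dt_est, sigma : submitted time-delay estimates and 68% uncertainties,
                    with np.nan marking systems the method declined to fit
    dt_true       : the true (blinded) time delays, same length
    """
    dt_est = np.asarray(dt_est, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    dt_true = np.asarray(dt_true, dtype=float)

    fitted = np.isfinite(dt_est) & np.isfinite(sigma)
    f = fitted.sum() / dt_est.size   # efficiency: fraction with an estimate

    # goodness of fit, summed over the fitted systems only
    chi2 = np.sum(((dt_est[fitted] - dt_true[fitted]) / sigma[fitted]) ** 2)
    return f, chi2
\end{verbatim}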

The third is the precision of the estimator, quantified by the average relative uncertainty
\begin{equation}
P=\frac{1}{fN}\sum_i \left(\frac{\sigma_i}{|\Delta t_i|}\right)
\end{equation}
{\bf [EL: I put in an absolute value -- if you always define $\Delta t>0$ this gives away information that should be blinded.]}
The fourth is the accuracy of the estimator, quantified by the average fractional residuals

\item $A<0.15$
\end{enumerate}

{\bf [EL: Why the lower bound on $\chi^2$? If Good Team fits extremely accurately, but puts in an extra ``systematic'' error to account for uncertainties, why penalize? This actually happens with our DRW fits, where we sometimes get errors of 0.04 days but we never believe this accuracy and might inflate it to 0.4 days. This should be fine, especially seeing my note below about only counting in $f$ those systems with apparent precision within 5\%.]}

A failure rate of 50\% is roughly the borderline of acceptability for LSST, and so can be used to define the robustness threshold. The TDC0 lenses will be selected to span the range of possible time delays, rather than being sampled from the OM10 distribution, and we therefore expect a higher rate of catastrophic failure at this stage than in TDC1: 50\% is a minimal bar to clear.

{\bf [EL: See my previous remarks about not wanting $f=1$, but rather that $f$ should take the value of the fraction of systems that could legitimately be fit given season coverage. One should penalize $f$ greater than this value. Also, Alireza and I use ratings (gold, silver, brass) to indicate a degree of confidence; this is useful since systems will need spectroscopic followup and we shouldn't waste telescope time on brass systems. So a low $f$ is not automatically bad. One could allow Good Teams to submit one entry for their gold+silver systems, say, and one entry for all their systems, and not penalize the former due to low $f$ as long as $fN>100$ when $N\ge1000$, say, if that's what we think is realistic for followup.]}

The factor of two in reduced chi-squared corresponds approximately to fits that are two sigma away from being acceptable when $N=8$: such fits likely have problems with the time delay estimates, or the estimation of their uncertainties, or both.
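For reference, this last statement follows from the Gaussian approximation to the distribution of the reduced $\chi^2$ with $N$ degrees of freedom, which has mean 1 and standard deviation $\sqrt{2/N}$; taking $N=8$,
\begin{equation}
\sigma_{\chi^2/N}=\sqrt{\frac{2}{N}}=\sqrt{\frac{2}{8}}=0.5,
\qquad
\frac{\chi^2/N - 1}{\sigma_{\chi^2/N}}=\frac{2-1}{0.5}=2 ,
\end{equation}
i.e.\ a reduced $\chi^2$ of 2 lies about two sigma above the expected value of unity.

The precision and accuracy metrics defined above could be evaluated with the same conventions as the earlier sketch. This is again only illustrative; in particular, the signed fractional-residual form adopted for $A$ (which allows positive and negative biases to cancel) is an assumption made here rather than part of the challenge definition.
\begin{verbatim}
import numpy as np

def precision_and_accuracy(dt_est, sigma, dt_true):
    """Third and fourth TDC0 metrics, same conventions as above."""
    dt_est = np.asarray(dt_est, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    dt_true = np.asarray(dt_true, dtype=float)

    fitted = np.isfinite(dt_est) & np.isfinite(sigma)
    fN = fitted.sum()

    # precision P: average claimed uncertainty relative to the true delay
    P = np.sum(sigma[fitted] / np.abs(dt_true[fitted])) / fN

    # accuracy A: average fractional residual of the estimates (assumed
    # signed, so that positive and negative biases can cancel)
    A = np.sum((dt_est[fitted] - dt_true[fitted]) / np.abs(dt_true[fitted])) / fN
    return P, A
\end{verbatim}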