The initial function of these metrics is to define a minimal performance threshold that must be passed in order to guarantee meaningful results in TDC1. To pass TDC0, an analysis team's results must satisfy the following criteria (a schematic version of this check is sketched in code below):
\begin{enumerate}
  \item $f>0.3$
  \item $0.5<\chi^2/(fN)<2$
  \item $P<0.15$
  \item $A<0.15$
\end{enumerate}

{\bf [EL: Why the lower bound on $\chi^2$? If a Good Team fits extremely accurately, but puts an extra ``systematic'' error in to account for uncertainties, why penalize it? This actually happens with our DRW fits, where we sometimes get errors of 0.04 days but never believe this accuracy and might inflate it to 0.4 days. This should be fine, especially given my note below about only counting in $f$ those systems with apparent precision within 5\%.]}

{\bf [TT: I think that the lower bound on $\chi^2$ is needed because overestimating errors is not good either. If we think errors are too large we might overlook some valuable system.]}

A failure rate of 50\% is something like the borderline of acceptability for LSST, and so can be used to define the robustness threshold. The TDC0 lenses will be selected to span the range of possible time delays, rather than being sampled from the OM10 distribution, and so we expect a higher rate of catastrophic failure at this stage than in TDC1: 50\% is a minimal bar to clear.

{\bf [EL: See my previous remarks about not wanting $f=1$, but rather that $f$ should take the value of the fraction of systems that could legitimately be fit given season coverage. One should penalize $f$ greater than this value. Also, Alireza and I use ratings (gold, silver, brass) to indicate a degree of confidence; this is useful since systems will need spectroscopic follow-up and we should not waste telescope time on brass systems. So a low $f$ is not automatically bad. One could allow Good Teams to submit one entry for their gold+silver systems, say, and one entry for all their systems, and not penalize the former for its low $f$ as long as $fN>100$ when $N\ge1000$, say, if that is what we think is realistic for follow-up.]}

{\bf [TT: That's a good point, and a matter of philosophy to some extent. In the scenario you describe one could imagine that failure means a very large uncertainty, so that your brass systems would have very large uncertainties and not be used. I am fine lowering the threshold, considering that some systems might indeed not be measurable if there are too many gaps, so I lowered it to $f>0.3$.]}

The factor of two in reduced $\chi^2$ corresponds to the limit that includes approximately 95\% of the $\chi^2$ probability distribution when $N=8$: fits beyond this limit likely have problems with the time delay estimates, or with the estimation of their uncertainties, or both.

{\bf [EL: I didn't follow this. If fits are $2\sigma$ away then each contributes $\chi^2=4$, not 2.]}

{\bf [TT: It's $2\sigma$ on the distribution of $\chi^2$ given $N=8$ degrees of freedom. I hope this version is clearer.]}
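As a quick numerical check of this statement (an illustrative sketch only; it assumes Python with SciPy, which is not part of the challenge infrastructure), one can confirm that a reduced $\chi^2$ of two with $N=8$ degrees of freedom encloses roughly 95\% of the $\chi^2$ probability:

\begin{verbatim}
# Sanity check of the factor-of-two cut on reduced chi^2 for N = 8.
from scipy.stats import chi2

N = 8
limit = 2.0 * N                 # reduced chi^2 of 2 => total chi^2 of 2N
print(chi2.cdf(limit, df=N))    # ~0.958: about 95% of the distribution
print(chi2.sf(limit, df=N))     # ~0.042: tail probability beyond the cut
\end{verbatim}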
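For concreteness, the TDC0 pass/fail test enumerated above could be scripted along the following lines. This is a minimal sketch rather than the actual challenge code: it assumes the definitions of $f$, $\chi^2$, $P$ and $A$ given earlier in the paper, takes the accuracy criterion on the magnitude $|A|$, and uses placeholder array names (\texttt{dt\_true}, \texttt{dt\_est}, \texttt{sigma}) for the Evil Team's truth table and a Good Team's submission.

\begin{verbatim}
import numpy as np

def passes_tdc0(dt_true, dt_est, sigma):
    """Illustrative TDC0 pass/fail check (Evil Team side).

    dt_true        -- true time delays for all N lenses (truth table)
    dt_est, sigma  -- a team's estimated delays and uncertainties,
                      with NaN marking lenses that were not attempted
    """
    N = len(dt_true)
    ok = np.isfinite(dt_est) & np.isfinite(sigma)   # submitted systems
    f = ok.sum() / float(N)                         # success fraction

    # Goodness of fit per submitted lens
    chi2 = np.sum(((dt_est[ok] - dt_true[ok]) / sigma[ok])**2)
    chi2_red = chi2 / (f * N)

    # Average claimed precision and average fractional bias
    P = np.mean(sigma[ok] / np.abs(dt_true[ok]))
    A = np.mean((dt_est[ok] - dt_true[ok]) / dt_true[ok])

    return (f > 0.3) and (0.5 < chi2_red < 2.0) \
        and (P < 0.15) and (abs(A) < 0.15)
\end{verbatim}

The thresholds in the final line are exactly those listed in items 1--4 above; in practice such a check would be run by the Evil Team on each submission.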
Requiring precision and accuracy of better than 15\% is a further minimal bar to clear; in \S~\ref{structure} we will describe the targets for TDC1.

{\bf [EL: We actually care much more about the ``apparently precise'' systems than about all the systems. For time delays of 1--5 days, it will be almost impossible with the LSST cadence to get 5\% precision. The cosmological leverage will then all come from long time delays of 30--100 days. So maybe we should specifically redefine $f$ as the fraction of systems fit to apparent precision $\sigma_i/\tilde{\Delta t}_i<0.05$ (note that the Good Team measures both numerator and denominator, so it stays blind). In this case $f$ will generally be much less than 1, but should roughly represent the fraction of systems with time delays between 30 days and 120 days (or the season length). If you want, you could put in some trick systems where the time delay takes it to the next season, i.e.\ greater than 240 days.]}

Repeat submissions will be accepted as teams iterate their analyses on the lower rungs of TDC0. The final rung will remain blinded until after the nominal deadline of 1 December 2013, when the initial qualifiers for TDC1 will be announced and the TDC1 data released. Late submissions will be accepted, but those teams will then have less time to carry out TDC1.

\subsubsection{TDC1}

Good Teams that successfully pass TDC0 will be given access to the full TDC1 data set. As in TDC0, the Good Teams will estimate time delays and uncertainties and provide their answers to the Evil Team via a suitable web interface (to be found at the challenge website). The Evil Team will then compute the metrics described above. There is no unique way to define a single summary metric that balances all four different requirements; we choose to define one for the sake of completeness, although it should be regarded only as a rough estimate of the overall quality of the algorithms:
\begin{equation}
  C=\left|\frac{\chi^2}{fN}-1\right|\frac{AP}{f}.
\end{equation}

{\bf [EL: Not sure where this combination comes from. A standard statistical measure is the mean squared error or risk, $R=\sum_i\sqrt{\sigma_i^2+(\tilde{\Delta t}_i-\Delta t_i)^2}=\sum_i \sigma_i\sqrt{1+\chi^2_i}$. Again, we could apply this only to the ``apparently precise'' fraction $f$.]}

The results will not be revealed until the end of the challenge, in order to maintain blindness. The deadline for TDC1 is 1 July 2014, i.e.\ seven months after TDC0. Multiple submissions are accepted from each team, in order to allow for the correction of bugs and for different algorithms; however, only the most recent submission for each algorithm will be considered, in order to avoid favoring teams with multiple submissions. Late submissions will be accepted and included in the final publication if received in time, but will be flagged as such.

\subsubsection{Publication of the results}