Highlights
- We explain how to perform Bayesian inference for time series where each data point is the cumulative maximum (or minimum) of an i.i.d. series.
- We compare the results of this framework to a classic minimum mean square error (MMSE) frequentist approach, using world record data from six athletic events. We find that the Bayesian posterior mean estimate and the frequentist approach perform similarly in terms of mean squared error.
- We explore the effect of the choice of the distribution of attempts. We find that assuming a Weibull distribution marginally outperforms a Gaussian distribution, and that both robustly outperform a Gumbel distribution of attempts.
- We forecast world records for 11 categories of athletic events for the 2022 to 2032 period.
- We introduce fmax, a Python open-source package to model and forecast time series of cumulative minima and maxima. The package can be found at https://github.com/jlindbloom/fmax.
How often and by how much are Olympic records beaten? What score do we expect future machine learning systems to attain on classification tasks in the absence of new breakthroughs? With what probability will the fastest speedrun of our favorite video game be beaten within the next year?
In situations such as these, we are interested in characterizing how a historical record has evolved and will evolve in the future. And while properties of order statistics such as the maximum over a set of random variables are well-studied, the running maximum (or minimum) of a time series is markedly less so.
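To make the object of study concrete, the following sketch (our illustration, not code from the fmax package) simulates i.i.d. attempts and extracts their running maximum; the attempt distribution and its parameters are arbitrary:

```python
# Illustrative sketch: simulate i.i.d. attempts and record the running
# maximum, the kind of series this article models.
import numpy as np

rng = np.random.default_rng(0)
attempts = rng.normal(loc=10.0, scale=1.0, size=200)  # hypothetical i.i.d. attempts
records = np.maximum.accumulate(attempts)             # running maximum after each attempt

# The record series is a non-decreasing step function; most observations
# simply repeat the previous record.
print(records[:10])
```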
An example of such work is presented in \citep{tryfos_forecasting_1985}, in which the authors model the world record in six major running events using an i.i.d. distribution for attempts. In this article, we present the corresponding Bayesian approach to such models. We use this model to derive predictions for the men's and women's data for the same events considered in \citep{tryfos_forecasting_1985}, and show comparable performance in terms of squared error against the records that actually followed.
We also discuss the effects of the choice of attempt distribution. We find that a Weibull distribution gives the best fit in terms of log-likelihood, marginally outperforming a Gaussian distribution and robustly outperforming a Gumbel distribution.
Finally, we provide a forecast of records in the next decade for the 11 categories of athletic events we collected data on.
Previous work
\cite{tryfos_forecasting_1985} develops the minimum mean square error (MMSE) estimator for a series of cumulative minima. They derive the estimator for both a normal distribution and extreme value distributions of attempts. In particular, they apply their approach to forecast the world records of six running events, assuming an underlying Gumbel distribution with density \(f(x) = \frac{1}{\sigma}\exp\left(\frac{x-\mu}{\sigma} - e^{(x-\mu)/\sigma}\right)\). We use this article as a basis to compare our approach. In section \ref{814902} we show how our approaches compare using the same data as the authors.
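For reference, this density is the left-skewed Gumbel distribution, which matches scipy's `gumbel_l` parameterization; the quick check below uses illustrative values of \(\mu\) and \(\sigma\), not values fitted to any of the events:

```python
# Sketch: verify that the density from the text matches scipy's
# left-skewed Gumbel; mu and sigma below are illustrative only.
import numpy as np
from scipy.stats import gumbel_l

mu, sigma = 240.0, 5.0                       # hypothetical location/scale, e.g. seconds
x = np.linspace(mu - 30.0, mu + 10.0, 5)

z = (x - mu) / sigma
pdf_manual = np.exp(z - np.exp(z)) / sigma   # density as written in the text
pdf_scipy = gumbel_l.pdf(x, loc=mu, scale=sigma)

assert np.allclose(pdf_manual, pdf_scipy)
```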
\cite{smith1988} extends the work of Tryfos and Blackmore, deriving the maximum likelihood estimator for a series of cumulative minima in which each attempt is the sum of an i.i.d. random variable \(X_n\) and a nonrandom drift trend \(c_n\):
\[Y_n = X_n + c_n\]
The random component \(X_n\) may follow distributions including the Gaussian, Gumbel, and generalized extreme value (GEV) distributions, while the drift trends considered include linear drift, quadratic drift, and exponential decay models. The author applies the method to model records in the mile and marathon races.
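A minimal simulation of this attempt model, with an illustrative linear drift and Gaussian random component (neither fitted to any data), shows how drifting attempts generate a record series:

```python
# Sketch of the model Y_n = X_n + c_n with an illustrative linear drift;
# all parameter values are made up for illustration, not fitted.
import numpy as np

rng = np.random.default_rng(1)
steps = np.arange(1, 301)
c = 250.0 - 0.05 * steps                    # nonrandom linear drift term c_n
x = rng.normal(scale=2.0, size=steps.size)  # i.i.d. random component X_n
y = x + c                                   # attempts Y_n
record = np.minimum.accumulate(y)           # world-record series (cumulative minimum)
```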
We could in theory adapt their approach by assuming a zero-drift trend \(c_n = 0\). In practice, however, for the data we considered and on short time scales, this results in constant extrapolated forecasts \(\hat Y_{n+k} = Y_n\). Future work may extend the Bayesian framework presented in this paper to the case of non-zero drift and compare the two approaches.
\citep{miller1986} follow Smith by considering a Gumbel model with linear drift, but work within a state-space approach to explicitly construct the forward-looking predictive distribution for the model. They also consider a Bayesian formulation, applying their method to forecasting athletic records, as Tryfos and we do.
\citep{Wergen_2013} focuses on the related problem of modeling the probability that time step \(n\) sets a new record given historical data. The authors assume a linear drift in the attempts.
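As a quick illustration of this quantity (our sketch, not the authors' method): for i.i.d. attempts, the probability that attempt \(n\) sets a new record is \(1/n\) by symmetry, and a linear drift inflates it, as a Monte Carlo estimate confirms:

```python
# Monte Carlo check (illustrative): P(attempt n is a record) is 1/n for
# i.i.d. attempts; adding a linear drift raises that probability.
import numpy as np

rng = np.random.default_rng(2)
n, trials = 50, 100_000
x = rng.normal(size=(trials, n))
p_iid = np.mean(x[:, -1] > x[:, :-1].max(axis=1))  # ~ 1/50 = 0.02
x_drift = x + 0.02 * np.arange(n)                  # linear drift in attempts
p_drift = np.mean(x_drift[:, -1] > x_drift[:, :-1].max(axis=1))
print(f"iid: {p_iid:.4f}, with drift: {p_drift:.4f}")
```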
\citep{sym12091443} derive the Jeffreys prior for the Gumbel distribution, and derive the density function of the conditional forecast distribution of records of i.i.d. Gumbel variables given previous observations. They compare their result to ARIMA and DLM approaches. As we will discuss in this article, the Gumbel distribution seems to underperform relative to the Weibull and Gaussian distributions, suggesting a natural extension to their work.
Beyond the aforementioned articles, we could not find much work on forecasting cumulative records, especially from a Bayesian perspective. This points to a gap in the literature that we aim to fill.
The General Model
In this section, we introduce a general framework for deriving the likelihood of a series of cumulative maxima or minima.
First, we establish the notation we will use throughout the paper. Then we derive the likelihood functions for series of cumulative maxima and minima. These are, respectively:
\[\mathcal{L}_{Y_{1:N}}(y_1, \dots, y_N) = \prod_{i\in R} f_X(y_i) \prod_{i\not\in R} F_X(y_i)\]
\[\mathcal{L}_{Y_{1:N}}(y_1, \dots, y_N) = \prod_{i\in R} f_X(y_i) \prod_{i\not\in R} \left(1 - F_X(y_i)\right)\]
where \(R\) denotes the set of indices at which a new record is set, and \(f_X\) and \(F_X\) are the density and cumulative distribution function of the attempts.
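As a concrete illustration, the sketch below evaluates the cumulative-maxima log-likelihood under an assumed Gaussian attempt distribution; the distribution, its parameters, and the toy data are illustrative and not taken from the fmax package:

```python
# Minimal sketch of the cumulative-maxima likelihood: record-setting
# observations contribute the density f_X, all others the CDF F_X.
import numpy as np
from scipy.stats import norm

def log_likelihood(records, dist):
    records = np.asarray(records, dtype=float)
    is_record = np.empty(records.size, dtype=bool)
    is_record[0] = True                         # the first observation is always a record
    is_record[1:] = records[1:] > records[:-1]  # indices i in R
    return dist.logpdf(records[is_record]).sum() + dist.logcdf(records[~is_record]).sum()

y = np.maximum.accumulate([9.8, 10.4, 10.1, 10.9, 10.2])  # toy record series
print(log_likelihood(y, norm(loc=10.0, scale=0.5)))       # illustrative attempt distribution
```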