\documentclass[11pt]{article}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{graphicx}

\begin{document}

\section{Time series convergence}

Early visual inspection of the square displacement, $(x-x_0)^2$, histograms showed a sudden, unnatural truncation of the histogram at some large displacement. The point at which the truncation occurs grows as the number of iterations increases. This observation suggests that the system has not equilibrated and is, in fact, still diffusing within the basin. This is a serious concern because, when measuring the mean square displacement, we want to make sure that the number of iterations is sufficiently large for the walker to reach the boundaries of the basin of attraction when it is not coupled to the origin ($k=0$).

Asenjo et al.\ found that the weakly coupled replicas tend to get trapped in distant, tiny regions of the basin. Swapping these replicas with the more strongly coupled ones (with larger $k$), which oscillate around the origin, then improves equilibration. Yet the timescales over which we were performing the calculations ($10^5$--$10^6$ steps) did not seem sufficient to reach the boundaries of the basin. See Fig.~(\ref{fig:timeseries}) for an example.

\begin{figure}[t]
\centering
\begin{minipage}{\textwidth}
\centering
\includegraphics[width=\linewidth]{./figures/time_series.eps}
\caption{Time series for an $n=32$ particle system at packing fractions $\phi_{HS}=0.7$ and $\phi_{SS}=0.88$, run for about $2 \times 10^6$ MCMC steps; only $1/100$ of all points are shown.}
\label{fig:timeseries}
\end{minipage}
\end{figure}

Since we do not know how long it will take to reach the boundary of the basin, and we cannot afford to run the calculations for an exceedingly large number of steps, we would like to develop an automated way of adapting the calculation length to reach some target relative statistical error. The statistical error $\Delta_A$, the root-mean-square deviation of the sample mean $\overline{A}$ from the true expectation value $\langle A \rangle$, is given by
\begin{equation}
\label{eq:delta1}
\Delta_A^2 = \frac{\text{Var}A}{M} + \frac{1}{M^2}\sum_{i \neq j = 1}^M \left(\langle A_i A_j\rangle - \langle A \rangle^2\right),
\end{equation}
where $M$ is the sample size. Under the assumption of independence
\begin{equation}
\langle A_i A_j\rangle = \langle A_i \rangle \langle A_j \rangle = \langle A \rangle^2,
\end{equation}
and the second term in Eq.~(\ref{eq:delta1}) vanishes. Since we are exploring the basin using MCMC, our samples are far from independent, and a more accurate estimate of the statistical error can be derived under the assumption that the correlations decay fast as $|i-j|\rightarrow \infty$. With some algebra \cite{Ambeogacar} it can be shown that the second term on the RHS of Eq.~(\ref{eq:delta1}) is equal to
\begin{equation}
\label{eq:delta2}
\frac{1}{M^2}\sum_{i \neq j = 1}^M \left(\langle A_i A_j\rangle - \langle A \rangle^2\right) \approx \frac{2}{M}\sum_{t = 1}^{\infty}\left(\langle A_1 A_{1+t}\rangle - \langle A\rangle^2\right) \equiv \frac{2}{M} \text{Var}A \, \tau_A,
\end{equation}
where we have used the definition of the \emph{integrated autocorrelation time} of $A$,
\begin{equation}
\tau_A = \frac{\sum_{t=1}^{\infty}\left(\langle A_1 A_{1+t}\rangle - \langle A\rangle^2\right)}{\langle A^2\rangle - \langle A\rangle^2}.
\end{equation}
Substituting Eq.~(\ref{eq:delta2}) into Eq.~(\ref{eq:delta1}) we obtain an error estimate that takes the correlation effects into account:
\begin{equation}
\label{eq:delta3}
\Delta_A^2 = \frac{\text{Var}A}{M} (1 + 2 \tau_A),
\end{equation}
where $\xi_A = 1 + 2 \tau_A$ is the \emph{statistical inefficiency} and $M/\xi_A < M$ is the effective number of uncorrelated samples. We can compute the integrated autocorrelation time, and hence the statistical inefficiency, using the \texttt{pymbar} library.
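As an illustration, the error estimate of Eq.~(\ref{eq:delta3}) can be obtained from a time series in a few lines of Python. The sketch below assumes the camelCase \texttt{pymbar} interface of the older releases, in which \texttt{timeseries.statisticalInefficiency} returns $\xi_A = 1+2\tau_A$; the helper name \texttt{correlated\_error} is purely illustrative.
\begin{verbatim}
import numpy as np
from pymbar import timeseries  # camelCase API of older pymbar releases

def correlated_error(a):
    """Standard error of the mean of a correlated series, Eq. (eq:delta3)."""
    a = np.asarray(a, dtype=float)
    M = a.size
    xi = timeseries.statisticalInefficiency(a)  # xi = 1 + 2*tau_A
    tau = 0.5 * (xi - 1.0)
    delta = np.sqrt(a.var() * xi / M)           # Delta_A from Eq. (eq:delta3)
    return delta, tau, M / xi                   # error, tau_A, effective samples
\end{verbatim}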
We can rearrange Eq.~(\ref{eq:delta3}) to find
\begin{equation}
M = \frac{\text{Var}A}{\Delta_A^2} (1+2\tau_A),
\end{equation}
then, dividing and multiplying the denominator by $\langle A \rangle^2$, we find
\begin{equation}
\label{eq:ptiter_prediction}
M = \frac{\text{Var}A}{\langle A \rangle^2 \epsilon_A^2} (1+2\tau_A),
\end{equation}
where $\epsilon_A = \Delta_A / \langle A \rangle$ is the relative standard error. Hence, given an estimate of the mean, variance and statistical inefficiency of the time series, we can estimate for how many more steps the calculation needs to run to meet a specific relative standard error target, taking the correlated nature of the data into account.

The practical implementation of this automatic prediction of the computation time is complicated mainly by the fact that:
\begin{itemize}
\item the time series shows an initial fast growth (burn-in region) after which a seemingly stationary state is approached;
\item the value of the integrated autocorrelation time depends strongly on the length of the time series.
\end{itemize}
We set two parameters, \texttt{eq\_min\_ptiter} and \texttt{eq\_max\_ptiter}, which define respectively the minimum and maximum number of PT iterations. \texttt{eq\_min\_ptiter} has to be greater than the number of steps over which the step size is equilibrated and is generally a guess for what the equilibration time should be, by default $5\times10^5$ steps. \texttt{eq\_max\_ptiter} is by default set to $20$ times \texttt{eq\_min\_ptiter}.

After \texttt{eq\_min\_ptiter} steps we call the \texttt{pymbar.detectEquilibration} function to find the initial burn-in region of the time series to discard. \texttt{pymbar.detectEquilibration} loops over ever shorter segments of the time series looking for the one that gives the largest number of effectively uncorrelated samples, that being the equilibrated set of data, and it returns an equilibration time. From this point on we only consider points in the time series past this equilibration time. We would now expect to be dealing with a stationary time series while, as a matter of fact, the system is probably still slowly drifting. In any case, we want to estimate the integrated autocorrelation time, which depends strongly on the length of the time series. When truncating the time series we might be left with only a short array of points; in that case we return \texttt{eq\_max\_ptiter} as the predicted number of PT iterations. Once the truncated time series is at least $10^5$ steps long, we start updating the number of predicted PT iterations from Eq.~(\ref{eq:ptiter_prediction}). Since all replicas run for the same amount of time, we take the largest predicted number of steps among the replicas, which typically comes from the one corresponding to $k=0$.
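For concreteness, the procedure just described might be summarised in a short Python sketch along the following lines. The function name \texttt{predict\_ptiter}, its keyword arguments and the way the prediction is capped at \texttt{eq\_max\_ptiter} are illustrative assumptions rather than the exact implementation; only \texttt{timeseries.detectEquilibration} is taken from \texttt{pymbar} (camelCase interface of the older releases).
\begin{verbatim}
import numpy as np
from pymbar import timeseries  # camelCase API of older pymbar releases

def predict_ptiter(a, target_rel_err, eq_min_ptiter=500000,
                   eq_max_ptiter=None):
    """Predicted total number of PT iterations needed to reach a target
    relative standard error on <A>, following Eq. (eq:ptiter_prediction).
    Illustrative sketch; parameter names mirror those in the text."""
    if eq_max_ptiter is None:
        eq_max_ptiter = 20 * eq_min_ptiter
    a = np.asarray(a, dtype=float)
    if a.size < eq_min_ptiter:
        return eq_max_ptiter                 # too early to attempt a prediction
    t0, xi, n_eff = timeseries.detectEquilibration(a)  # discard burn-in
    prod = a[t0:]
    if prod.size < 100000:
        return eq_max_ptiter                 # too short for a reliable tau_A
    mean, var = prod.mean(), prod.var()
    M = var * xi / (mean**2 * target_rel_err**2)       # xi = 1 + 2*tau_A
    return min(t0 + int(np.ceil(M)), eq_max_ptiter)
\end{verbatim}
\end{document}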