The Bayesian optimisation paradigm is particularly computationally efficient for the task at hand: it allows us to optimise arbitrary functions in relatively few iterations. The downside is that more costly inter-iteration work is performed. Given that the occupancy-grid problem typically involves datasets with a massive number of samples, we find that in our case the intra-iteration work dominates our runtime, and as such it is a good trade-off to make.
We also consider that the most common alternative is to perform a grid-search. This is a method that might be reasonable for models that are efficient to train or have a small number of hyperparameters. In the absence of these qualities, however, it performs quite poorly. Given our large number of hyperparameters and the range of values they can take, grid-search's exponential run-time is particularly concerning.
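To make the cost concrete (with illustrative numbers, not figures from our experiments): discretising $d$ hyperparameters to $k$ candidate values each, grid-search must train the model

$$N_{\text{grid}} = k^{d}$$

times, so even a modest $d = 6$ with $k = 10$ already demands $10^{6}$ full training runs, where the Bayesian approach aims to get by with a few dozen evaluations.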
Finally, our approach in principle allows a high degree of parallelisation. We have to do a little more bookkeeping than grid-search, as we don't want to repeat experiments, but in essence we have a large number of independent experiments and can therefore spread the work across multiple cores, as sketched below.
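A minimal sketch of this pattern (hypothetical code, not our pipeline: `train_and_score` and `evaluate_batch` are stand-in names for an arbitrary training-plus-validation routine and its batch driver):

```python
# Illustrative sketch only: a batch of independent hyperparameter
# evaluations spread across cores, with a `seen` set providing the
# small amount of extra bookkeeping that avoids repeating experiments.
from concurrent.futures import ProcessPoolExecutor

def train_and_score(params):
    # Stand-in objective; a real version would train the model with
    # these hyperparameters and return its validation score.
    x, y = params
    return -((x - 0.3) ** 2 + (y - 0.7) ** 2)

def evaluate_batch(candidates, seen):
    """Evaluate unseen candidate settings in parallel."""
    fresh = []
    for c in candidates:
        if c not in seen:      # skip anything already tried
            seen.add(c)
            fresh.append(c)
    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(train_and_score, fresh))
    return dict(zip(fresh, scores))

if __name__ == "__main__":
    seen = set()
    batch = [(0.1, 0.5), (0.3, 0.7), (0.1, 0.5)]  # deliberate duplicate
    print(evaluate_batch(batch, seen))  # duplicate is run only once
```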
## Choosing the Next Hyperparameter
We adopt the formulation found in \cite{snoek2012practical}. Our hyperparameters are assumed a priori to be multivariate normal.
This results in a procedure that can find the minimum of difficult non-convex functions with relatively few evaluations, at the cost of performing more computation to determine the next point to try. When evaluations of $f(x)$ are expensive to perform (as is the case when it requires training a machine learning algorithm), it is easy to justify some extra computation to make better decisions.
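To illustrate what this extra computation looks like, here is a minimal sketch of a single step (an illustration under assumptions, not our exact implementation: it uses scikit-learn's Gaussian process and the expected-improvement acquisition function introduced below; `propose_next` and `expected_improvement` are hypothetical helper names):

```python
# Sketch of one Bayesian-optimisation step: fit a Gaussian process to
# the hyperparameter settings tried so far, then spend extra computation
# scoring untried candidates to pick the most promising next point.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(gp, X_cand, y_best):
    """Expected improvement for minimisation (Snoek et al., 2012)."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-12)   # guard against zero predictive std
    gamma = (y_best - mu) / sigma
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

def propose_next(X_tried, y_tried, X_cand):
    """Fit the GP posterior, then pick the candidate maximising EI."""
    gp = GaussianProcessRegressor(normalize_y=True).fit(X_tried, y_tried)
    ei = expected_improvement(gp, X_cand, np.min(y_tried))
    return X_cand[np.argmax(ei)]

# Toy usage: three settings tried so far, 100 untried candidates.
rng = np.random.default_rng(0)
X_tried = rng.uniform(size=(3, 2))   # 3 settings of 2 hyperparameters
y_tried = rng.uniform(size=3)        # their validation losses
X_cand = rng.uniform(size=(100, 2))
print(propose_next(X_tried, y_tried, X_cand))
```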
This prior and these data induce a posterior over functions; the acquisition function, which we denote by $a : \mathcal{X} \to \mathbb{R}^{+}$, determines what point in $\mathcal{X}$ should be evaluated next.
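One standard choice, discussed in \cite{snoek2012practical}, is the expected improvement. Writing $\mu(x)$ and $\sigma(x)$ for the GP posterior predictive mean and standard deviation, and $x_{\text{best}}$ for the best point observed so far, it has the closed form

$$a_{\mathrm{EI}}(x) = \sigma(x)\left(\gamma(x)\,\Phi(\gamma(x)) + \phi(\gamma(x))\right), \qquad \gamma(x) = \frac{f(x_{\text{best}}) - \mu(x)}{\sigma(x)},$$

where $\Phi$ and $\phi$ are the standard normal CDF and density. The next point to evaluate is then the maximiser $x_{\text{next}} = \operatorname{argmax}_{x} a_{\mathrm{EI}}(x)$.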