Bayesian Optimisation over the Hyperparameters

In this section we are concerned with the selection of our model hyperparameters, summarised below.

Kernel Approximation Hyperparameters

  • All features:
    • k: the number of components, i.e. the dimension of the approximated feature space.
  • Sparse features:
    • r: the threshold below which a kernel value is set to zero (see the sketch following this list).
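
To make the roles of k and r concrete, the following is a minimal sketch of a sparse kernel feature map: an input is compared against k component points with an RBF kernel, and values below r are zeroed. The component points, kernel choice, and length-scale here are illustrative assumptions, not our exact feature construction.

```python
import numpy as np

def sparse_kernel_features(x, components, r, gamma=1.0):
    """Map an input x to a k-dimensional sparse feature vector.

    components : (k, d) array of component points (assumed given).
    r          : kernel values below this threshold are set to zero.
    gamma      : RBF length-scale parameter (illustrative default).
    """
    # RBF kernel between x and each of the k component points.
    sq_dists = np.sum((components - x) ** 2, axis=1)
    phi = np.exp(-gamma * sq_dists)   # shape (k,)
    phi[phi < r] = 0.0                # sparsification: drop small kernel values
    return phi

# Example: k = 100 components in 2-D, threshold r = 0.01.
rng = np.random.default_rng(0)
components = rng.uniform(-1.0, 1.0, size=(100, 2))
print(sparse_kernel_features(np.array([0.2, -0.3]), components, r=0.01).shape)  # (100,)
```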

SGD Hyperparameters

  • \(\boldsymbol{\alpha}\): the learning rate (descent rate) of our gradient method.
  • \(\boldsymbol{\lambda}_1\): regularisation parameter (\(\ell_1\) norm).
  • \(\boldsymbol{\lambda}_2\): regularisation parameter (\(\ell_2\) norm); a sketch of how these three parameters enter a single SGD update follows this list.
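
For concreteness, the sketch below shows one way these three parameters enter a single SGD step, assuming a logistic loss and the penalty \(\lambda_1\lVert w\rVert_1 + \lambda_2\lVert w\rVert_2^2\); the exact loss used by our classifier may differ.

```python
import numpy as np

def sgd_step(w, x, y, alpha, lam1, lam2):
    """One stochastic gradient step on a single sample (x, y), y in {-1, +1}.

    alpha : learning rate (descent rate).
    lam1  : weight of the l1 penalty (encourages sparse weights).
    lam2  : weight of the l2 penalty (shrinks weights).
    """
    # Gradient of the logistic loss log(1 + exp(-y * w.x)) with respect to w.
    margin = y * np.dot(w, x)
    grad_loss = -y * x / (1.0 + np.exp(margin))
    # Subgradient of lam1 * ||w||_1 plus gradient of lam2 * ||w||_2^2.
    grad_reg = lam1 * np.sign(w) + 2.0 * lam2 * w
    return w - alpha * (grad_loss + grad_reg)
```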

Bayesian Optimisation

In this formulation we treat the classifier as a black-box function that maps from the hyperparameter space to some evaluation metric. In our case, we map the hyperparameters listed above to the negative of the area-under-curve (AUC) value (as is customary, we seek to minimise the objective). This black-box view is flexible and convenient: it makes no assumptions about the internals of the classifier.
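
As an illustration, the black-box objective can be wrapped as a single function of the hyperparameters that returns the negative validation AUC. Here train_classifier and the data splits (X_train, y_train, X_val, y_val) are placeholders for our actual training pipeline, not real API calls.

```python
from sklearn.metrics import roc_auc_score

def objective(k, r, alpha, lam1, lam2):
    """Black-box objective: train with the given hyperparameters and
    return the negative validation AUC (lower is better).

    train_classifier and the data splits are placeholders for our pipeline.
    """
    model = train_classifier(X_train, y_train, k=k, r=r,
                             alpha=alpha, lam1=lam1, lam2=lam2)
    scores = model.decision_function(X_val)
    return -roc_auc_score(y_val, scores)
```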

Modelling the hyperparameters as continuous variables allows us an arbitrary degree of precision. This is particularly useful when we have little or no strong prior information available: we can start with wide bounds on the hyperparameters and progressively narrow our focus. A comparable grid search would generally involve multiple trials, with a different hard-set granularity each time.
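
For example, with a generic Bayesian-optimisation library such as scikit-optimize (one possible tool, not necessarily the one we use), the continuous bounds are declared directly and can simply be tightened in later runs without ever committing to a grid granularity; the bounds below are purely illustrative.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# Wide initial bounds (illustrative); later runs can simply narrow them.
space = [
    Integer(10, 2000, name="k"),                        # feature-space dimension
    Real(1e-4, 1e-1, prior="log-uniform", name="r"),    # sparsification threshold
    Real(1e-5, 1e0, prior="log-uniform", name="alpha"), # SGD learning rate
    Real(1e-6, 1e0, prior="log-uniform", name="lam1"),  # l1 regularisation
    Real(1e-6, 1e0, prior="log-uniform", name="lam2"),  # l2 regularisation
]

def objective_list(params):
    # gp_minimize passes a single list of values; unpack and call the
    # objective sketched above.
    k, r, alpha, lam1, lam2 = params
    return objective(k, r, alpha, lam1, lam2)

result = gp_minimize(objective_list, space, n_calls=50, random_state=0)
print(result.x, result.fun)  # best hyperparameters and best (negative) AUC
```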

Another salient feature of this problem that makes it amenable to this approach is the hyperparameter-dependent run-time of our classifier. This is particularly apparent for the k hyperparameter, the dimension of the estimated kernel space: it is reasonable to expect run-time to increase monotonically with k when the other hyperparameters are held fixed. With grid search we might therefore be forced to run with a small k and hope that the results generalise. Through Bayesian optimisation, however, we can choose to pursue higher values of k when doing so appears to be a better use of our time and computational resources.

Cost

The Bayesian optimisation paradigm is computationally attractive for the task at hand: it can optimise arbitrary black-box functions in relatively few evaluations. The downside is the additional work performed between evaluations, namely fitting the surrogate model and maximising the acquisition function. Given that the occupancy-grid problem typically involves datasets with a very large number of samples, the work within each evaluation (training the classifier) dominates our run-time, and the trade-off is therefore a good one.

The most common alternative is grid search, which can be reasonable for models that are cheap to train or that have few hyperparameters. In the absence of these qualities, however, it performs poorly. Given our five hyperparameters and the range of values each can take, the exponential growth of grid search in the number of hyperparameters is particularly concerning: even a coarse grid of ten values per hyperparameter would require \(10^{5}\) training runs.

Finally, Bayesian optimisation allows, in principle, a high degree of parallelisation. A little more bookkeeping is required than with grid search, since we must avoid proposing the same experiment twice, but in essence we still have a large number of independent experiments and can therefore spread the work across multiple cores.
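
As a rough sketch of the parallel case (candidate generation and the bookkeeping needed to avoid repeats are elided; joblib is one option among several), a batch of proposed hyperparameter settings can be evaluated independently across cores and the results fed back to the optimiser.

```python
from joblib import Parallel, delayed

def evaluate_batch(candidates, n_jobs=-1):
    """Evaluate a batch of independent hyperparameter settings in parallel.

    candidates : list of (k, r, alpha, lam1, lam2) tuples proposed by the
                 optimiser; each entry is an independent training run of the
                 objective sketched earlier.
    """
    return Parallel(n_jobs=n_jobs)(
        delayed(objective)(*params) for params in candidates
    )
```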

Choosing the Next Hyperparameter

We adopt the formulation of \cite{snoek2012practical}. The objective is modelled with a Gaussian-process prior, so that its values at any finite collection of hyperparameter settings are a-priori jointly multivariate normal. We determine the next hyperparameter setting by choosing the one that offers the greatest expected improvement. That is,