\documentclass{article}
\usepackage[affil-it]{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
\usepackage{url}
\usepackage{hyperref}
\hypersetup{colorlinks=false,pdfborder={0 0 0}}
\usepackage{etoolbox}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{amsmath}
\usepackage{listings}
\lstset{ %
language = Python
}
\newtheorem{theorem}{Theorem}
\newtheorem{proof}{Proof}
\newtheorem{lem}{Lemma}
\newcommand{\truncateit}[1]{\truncate{0.8\textwidth}{#1}}
\newcommand{\scititle}[1]{\title[\truncateit{#1}]{#1}}
\begin{document}
\title{A Solution to the Hyperparameter Dependency of Hilbert Maps}
\author{Xavier Holt}
\affil{Affiliation not available}
\date{\today}
\maketitle
\selectlanguage{english}
\begin{abstract}
Kernel-approximation methods in tandem with efficient online classifiers have proven to be a highly effective solution to the occupancy map problem. In this paper we seek to expand upon the work done in this area. Our contributions are twofold. We demonstrate that a Bayesian logistic regression classifier fails to perform better than the more spartan point-estimator/subgradient descent method. Contrastingly, we show that Bayesian optimisation over the hyperparameters of the model is an incredibly powerful and useful tool for this application.
%
\end{abstract}%
\section{Introduction}
The occupancy map problem is to learn a probabilistic model of some space: we map every point within it to an estimate of the probability that it is occupied. For a primer on the problem, see \cite{ramoshilbert}.
Recently, a novel approach for solving this problem in an effective and online manner was proposed. The method, called Hilbert maps, transforms a low-dimensional feature vector into a denser, higher-dimensional space. This grants a simple linear classifier strong expressive power. The mapping process is further optimised through the use of kernel-approximation methods.
Such a model has been demonstrated to perform very well in terms of the accuracy/time tradeoff \cite{ramoshilbert}. It does, however, depend on a number of hyperparameters. Solving this problem would allow for better results and wider application.
To this end, in \textit{Section 2} we explore a model shift from a frequentist to a Bayesian logistic regression formulation. Our goal is firstly to determine whether we can obtain more accurate models, and secondly whether the paradigmatic shift in our characterisation of uncertainty reduces hyper-parameter dependency.
In \textit{Section 3} of this paper, our focus is on Bayesian hyper-parameter optimisation. We attempt to tackle the hyper-parameter problem more directly through the use of intelligent automatic selection algorithms.
\section{Bayesian Logistic Regression
Formulation}\label{bayesian-logistic-regression-formulation}
Define the logistic sigmoid function \(\sigma(\alpha) = \frac{1}{1 + \exp(-\alpha)}\). A logistic
regression models a binary-response variable with the following density
function:
\begin{align}
p(y_i = 1 \mid \mathbf{x_i, w}) &= \sigma(\mathbf{w^T x_i})\\
p(y_i = 0 \mid \mathbf{x_i, w}) &= 1 - \sigma(\mathbf{w^T x_i})\\
&= 1 - \frac{1}{1 + \exp (- \mathbf{w^Tx_i})} \\
&= \frac{1}{1 + \exp(\mathbf{w^Tx_i})} \\
&= \sigma(\mathbf{-w^Tx_i})
\end{align}
Here \(\mathbf{w}\) is a vector of weights, and \(\mathbf{x_i}\)
is our original data point with the addition of an identity element to
allow for an intercept term. That is, if our original data was of the
form \(\tilde{\mathbf{x}}_{i} = (x_{i1}, x_{i2}, \dots, x_{id-1})^T \in \mathbb{R}^{d-1}\), then our transformed data point is
represented as \((1, x_{i1}, x_{i2}, \dots, x_{id-1})^T := \mathbf{x_i} \in \mathbb{R}^{d}\). The addition of this dimension is
widely used, for example in simple linear regression.
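As a concrete illustration of the density above, consider the following minimal Python sketch (the variable names and toy values are ours, purely for exposition):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def add_intercept(X):
    """Prepend a column of ones so the first weight acts as the intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

# p(y_i = 1 | x_i, w) for a small batch of (d-1)-dimensional points.
X = np.array([[0.5, -1.2],
              [2.0,  0.3]])
w = np.array([0.1, 0.8, -0.5])   # intercept weight first
probs = sigmoid(add_intercept(X) @ w)
```

Note that the identity $\sigma(-a) = 1 - \sigma(a)$ derived above lets us write both class probabilities with a single function.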
Our learning task is then to find some estimate of this weight vector
\(\mathbf{w}\), given our dataset \(\mathcal{D} = (\mathbf{Y_n, X_n} )\).
The traditional approach is to find a point-estimator through a
maximum-likelihood approach. This is equivalent to formulating an
objective function of the likelihood of our estimator given the data,
and optimising through any appropriate method. Alternatively and
equivalently, we could optimise over the log-likelihood. Either way, it
turns out that the objective function is strictly convex
\cite{rennie2005regularized}. This means we can get a reasonable point-estimate
quite efficiently, for example through the use of stochastic gradient
descent. That is, we represent the problem approximately as:
\begin{align}
\hat{\mathbf{w}}_{MLE} &= \text{argmax}_{\mathbf{w}} \ L(\mathbf{w} \mid \mathcal{D}) \\
&= \text{argmax}_{\mathbf{w}} \prod p(y_i \mid \mathbf{x}_i, \mathbf{w})\\
&= \text{argmax}_{\mathbf{w}} \sum \log p(y_i \mid \mathbf{x}_i, \mathbf{w})
\end{align}
Here our second equality is guaranteed by the independence of
sample points. Such an estimator is quite prone to overfitting. To avoid
this it is common to add a regularisation function that penalises weights
in proportion to their magnitude, under some norm. This point-estimate
approach is not the focus of our paper, however, so we mention it only in
passing.
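The maximum-likelihood formulation above can be sketched as a small stochastic-gradient routine. This is a toy illustration of the idea, not the optimiser used in \cite{ramoshilbert}; the learning rate, epoch count and data are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_logistic(X, y, lr=0.1, epochs=100, seed=0):
    """Maximise the log-likelihood by stochastic gradient ascent.
    The gradient of log p(y_i | x_i, w) w.r.t. w is (y_i - sigma(w^T x_i)) x_i."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):          # one sample point at a time
            w += lr * (y[i] - sigmoid(X[i] @ w)) * X[i]
    return w

# Toy separable data: the label is 1 exactly when the second feature is positive.
rng = np.random.default_rng(1)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 1))])
y = (X[:, 1] > 0).astype(float)
w_hat = sgd_logistic(X, y)
```

Because the objective is strictly convex \cite{rennie2005regularized}, even this naive routine converges to a sensible point estimate on well-behaved data.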
Instead of a point-estimate for our weights, it is possible to develop a
distributional estimator. This is useful for several reasons. Having a
distributional result allows us to get a strong understanding of the
confidence of our predictions. Intelligent choice of priors can also
allow us to overcome the overfitting problem. It also affords us the
opportunity to update our belief of our predictions in intelligent ways.
This fully Bayesian model is related to but distinct from the maximum
a-posteriori (MAP) point-estimator.
Naturally our first step in developing such an estimator is to decide on
a reasonable prior distribution for our parameters.
\subsection{Prior for Weights}
In pursuit of model generalisability, we adopt the framework that weights should be at or near zero with high probability. The most widely used weight distribution in logistic regression is Gaussian, so we take this approach as a useful starting point.
\subsubsection{Gaussian Prior: $\mathbf{w} \sim \mathcal{N}_{d}(\mathbf{0}, \text{diag}(\boldsymbol{\sigma}))$}
A zero-mean vector is an obvious starting point. Additionally, we have opted to assume that the weights are heteroscedastic but independent of one another. That is, the covariance is a diagonal $d\times d$ matrix $\Sigma = \text{diag}(\boldsymbol{\sigma}) = \text{diag}(\sigma_1, \sigma_2, \dots \sigma_d)$. Consequently,
$$p(\mathbf{w} | \boldsymbol{\sigma}) = (2\pi )^{-\frac{d}{2}} | \Sigma | ^{-\frac{1}{2}} \exp(-\frac{1}{2} \mathbf{w^T} \Sigma ^{-1} \mathbf{w})$$
\begin{lem}[Diagonal Matrices]
\begin{enumerate}
\item The inverse of a diagonal matrix is readily computable as $(\text{diag}(a_1, a_2,\dots))^{-1} = \text{diag}(a_1^{-1}, a_2^{-1}\dots) $.
\item The roots of a diagonal matrix are similarly trivial with $(\text{diag}(a_1, a_2,\dots))^{1/k} = \text{diag}(a_1^{1/k}, a_2^{1/k},\dots)$.
\item For all matrices with well defined inverse and roots $A^{-1/k}=(A^{1/k})^{-1} = (A^{-1})^{1/k}$.
\item The determinant of a diagonal matrix is the product of the diagonal elements.
\end{enumerate}
\end{lem}
It is then readily apparent that, given our diagonal covariance matrix $\Sigma$ we have $\Sigma^{-1} = \text{diag}(\sigma_1^{-1}, \dots, \sigma_d^{-1})$ and similarly $|\Sigma^{-1}| = |\text{diag}(\sigma_1^{-1}, \dots, \sigma_d^{-1})| = \prod(\sigma_i^{-1})$.
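These properties are cheap to verify numerically; the following NumPy check is purely illustrative (the diagonal values are arbitrary):

```python
import numpy as np

sig = np.array([0.5, 2.0, 4.0])        # diagonal entries sigma_1, ..., sigma_d
Sigma = np.diag(sig)

# Lemma, part 1: the inverse is the element-wise reciprocal of the diagonal.
Sigma_inv = np.diag(1.0 / sig)
# Lemma, part 4 applied to Sigma^{-1}: determinant is the product of the diagonal.
det_inv = np.prod(1.0 / sig)
```

Both quantities agree with a direct call to `np.linalg.inv` and `np.linalg.det`, while costing only $O(d)$ rather than $O(d^3)$ work.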
Using this information, it can be shown that finding the MAP estimator of $\mathbf{w}$ is equivalent to solving the maximum likelihood problem with a ridge regularisation parameter \cite{Hoerl_1970}. The success and ubiquity of this approach speaks to the reasonableness of assuming Gaussian priors. That is,
\begin{align}
\hat{\mathbf{w}} &= \text{argmax}_\mathbf{w} \log \left\{ L(\mathbf{w} \mid \mathcal{D})\times p(\mathbf{w} \mid \boldsymbol{\sigma} )\right\}\\
& = \text{argmax}_\mathbf{w} \left\{ \log(L(\mathbf{w} \mid \mathcal{D})) + \log p(\mathbf{w} \mid \boldsymbol{\sigma}) \right\}\\
& := \text{argmax}_\mathbf{w} \left\{ \mathcal{l}(\mathbf{w} \mid \mathcal{D}) - \frac{1}{2}\|\Sigma^{-1/2}\mathbf{w}\|^2 \right\}
\end{align}
As mentioned, this has some reasonable qualities and has been shown to perform quite well \cite{Zhang_2003,fan2003loss}. The downside is that, while the Gaussian prior favours values particularly close to zero, it does not significantly favour parameters being exactly zero. This is of particular relevance to the problem at hand: since we can define an arbitrary number of features, and hence dimensions, it would be beneficial to enforce some sparsity on the weights. It turns out that regularisers that penalise proportional to the $\mathcal{l}^1$ norm of the weights, instead of the $\mathcal{l}^2$ norm, achieve this result. To this end, we make the following observation:
\subsubsection{Laplace Prior: $w_i \sim_{ind} \text{Laplace}(0, \lambda_i)$}
Let us now assume that the weights are a-priori Laplacian. This implies that for each of the independent weights we have:
$$p(w_i \mid \lambda_i) = (2\lambda_i)^{-1} \exp({- |w_i| \lambda_i^{-1}})$$
where again the weights are centred at zero. Due to our assumption of independence between weights, their joint density is simply the product of the marginals. That is,
\begin{align}
p(w_1, \dots, w_d \mid \lambda_1, \dots, \lambda_d) &:= p(\mathbf{w} \mid \boldsymbol{\lambda})\\
&= \prod_{i\in 1:d} (2\lambda_i)^{-1} \exp({- |w_i| \lambda_i^{-1}}) \\
&= \left(\prod_{i\in 1:d} (2\lambda_i)^{-1}\right) \left(\prod_{i\in 1:d} \exp({- |w_i| \lambda_i^{-1}})\right) \\
&= 2^{-d} \left(\prod_{i\in 1:d} \lambda_i^{-1} \right) \exp\left\{ -\sum_{i\in 1:d} |w_i| \lambda_i^{-1} \right\}
\end{align}
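The product form just derived can be checked numerically; the helper below is illustrative only, with arbitrary weights and scales:

```python
import numpy as np

def laplace_log_prior(w, lam):
    """log p(w | lambda) for independent zero-mean Laplace weights:
    -d log 2 - sum_i log(lambda_i) - sum_i |w_i| / lambda_i."""
    return (-len(w) * np.log(2.0)
            - np.sum(np.log(lam))
            - np.sum(np.abs(w) / lam))

w = np.array([0.3, -1.0, 0.0])
lam = np.array([1.0, 2.0, 0.5])

# Direct product of the univariate densities, for comparison.
dens = np.prod(1.0 / (2 * lam) * np.exp(-np.abs(w) / lam))
```

Exponentiating the log form recovers the product of the marginal densities, confirming the factorisation above.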
We now verify that this prior does in fact contribute an absolute-value penalty to the MAP objective. First, we apply a monotonic function to make things easier for ourselves:
\begin{align}
\text{argmax}_{\mathbf{w}} p(\mathbf{w} \mid \boldsymbol{\lambda})
&= \text{argmax}_{\mathbf{w}} \log p(\mathbf{w} \mid \boldsymbol{\lambda}) \\
&= \text{argmax}_{\mathbf{w}} \left\{ -d\log(2) - \sum \log(\lambda_i) - \sum |w_i| \lambda_i^{-1}\right\}\\
&= \text{argmax}_{\mathbf{w}} \left\{ - \sum |w_i| \lambda_i^{-1} \right\}
\end{align}
We note that we have retrieved the `lasso' regularisation function. This observation was first made by \cite{Tibshirani_2011} and has since been employed in numerous different applications \cite{Tibshirani_2004, Figueiredo_2003, Girosi_1998}. The `lasso' regulariser has been demonstrated to drive parameters not only close to but exactly equal to zero.
To demonstrate this, we include a comparison of the two distributions for a given standard deviation.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.84\columnwidth]{figures/DIST/DIST}
\caption{{Laplacian and Normal Distributions with SD = (1,2,3)%
}}
\end{center}
\end{figure}
It is thereby clear that there is far more mass centered at zero for the
Laplacian prior, given an equal standard deviation. It is worth noting
that given Laplacian prior our objective is not differentiable at zero,
and so any gradient descent approaches have to be applied with care.
This is only mentioned in passing, as again we are not pursuing an
optimisation/point-estimate model.
Given some parametrisation of our prior weight distribution, we can then
begin to formulate our approach.
\subsection{Formulating the Integral}
In addition to formulating an distributional estimator over the parameter space, we'd like to be able to make predictions on new data.
\begin{align}
p(y_{n+1} \mid \mathbf{x}_{n+1}, \mathcal{D_n})
&= \int_{\mathbb{R}^d} p(y_{n+1}, \mathbf{w} \mid \mathbf{x}_{n+1}, \mathcal{D_n}) d\mathbf{w}\\
&= \int_{\mathbb{R}^d} p(y_{n+1} \mid \mathbf{x}_{n+1}, \mathcal{D_n}, \mathbf{w}) p(\mathbf{w} \mid \mathbf{x}_{n+1}, \mathcal{D_n}) d\mathbf{w} \\
&= \int_{\mathbb{R}^d} p(y_{n+1} \mid \mathbf{x}_{n+1}, \mathbf{w}) p(\mathbf{w} \mid \mathcal{D_n}) d\mathbf{w}\\
&= \int_{\mathbb{R}^d} p(y_{n+1} \mid \mathbf{x}_{n+1}, \mathbf{w}) p(\mathbf{w} \mid \mathbf{Y_n}) d\mathbf{w}
\end{align}
We justify the third equality through $y_{n+1} \mid \mathbf{w} {\ \perp \!\!\!\!\! \perp\ } \mathcal{D}_n$ (by independence of sampling) and $\mathbf{w} {\ \perp \!\!\!\!\! \perp\ } \mathbf{x}_{n+1}$ (by assumption). Similarly, the fourth equality follows from $\mathbf{w} \mid \mathbf{Y_n} {\ \perp \!\!\!\!\! \perp\ }\mathbf{X_n}$ and $\mathcal{D_n} := (\mathbf{X_n, Y_n})$.
The left density function is clearly just that of the logistic regression. Our goal then is to find some estimation of the density on the right.
We note that
\begin{align}
p(\mathbf{w} \mid \mathbf{Y_n}) &= \frac{p(\mathbf{Y_n} \mid \mathbf{w}) p(\mathbf{w})}{p(\mathbf{Y_n})}\\
&\propto p(\mathbf{Y}_n \mid \mathbf{w}) p(\mathbf{w})
\end{align}
which is just the product of our logistic-regression likelihood and our prior distribution above. With either Normal or Laplacian priors, the normalising integral is intractable. As such, we develop an approximation of the posterior distribution.
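Once posterior weight samples are available (from the sampler of the next subsection), the predictive integral above reduces to a Monte Carlo average of the logistic likelihood. A sketch, with symmetric stand-in samples in place of a real posterior:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_mc(x_new, w_samples):
    """Monte Carlo estimate of p(y_{n+1} = 1 | x_{n+1}, D_n):
    average the logistic likelihood over posterior weight samples."""
    return np.mean(sigmoid(w_samples @ x_new))

# Stand-in samples; in practice these come from the slice sampler below.
w_samples = np.random.default_rng(0).normal(size=(5000, 3))
x_new = np.array([1.0, 0.2, -0.4])
p = predict_mc(x_new, w_samples)
```

With these zero-mean symmetric stand-in samples the estimate sits near $0.5$, as symmetry dictates; a real posterior would of course shift it toward the data.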
\subsection{Slice Sampling}\label{slice-sampling}
Slice sampling is a stochastic method used to sample from arbitrary,
possibly unnormalised, densities. Specifically, slice sampling belongs to the family of
Markov chain Monte Carlo (MCMC) methods. The general principle behind
such methods is to construct a Markov chain that has the desired
distribution as an equilibrium solution. They are used to obtain numeric
approximations from multi-dimensional integrals for which no closed-form
solution is readily available. We can then use the samples to construct
estimations of statistics on the underlying distribution (mean, mode,
variance etc.) by simply calculating the equivalent sample-statistic.
Slice sampling in particular possesses several very desirable properties
\cite{murray2009elliptical}. It allows us to sample from integrals that are
\emph{proportional} to our underlying distribution. There is no
assumption that our probability mass sums to one. Consequently our
intractable denominator above is of no concern -- we simply proceed
using the scaled posterior.
Additionally, slice sampling has no free parameters. This is useful from
a practical perspective, as it removes the need to spend time tuning and
updating hyper-parameters for more complex models. It also impacts on
performance; the more general Metropolis-Hastings model is highly
sensitive to step-size parameters, which means unless we get it right
we're going to perform very poorly on evaluation. In contrast, slice
sampling adjusts the step-size to match the local shape of the density
function. There are several modified slice-samplers which are heavily
parallelisable \cite{Tran_2015}.
In our analysis we use the following slice-sampling algorithm,
iteratively sampling each weight-parameter in turn. In some sense
this is analogous to the coordinate-wise updates of stochastic
gradient descent.
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/im/im}
\caption{Slice-Sampling Methods%
}
\end{center}
\end{figure}
Here $f$ is a function proportional to our density, $x_0$ is our current sample point, $y$ is the horizontal slice level, and $w/p$ control the width of our interval.
We generate a parameter estimation through taking a sequence of samples. We then substitute our estimated weight-distribution back into the integral above, allowing us to predict new points. In terms of implementation, in this paper we use the Laplacian prior distribution with $\lambda_i = \lambda \ \forall i$ being an equal regularisation hyperparameter. For our initial experiments we simply set this to 10 through cross-validation and inspection.
\section{Bayesian Optimisation over the
Hyperparameters}\label{bayesian-optimisation-over-the-hyperparameters}
In this section we are concerned with our model hyperparameters.
\paragraph{Kernel Approximation
Hyperparameters}\label{kernel-approximation-hyperparameters}
\begin{itemize}
\tightlist
\item
All features:
\begin{itemize}
\tightlist
\item
\textbf{k}: number of components/dimension of feature-space.
\end{itemize}
\item
Sparse features:
\begin{itemize}
\tightlist
\item
\textbf{r}: value below which a kernel value will be set to zero.
\end{itemize}
\end{itemize}
\paragraph{SGD Hyperparameters}\label{sgd-hyperparameters}
\begin{itemize}
\tightlist
\item
\(\boldsymbol{\alpha}\): the descent rate of our gradient method.
\item
\(\boldsymbol{\lambda}_1\): regularisation parameter (\(\mathcal{l}_1\)
norm).
\item
\(\boldsymbol{\lambda}_2\): regularisation parameter (\(\mathcal{l}_2\)
norm).
\end{itemize}
\subsection{Bayesian Optimisation}\label{bayesian-optimisation}
In this formulation we treat the classifier as a black-box function that
maps from the hyper-parameter space to some evaluation metric. In our
case, we map from the above hyper-parameters to the negative of the
area-under-curve value (as is customary we seek to minimise the
objective). This is an incredibly useful and flexible model.
Modelling the hyperparameters as continuous random variables allows us
an arbitrary degree of precision. This is particularly useful for
problems where we have little or no strong prior information available
to us. We can start with wide bounds on our hyperparameters and narrow
our focus. A similar approach with grid-search would generally involve
multiple trials, with different hard-set granularity each time.
Another salient feature of this problem which makes it amenable to this
approach is the hyper-parameter dependent run-time of our classifier.
This is particularly apparent when we consider the \textbf{k}
hyperparameter: the dimension of the estimated kernel-space. It is
reasonable to believe that run-time increases monotonically with
\textbf{k}, assuming our other hyperparameters are fixed. As such, with
an approach such as grid-search we might be forced to run with a small
\textbf{k} and hope that the results generalise. Through Bayesian
optimisation, however, we can intelligently decide to pursue higher
\textbf{k} values if it is reasonable to assume that by doing so we can
better make use of our time/computational resources.
\subsection{Cost}\label{cost}
The Bayesian optimisation paradigm is particularly computationally
efficient for the task at hand. It allows us to optimise arbitrary
functions in relatively few iterations. The downside is that there is
more costly inter-iteration work performed. Given that the
occupancy-grid problem typically involves datasets with a massive number
of samples, we find that in our case the intra-iteration work dominates
our runtime and as such it's a good trade-off to make.
We also consider that the most common alternative is to perform a
grid-search. This is a method that might be reasonable for models that
are efficient to train or have a small number of hyperparameters. In the
absence of these qualities, however, it performs quite poorly. Given our
large number of hyper-parameters and the range of values they can take,
grid-search's exponential run-time is particularly concerning.
Finally, our model in theory allows for a high degree of
parallelisation. We have to do a little more work compared to
grid-search, as we don't want to repeat experiments, but in essence we
have a large number of independent experiments and as such we can do
work across multiple cores.
\subsection{Choosing the Next
Hyperparameter}\label{choosing-the-next-hyperparameter}
We adopt the formulation found in \cite{snoek2012practical}. Our
hyperparameters are assumed a-priori multivariate-normal. We determine
the next hyperparameters by choosing those which offer us the most
expected improvement. That is,
$$\mathbf{x}_{n+1} = \text{argmax}_{\mathbf{x}}\ \sigma(\mathbf{x}) \left(\gamma(\mathbf{x})\Phi(\gamma(\mathbf{x})) + \mathcal{N}(\gamma(\mathbf{x}); 0, 1)\right)$$
Here $\gamma( \mathbf{x} ) := \left( f ( \mathbf{x}_\text{best}) - \mu(\mathbf{x})\right) / \sigma(\mathbf{x})$; $\mu, \sigma$ are our predictive mean and standard-deviation functions; and $\Phi$ and $\mathcal{N}(\cdot\,; 0, 1)$ are the CDF and density of the standard normal distribution. This expected-improvement criterion is solvable analytically given Gaussian priors \cite{snoek2012practical}.
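The acquisition rule can be sketched over a discrete candidate set as follows. We use SciPy's standard normal CDF and PDF; the predictive means, standard deviations and incumbent value are illustrative stand-ins for a fitted Gaussian-process surrogate:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Expected improvement for minimisation:
    EI(x) = sigma(x) * (gamma * Phi(gamma) + phi(gamma)),
    with gamma = (f_best - mu(x)) / sigma(x)."""
    gamma = (f_best - mu) / sigma
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

# Stand-in GP predictions at three candidate hyperparameter settings.
mu = np.array([0.2, -0.1, 0.05])     # predictive means
sigma = np.array([0.3, 0.05, 0.5])   # predictive standard deviations
ei = expected_improvement(mu, sigma, f_best=0.0)
x_next = np.argmax(ei)               # candidate to evaluate next
```

Note how the criterion trades off exploitation (low predicted mean) against exploration (high predictive uncertainty): the third candidate wins here on uncertainty despite a mediocre mean.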
\section{Experiments}\label{experiments}
Unless specified, the default hyperparameters follow those found in \cite{ramoshilbert}.
\subsection{Logistic Bayes}\label{logistic-bayes}
Each cell is the average of 10 different hyper-parameter configurations sampled from our cross-validation grid.
\begin{table}[h!]
\centering
\caption{Ratio Comparison of the form \textit{Bayesian/MLE}}
\begin{tabular}{lll}
& \textbf{AUC Ratio} & \textbf{Runtime Ratio} \\ \hline
\textbf{Fourier Intel} & 0.91 & 13.3 \\
\textbf{Nystroem Intel} & 0.93 & 14.6 \\
\textbf{Sparse Intel} & 0.83 & 12.8
\end{tabular}
\end{table}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/WEIGHTS-DIST/WEIGHTS-DIST}
\caption{A random selection of several of the weight parameter distributions, with MLE/MAP estimate superimposed.%
}
\end{center}
\end{figure}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/SamplingCONV/SamplingCONV}
\caption{As above, with the sample by iteration.%
}
\end{center}
\end{figure}
\subsection{Gridsearch}\label{gridsearch}
Gridsearch is included here only as a tangential reference. Direct comparisons are difficult due to the heavily curated nature of an exhaustive grid search in comparison to the automation of the Bayesian model.
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/FN/FN}
\caption{Sample of Grid Search (Fourier and Nystroem)%
}
\end{center}
\end{figure}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/IM2/IM2}
\caption{Sample of Grid Search (Sparse)%
}
\end{center}
\end{figure}
\subsection{Bayesian Hyperparameter Optimisation}
In order to visually demonstrate the behaviour of the model, we have limited the following experiments to two free parameters: gamma and distance cutoff. It's worth noting that the Bayesian optimisation model performs very well with large numbers of hyperparameters, and as such our performance relative to the grid-search above will only improve.
The first graph demonstrates the best AUC metric found as a function of the iteration.
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/GOOD-MAX/GOOD-MAX}
\caption{Best AUC by Time%
}
\end{center}
\end{figure}
\subsubsection{Stochastic Process of
Hyperparameters}\label{stochastic-process-of-hyperparameters}
Here we have the value of each hyperparameter as a function of
iterations. We note that the path is quite dynamic, as is to be expected
given our model.
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/GAMMA/GAMMA}
\caption{Gamma by Time%
}
\end{center}
\end{figure}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/DIST1/DIST1}
\caption{Distance by Time%
}
\end{center}
\end{figure}
\subsubsection{Distribution of
Hyperparameters}\label{distribution-of-hyperparameters}
Below we have included a representation of the estimated posterior
distribution.
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/HEATMAP/HEATMAP}
\caption{Heatmap%
}
\end{center}
\end{figure}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/CONTOUR/CONTOUR}
\caption{Contour Map%
}
\end{center}
\end{figure}
\section{Discussion}
In terms of results, our Bayesian logit model was slower and less accurate. As a sanity check, our MLE estimate is correctly situated around the peak of each of our posterior distributions. This is unsurprising, as we have ensured that our $\lambda$ parameters align in terms of posterior distribution/goal function. While there is clear Markovian behaviour present in the sampling process, the formulation of our algorithm guarantees convergence in distribution, at least in the limit. As such we do not attribute the model's poor performance to this.
In relation to Bayesian optimisation, we propose it is a highly effective algorithm apropos its convergence behaviour, flexibility and efficiency. We obtain very strong results almost immediately, hitting an AUC of $\sim0.94$ within the first dozen experiments. This is better than any results obtained through grid-search. Furthermore, we get within $0.005$ of our optimal AUC within 50 experiments.
We note that the marginal distribution places relatively high mass near the upper boundary of the gamma parameter. As such we might consider increasing this bound in future work.
The sampled parameter posterior is sporadic, complicated and multi-modal. There appears to be no obvious linear effects or patterns. Additionally, the model appears to be highly sensitive to the parameters. All of these factors ensure that less intelligent parameter selection techniques are less than ideal. Within this context, the results of the Bayesian optimisation model are very impressive.
\section{Conclusion}
Our implementation of Bayesian logit performs substantially worse on this problem than an SGD approach. In contrast, the Bayesian optimisation method performed phenomenally. We avoid making too many direct comparisons to grid-search, mainly because a brute-grid is far from state of the art. Nevertheless, we believe our experiments have shown this hyper-parameter routine to be useful and effective in and of itself.
The combination of kernel-approximator and lightweight classifier is a novel approach with strong experimental results. We believe that Bayesian hyper-parameter optimisation is a powerful complement and augmentation. It solves the issues of domain-specificity and hyper-parameter sensitivity. The efficacy and generalisability of this combination is incredibly exciting, and we foresee a wide range of applications in the future.
\selectlanguage{english}
\FloatBarrier
\bibliographystyle{plain}
\bibliography{bibliography/converted_to_latex.bib%
}
\end{document}