\begin{align}
\hat{\mathbf{w}} &= \text{argmax}_\mathbf{w} \log \left( L(\mathbf{w} \mid \mathcal{D}) \times p(\mathbf{w} \mid \boldsymbol{\sigma}) \right)\\
&= \text{argmax}_\mathbf{w} \log L(\mathbf{w} \mid \mathcal{D}) + \log p(\mathbf{w} \mid \boldsymbol{\sigma})\\
&= \text{argmax}_\mathbf{w}\; \ell(\mathbf{w} \mid \mathcal{D}) - \frac{1}{2} \sum_j \frac{w_j^2}{\sigma_j^2},
\end{align}
where $\ell$ denotes the log-likelihood and additive constants not depending on $\mathbf{w}$ have been dropped. As mentioned, this has some reasonable qualities and has been shown to perform quite well \cite{Zhang_2003,fan2003loss}. The downside is that, while the Gaussian prior favours weights close to zero, it does not significantly favour weights being exactly equal to zero. This is of particular relevance to the problem at hand: since we can define an arbitrary number of features, and hence dimensions, it would be beneficial to enforce some sparsity on the weights. Regularisers that penalise the $\ell^1$ norm of the weights, rather than the $\ell^2$ norm, achieve exactly this (a one-dimensional illustration of the distinction is sketched below). To this end, we make the following observation:
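Before turning to that observation, consider the scalar case as an informal illustration of why an $\ell^1$ penalty yields exact zeros while an $\ell^2$ penalty merely shrinks; the target $z$ and penalty weight $\lambda > 0$ below are introduced solely for this aside.
\begin{align}
\text{argmin}_{w}\ \tfrac{1}{2}(w - z)^2 + \lambda w^2 &= \frac{z}{1 + 2\lambda},\\
\text{argmin}_{w}\ \tfrac{1}{2}(w - z)^2 + \lambda |w| &= \operatorname{sign}(z)\,\max(|z| - \lambda,\ 0).
\end{align}
The quadratic penalty only scales $w$ towards zero, whereas the absolute-value penalty returns exactly zero whenever $|z| \le \lambda$; the multivariate analogue of this thresholding behaviour is what makes $\ell^1$-regularised weight estimates sparse.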