First thoughts
We have a non-conjugate model, which limits the scope of the variational methods we can use. One option is non-parametric VI (NPVI; Gershman et al.), which doesn't require conjugacy and models the posterior as a uniformly weighted mixture of isotropic Gaussians. We might be able to improve on their procedure, although it may be sufficient as it stands.
VI basics
The objective is to find an approximate posterior distribution \(q(\theta)\) which minimises the KL divergence \(KL[ q||p ]\), with \(p\) the posterior distribution over the parameters given the data.
Now, 
\(KL[q||p] = \int_\Theta q \log\left( \frac{q}{p}\right) \, \rm{d}\theta\)
                        \(= \int_\Theta q \log q \, \rm{d}\theta - \int_\Theta q \log (\pi \mathcal{L}) \, \rm{d}\theta + \int_\Theta q \log p(y) \, \rm{d}\theta\)
                        \(= - \mathcal{H}[q] - \mathbb{E}_q[\log (\pi \mathcal{L})] + \log p(y)\)
So minimising the KL divergence is equivalent to maximising \(\rm{ELBO}[q] = \mathcal{H}[q] + \mathbb{E}_q[\log (\pi \mathcal{L})]\), which we call the evidence lower bound (ELBO) because, since the KL divergence is non-negative, it is a lower bound on the log-evidence \(\log p(y)\).
In NPVI, we choose the variational family to be the set \(\mathcal{Q} = \left\{ q| q = \frac{1}{N} \sum_{i = 1}^{N} q_i; \ q_i \textrm{ is an isotropic Gaussian pdf } \forall i\right\}\) and the objective is to find \(\rm{argmax}_{q \in \mathcal Q} \rm{ELBO}[q]\).
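For concreteness, a minimal sketch (not from the NPVI paper; the helper name and parametrisation are mine) of evaluating the log-density of such a uniformly weighted isotropic mixture:

```python
# Sketch: log-density of q = (1/N) * sum_n N(theta; mu_n, sigma_n^2 I).
# Names and parametrisation are my own, not taken from the NPVI paper.
import numpy as np
from scipy.special import logsumexp

def log_q(theta, means, sigmas):
    """theta: (M, p) evaluation points; means: (N, p); sigmas: (N,) isotropic sds."""
    M, p = theta.shape
    N = means.shape[0]
    # squared distances between evaluation points and component means, shape (M, N)
    d2 = ((theta[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    log_comp = (-0.5 * d2 / sigmas[None, :] ** 2
                - p * np.log(sigmas[None, :])
                - 0.5 * p * np.log(2 * np.pi))
    # uniform weights 1/N
    return logsumexp(log_comp, axis=1) - np.log(N)
```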
NPVI uses a lower bound on the entropy term \(\mathcal{H}[q]\) and a Taylor approximation to the expectation to reach an approximate ELBO to maximise.
We can write the following:
\(\mathcal{H}[q] + \mathbb{E}_q[\log(\pi \mathcal{L})] = \mathcal H [q] + \mathbb E_q [\log \pi] + \mathbb E_q [\ell]\), where \(\ell = \log \mathcal L\) denotes the log-likelihood.
Log-Sobolev inequalities
We can bound the entropy term with a log-Sobolev inequality
\(- \mathcal{H}[q] \leq p \log \left( C \int_\Theta |\nabla q| \, \rm d \theta \right)\) with \(C = \frac{1}{\sqrt{\pi} p} \left( \Gamma \left( \frac{p}{2} + 1 \right) \right)^\frac{1}{p}\), where \(p\) is the dimension of the parameter space. Note that similar inequalities are available for entropies taken with respect to a standard multivariate Gaussian measure \(\mu\) (which here is the prior), i.e. for quantities of the form \(\int_{\mathbb R^p} q\log q \, \rm d \mu = \int_{\mathbb R^p} \pi q \log q \, \rm d \theta = \mathbb E_\pi [q(\theta)\log q(\theta)]\).
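As a quick numerical sanity check of this bound as written (taking \(C\) exactly as above), we can compare both sides for a standard Gaussian \(q\) in \(p\) dimensions; there \(\nabla q = -\theta q\), so \(\int_\Theta |\nabla q| \, \rm d\theta = \mathbb E_q[\|\theta\|]\), which is easy to estimate by Monte Carlo. A rough sketch:

```python
# Sanity check of  -H[q] <= p * log( C * int |grad q| dtheta )
# with C = Gamma(p/2 + 1)^{1/p} / (sqrt(pi) p), for q = N(0, I_p).
# Since grad q = -theta * q(theta), the integral is E_q[ ||theta|| ].
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)
p = 3
theta = rng.standard_normal((200_000, p))

lhs = -0.5 * p * np.log(2 * np.pi * np.e)            # -H[q] for N(0, I_p)
grad_int = np.linalg.norm(theta, axis=1).mean()       # MC estimate of E_q[||theta||]
C = np.exp(gammaln(p / 2 + 1) / p) / (np.sqrt(np.pi) * p)
rhs = p * np.log(C * grad_int)

print(f"-H[q] = {lhs:.4f} <= {rhs:.4f} ?", lhs <= rhs)
```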
We then obtain the following bound on the ELBO
\(\rm{ELBO}[q] \geq \mathbb E_q [\log \pi] + \mathbb E_q[\ell]- p \log \left( C \int_\Theta |\nabla q| \, \rm d \theta \right)\)
The expected prior is easy
Now, \(\mathbb E_q[\log \pi] = \sum_{n = 1}^{N}\omega_n \int_\Theta q_n \log \pi \, \rm d \theta\), where \(\omega_n\) are the mixture weights (\(\omega_n = 1/N\) in NPVI, but we keep them general here).
For latent Gaussian models, i.e. where all of the parameters have independent Gaussian priors, we can take \(\pi\) to be a \(p\)-dimensional Gaussian with diagonal covariance, or alternatively the product of \(p\) univariate Gaussians.
Thus the expectation becomes \(\mathbb E_q[\log \pi] = \sum_{n = 1}^{N}\omega_n \int_\Theta q_n \sum_{j = 1}^{p}\log \pi_j \, \rm d \theta\).
If we additionally restrict the variational family to be composed of Gaussians with diagonal covariance matrices, the expectation is
\(\mathbb E_q[\log \pi] = \sum_{n = 1}^{N}\omega_n \int_\Theta \prod_{k = 1}^{p}q_{nk} \sum_{j = 1}^{p}\log \pi_j \, \rm d \theta\)
                        \(= \sum_{n = 1}^{N} \omega_n \sum_{k = 1}^{p} \int_{\Theta_k} q_{nk}(\theta_k) \log(\pi_k(\theta_k)) \, \rm d \theta_k\)
i.e. a weighted average of the sum of the marginal expectations over the mixture components.
Denoting by \((\mu_k,\sigma_k)\) the mean-standard deviation pair for the \(k^{th}\) prior component and by \((\varphi_{nk},\rho_{nk})\) the mean-standard deviation pair for the \(k^{th}\) component of the \(n^{th}\) mixture density, we can write the expectation in closed form as
\(\mathbb E_q[\log \pi] = \sum_{n = 1}^{N}\omega_n \sum_{k = 1}^{p}\left[ \frac{\mu_k}{\sigma_k^2}\varphi_{nk} - \frac{1}{2\sigma_k^2}(\varphi_{nk}^2+\rho_{nk}^2) - \frac{\mu_k^2}{2\sigma_k^2} - \log(\sqrt{2\pi}\sigma_k)\right]\).
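A quick Monte Carlo check of this closed form (variable names in the snippet are mine):

```python
# Monte Carlo check of the closed-form E_q[log pi] above
# (diagonal Gaussian prior, mixture of diagonal Gaussians for q).
import numpy as np

rng = np.random.default_rng(1)
N, p = 4, 3                                      # mixture components, parameter dimension
w = np.full(N, 1.0 / N)                          # uniform weights omega_n
mu, sigma = rng.normal(size=p), rng.uniform(0.5, 2.0, size=p)           # prior
phi, rho = rng.normal(size=(N, p)), rng.uniform(0.5, 2.0, size=(N, p))  # components

# closed form: sum_n w_n sum_k [ mu_k phi_nk / sigma_k^2 - (phi_nk^2 + rho_nk^2)/(2 sigma_k^2)
#                                - mu_k^2/(2 sigma_k^2) - log(sqrt(2 pi) sigma_k) ]
closed = (w[:, None] * (mu * phi / sigma**2
                        - (phi**2 + rho**2) / (2 * sigma**2)
                        - mu**2 / (2 * sigma**2)
                        - np.log(np.sqrt(2 * np.pi) * sigma))).sum()

# Monte Carlo: pick component n with prob w_n, then theta ~ N(phi_n, diag(rho_n^2))
M = 400_000
comp = rng.choice(N, size=M, p=w)
theta = phi[comp] + rho[comp] * rng.standard_normal((M, p))
log_pi = (-0.5 * ((theta - mu) / sigma) ** 2 - np.log(np.sqrt(2 * np.pi) * sigma)).sum(1)

print(closed, log_pi.mean())   # should agree to a few decimal places
```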
We can now write an approximate ELBO as
\(\rm{ELBO}^*[q] = \sum_{n = 1}^{N}\omega_n \sum_{k = 1}^{p}\left[ \frac{\mu_k}{\sigma_k^2}\varphi_{nk} - \frac{1}{2\sigma_k^2}(\varphi_{nk}^2+\rho_{nk}^2) - \frac{\mu_k^2}{2\sigma_k^2} - \log(\sqrt{2\pi}\sigma_k)\right] + \mathbb E_q[\ell]- p \log \left( C \int_\Theta |\nabla q| \, \rm d \theta \right)\).
Back to tricky stuff
It remains to simplify the expected log-likelihood under the variational distribution and the integral arising from the log-Sobolev inequality.
Beginning with the inequality integral \(\int |\nabla q| \, \rm d \theta\):
We have \(q = \sum_{n = 1}^{N} \omega_n \mathcal N(\theta; \varphi_n,\mathrm{diag}(\rho_n^2)) = \sum_{n = 1}^{N} \omega_n \prod_{j=1}^{p}\mathcal N(\theta_j; \varphi_{nj},\rho_{nj}^2)\), keeping the \((\varphi_{nj},\rho_{nj})\) notation from above for the component means and standard deviations.
So \(\partial_{\theta_i}q = \sum_{n = 1}^{N} \omega_n \frac{(\varphi_{ni} - \theta_i)}{\rho_{ni}^2} \prod_{j=1}^{p}\mathcal N(\theta_j; \varphi_{nj},\rho_{nj}^2)\).
Thus
\(|\nabla q| = \sqrt{\sum_{i = 1}^{p} (\partial_{\theta_i}q)^2} = \sqrt{\sum_{i = 1}^{p} \left( \sum_{n = 1}^{N} \omega_n \frac{(\varphi_{ni} - \theta_i)}{\rho_{ni}^2} \prod_{j=1}^{p}\mathcal N(\theta_j; \varphi_{nj},\rho_{nj}^2) \right)^2} = \sqrt{\sum_{i = 1}^{p} \left( \sum_{n = 1}^{N} \omega_n \frac{(\varphi_{ni} - \theta_i)}{\rho_{ni}^2} q_n \right)^2}\)
which is not particularly nice at this stage.
We can write this as the norm of a matrix-vector product
\(|\nabla q| = ||A\mathbf q||_2\), with \((A)_{ij} = \omega_j\frac{\varphi_{ji} - \theta_i}{\rho_{ji}^2}\) and \(\mathbf q(\theta) = (q_1(\theta), \dots, q_N(\theta))^\top\) the vector of component densities; note that \(A = A(\theta)\) depends on \(\theta\).
i.e. we need to compute
\(\int_\Theta ||A\mathbf q||_2 \, \rm d \theta\); the gradient operator just turns into a matrix multiplication.
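One way to get at this integral numerically is importance sampling with \(q\) itself, since \(\int_\Theta ||A\mathbf q||_2 \, \rm d \theta = \mathbb E_q\left[||A\mathbf q||_2 / q\right]\) and sampling from the mixture is trivial. A rough sketch (names are mine, and this says nothing about how cheap the estimate would be inside an optimisation loop):

```python
# Sketch: evaluate |grad q| as ||A(theta) q(theta)||_2 and estimate
# int ||A q||_2 dtheta = E_q[ ||A q||_2 / q ] by sampling from the mixture.
import numpy as np

rng = np.random.default_rng(2)
N, p = 4, 3                                   # components, parameter dimension
w = np.full(N, 1.0 / N)
phi, rho = rng.normal(size=(N, p)), rng.uniform(0.5, 2.0, size=(N, p))

def comp_dens(theta):
    """Vector (q_1(theta), ..., q_N(theta)) of component densities at a point theta."""
    return np.exp(-0.5 * (((theta - phi) / rho) ** 2).sum(1)) / \
           ((2 * np.pi) ** (p / 2) * rho.prod(1))

def grad_norm(theta):
    """|grad q(theta)| = ||A(theta) q(theta)||_2, (A)_{ij} = w_j (phi_ji - theta_i) / rho_ji^2."""
    A = w[None, :] * (phi.T - theta[:, None]) / (rho.T ** 2)   # shape (p, N)
    return np.linalg.norm(A @ comp_dens(theta))

M = 20_000
comp = rng.choice(N, size=M, p=w)                              # pick components
samples = phi[comp] + rho[comp] * rng.standard_normal((M, p))  # theta ~ q
ratios = [grad_norm(t) / (w @ comp_dens(t)) for t in samples]
print("MC estimate of int |grad q| dtheta :", np.mean(ratios))
```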
So as it stands, we have the following approximate ELBO
\(\rm{ELBO}^*[q] = \sum_{n = 1}^{N}\omega_n \sum_{k = 1}^{p}\left[ \frac{\mu_k}{\sigma_k^2}\varphi_{nk} - \frac{1}{2\sigma_k^2}(\varphi_{nk}^2+\rho_{nk}^2) - \frac{\mu_k^2}{2\sigma_k^2} - \log(\sqrt{2\pi}\sigma_k)\right] + \mathbb E_q[\ell]- p \log \left( C \int_\Theta ||A\mathbf q||_2 \, \rm d \theta \right)\)
Using a different log-Sobolev inequality, we can write
\(\int_\Theta q \log q \, \rm d \theta \leq \frac{p}{2} \log \left( \frac{1}{2\pi pe} \int_\Theta \frac{|\nabla q|^2}{q} \, \rm d \theta \right)\)
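Since \(|\nabla q|^2/q = q\,||\nabla \log q||^2\), the integral in this bound is \(\mathbb E_q[\,||\nabla \log q||^2\,]\), i.e. the Fisher information of \(q\), which is again straightforward to estimate by Monte Carlo. A small sketch along the same lines as the previous one (names again mine):

```python
# Monte Carlo estimate of the integral in this bound: |grad q|^2 / q = q * ||grad log q||^2,
# so it equals E_q[ ||grad log q||^2 ], the Fisher information of q.
import numpy as np

rng = np.random.default_rng(3)
N, p = 4, 3
w = np.full(N, 1.0 / N)
phi, rho = rng.normal(size=(N, p)), rng.uniform(0.5, 2.0, size=(N, p))

def comp_dens(theta):                      # (q_1(theta), ..., q_N(theta))
    return np.exp(-0.5 * (((theta - phi) / rho) ** 2).sum(1)) / \
           ((2 * np.pi) ** (p / 2) * rho.prod(1))

def grad_log_q(theta):                     # grad q / q, via the matrix A(theta)
    A = w[None, :] * (phi.T - theta[:, None]) / (rho.T ** 2)
    qn = comp_dens(theta)
    return (A @ qn) / (w @ qn)

M = 20_000
comp = rng.choice(N, size=M, p=w)
samples = phi[comp] + rho[comp] * rng.standard_normal((M, p))
fisher = np.mean([np.sum(grad_log_q(t) ** 2) for t in samples])
print("bound on int q log q dtheta :", 0.5 * p * np.log(fisher / (2 * np.pi * p * np.e)))
```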
To approximate the expected log-likelihood, we might try Bayes-Hermite quadrature/Bayesian Monte Carlo, a GP-based method for approximating integrals which has a closed-form solution when the expectation is taken with respect to a mixture of Gaussians (and the kernel is squared-exponential). Alternatively, we could do the same as Gershman et al. and use a Taylor-expansion approximation, but that seems rather crude.
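A minimal sketch of what the Bayesian Monte Carlo estimate of \(\mathbb E_q[\ell]\) might look like with a squared-exponential kernel, where the kernel integrals against the mixture are available in closed form; the function name, design points and GP hyperparameters below are placeholders rather than a worked-out procedure:

```python
# Sketch of Bayesian Monte Carlo / Bayes-Hermite quadrature for E_q[l] with a
# squared-exponential kernel: the kernel integrals z_m = int k(theta, x_m) q(theta) dtheta
# are closed form when q is a mixture of diagonal Gaussians. Hyperparameters and
# design points are placeholders; in practice they would be chosen or fitted.
import numpy as np

def bmc_expected_loglik(loglik, X, w, phi, rho, lengthscale, signal_var, jitter=1e-8):
    """Estimate E_q[loglik]; X is (M, p) design points where loglik is evaluated."""
    M, p = X.shape
    ell = np.array([loglik(x) for x in X])

    # SE-kernel Gram matrix K over the design points
    d2 = ((X[:, None, :] - X[None, :, :]) / lengthscale) ** 2
    K = signal_var * np.exp(-0.5 * d2.sum(-1)) + jitter * np.eye(M)

    # z_m = sum_n w_n * s^2 * prod_d [ l_d / sqrt(l_d^2 + rho_nd^2)
    #                                  * exp(-(x_md - phi_nd)^2 / (2 (l_d^2 + rho_nd^2))) ]
    var = lengthscale ** 2 + rho ** 2                       # (N, p)
    diff2 = (X[:, None, :] - phi[None, :, :]) ** 2          # (M, N, p)
    z = signal_var * (w * np.prod(
            lengthscale / np.sqrt(var) * np.exp(-0.5 * diff2 / var), axis=-1)
        ).sum(-1)                                           # (M,)

    # posterior mean of int l(theta) q(theta) dtheta under a zero-mean GP prior on l
    return z @ np.linalg.solve(K, ell)
```

Here the design points \(X\) might be the component means \(\varphi_n\) or a handful of samples from \(q\); the lengthscales and signal variance would need to be chosen or marginalised as in the Bayesian Monte Carlo literature, and the log-likelihood evaluations \(\ell(x_m)\) are the expensive part.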