lbfgs: Efficient L-BFGS Optimization in R

Abstract

This vignette introduces the **lbfgs** package for R, which consists of a wrapper built around the libLBFGS optimization library written by Naoaki Okazaki. The lbfgs package implements both the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) and the Orthant-Wise Quasi-Newton Limited-Memory (OWL-QN) optimization algorithms. The L-BFGS algorithm solves the problem of minimizing an objective, given its gradient, by iteratively computing approximations of the inverse Hessian matrix. The OWL-QN algorithm finds the optimum of an objective plus the L1-norm of the problem’s parameters, and can be used to train log-linear models with L1-regularization. The package offers a fast and memory-efficient implementation of these optimization routines, which is particularly suited for high-dimensional problems. The lbfgs package compares favorably with other optimization packages for R in microbenchmark tests.

In this vignette we demonstrate how to use the **lbfgs** R package. ^{1} The original motivation for writing the package was the lack of a general R implementation of the Orthant-Wise Quasi-Newton Limited-Memory (OWL-QN) optimization algorithm (Andrew 2007), except in limited-scope contexts. The package uses the libLBFGS C++ library written by Naoaki Okazaki, which itself is a port of the Fortran method by Jorge Nocedal (Nocedal 1980). The linkage between R and C++ is achieved using Rcpp (Eddelbuettel 2013). The package is suitable for large-scale applications with significant numbers of parameters.

Throughout this vignette, we will be using the following notation, which largely follows Andrew and Gao (2007). Let \(f: \mathbb{R}^\mathbb{n} \mapsto \mathbb{R}\) be the objective function to be minimized. We also let the \(\lvert \lvert \cdot \lvert \lvert\) operator denote the \(L_2\) norm of a vector, and \(\lvert \lvert \cdot \lvert \lvert_1\) denote the \(L_1\) norm. \(\mathbf{B_k}\) is the Hessian matrix (or its approximation) of \(f\) at \(x^k\), and \(g^k\) if the gradient of \(f\) at the same point. We also let \(\mathbf{H_k} = \mathbf{B_k}^{-1}\) be the inverse of the (approximated) Hessian matrix.

The Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm (Liu 1989) is employed for solving high-dimensional minimization problems in scenarios where both the objective function and its gradient can be computed analytically. The L-BFGS algorithm belongs to the class of quasi-Newton optimization routines, which solve the given minimization problem by computing approximations to the Hessian matrix of the objective function. At each iteration, quasi-Newton algorithms locally model \(f\) at the point \(x^k\) using a quadratic approximation: \[Q(x) = f(x^k) + (x - x^k)^T g^k + \frac{1}{2} (x - x^k)^T \mathbf{B_k} (x - x^k)\]

A search direction can then be found by computing the vector \(x^*\) that minimizes \(Q(x)\). Assuming that Hessian is positive-definite, this is \(x^* = x^k - \mathbf{H_k} g^k\). The next search point is then found along the ray defined by \( x^k - \alpha \mathbf{H_k} g^k\). The procedure is iterated until the gradient is zero, with some degree of convergence tolerance. In order to optimize memory usage, the L-BFGS algorithm avoids storing the sequential approximations of the Hessian matrix. Instead, L-BFGS stores curvature information from the last \(m\) iterations of the algorithm, and uses them to find the new search direction. More specifically, the algorithm stores information about the spatial displacement and the change in gradient, and uses them to estimate a search direction without storing or computing the Hessian explicitly.

The L-BFGS method cannot be applied to problems with an objective function of the form \(r(x) = C \cdot \lvert \lvert x \lvert \lvert _1 = C \cdot \sum_i \lvert x_i \lvert\), such as LASSO regression or \(L_1\)-penalized log-linear models, given the non-differentiability of the objective function at any point where at least one of the parameters is zero. The OWL-QN algorithm exploits the fact that \(L_1\)-regularized objective functions will still be differential in any given orthant of the functional space. At each iteration, the algorithm chooses an orthant within which to evalute the function by estimating the sign of each of its parameters. The algorithm then constructs a quadratic approximation to the function in the given orthant using a