authorea.com/8851

lbfgs: Efficient L-BFGS Optimization in R

Abstract

This vignette introduces the **lbfgs** package for R, which consists of a wrapper built around the libLBFGS optimization library written by Naoaki Okazaki. The lbfgs package implements both the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) and the Orthant-Wise Quasi-Newton Limited-Memory (OWL-QN) optimization algorithms. The L-BFGS algorithm solves the problem of minimizing an objective, given its gradient, by iteratively computing approximations of the inverse Hessian matrix. The OWL-QN algorithm finds the optimum of an objective plus the L1-norm of the problem’s parameters, and can be used to train log-linear models with L1-regularization. The package offers a fast and memory-efficient implementation of these optimization routines, which is particularly suited for high-dimensional problems. The lbfgs package compares favorably with other optimization packages for R in microbenchmark tests.

In this vignette we demonstrate how to use the **lbfgs** R package. ^{1} The original motivation for writing the package was the lack of a general R implementation of the Orthant-Wise Quasi-Newton Limited-Memory (OWL-QN) optimization algorithm (Andrew 2007), except in limited-scope contexts. The package uses the libLBFGS C++ library written by Naoaki Okazaki, which itself is a port of the Fortran method by Jorge Nocedal (Nocedal 1980). The linkage between R and C++ is achieved using Rcpp (Eddelbuettel 2013). The package is suitable for large-scale applications with significant numbers of parameters.

Throughout this vignette, we will be using the following notation, which largely follows Andrew and Gao (2007). Let \(f: \mathbb{R}^\mathbb{n} \mapsto \mathbb{R}\) be the objective function to be minimized. We also let the \(\lvert \lvert \cdot \lvert \lvert\) operator denote the \(L_2\) norm of a vector, and \(\lvert \lvert \cdot \lvert \lvert_1\) denote the \(L_1\) norm. \(\mathbf{B_k}\) is the Hessian matrix (or its approximation) of \(f\) at \(x^k\), and \(g^k\) if the gradient of \(f\) at the same point. We also let \(\mathbf{H_k} = \mathbf{B_k}^{-1}\) be the inverse of the (approximated) Hessian matrix.

The Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm (Liu 1989) is employed for solving high-dimensional minimization problems in scenarios where both the objective function and its gradient can be computed analytically. The L-BFGS algorithm belongs to the class of quasi-Newton optimization routines, which solve the given minimization problem by computing approximations to the Hessian matrix of the objective function. At each iteration, quasi-Newton algorithms locally model \(f\) at the point \(x^k\) using a quadratic approximation: \[Q(x) = f(x^k) + (x - x^k)^T g^k + \frac{1}{2} (x - x^k)^T \mathbf{B_k} (x - x^k)\]

A search direction can then be found by computing the vector \(x^*\) that minimizes \(Q(x)\). Assuming that Hessian is positive-definite, this is \(x^* = x^k - \mathbf{H_k} g^k\). The next search point is then found along the ray defined by \( x^k - \alpha \mathbf{H_k} g^k\). The procedure is iterated until the gradient is zero, with some degree of convergence tolerance. In order to optimize memory usage, the L-BFGS algorithm avoids storing the sequential approximations of the Hessian matrix. Instead, L-BFGS stores curvature information from the last \(m\) iterations of the algorithm, and uses them to find the new search direction. More specifically, the algorithm stores information about the spatial displacement and the change in gradient, and uses them to estimate a search direction without storing or computing the Hessian explicitly.

The L-BFGS method cannot be applied to problems with an objective function of the form \(r(x) = C \cdot \lvert \lvert x \lvert \lvert _1 = C \cdot \sum_i \lvert x_i \lvert\), such as LASSO regression or \(L_1\)-penalized log-linear models, given the non-differentiability of the objective function at any point where at least one of the parameters is zero. The OWL-QN algorithm exploits the fact that \(L_1\)-regularized objective functions will still be differential in any given orthant of the functional space. At each iteration, the algorithm chooses an orthant within which to evalute the function by estimating the sign of each of its parameters. The algorithm then constructs a quadratic approximation to the function in the given orthant using a regular L-BFGS procedure, and searches in the direction of the minimum of the approximation within the same orthant. For further details regarding OWL-QN, we refer the interested reader to the original article (Andrew 2007).

The lbfgs package is intended as a general-purpose library for numerical optimization with L-BFGS and OWL-QN. As such, its syntax and usage closely mirror those of other popular packages for numerical optimization in R. The following list provides brief comparisons between lbfsg and several other packages:

optim(x): The lbfgs package can be used as a drop-in replacement for the L-BFGS-B method of the optim and optimx, with performance improvements on particular classes of problems (see Section ...), especially if lbfgs is used in conjuction with C++ implementations of the objective and gradient functions. In addition, the possibility of introducing \(L_1\) penalization of the objective function allows for solution vectors with much higher sparsity, as most of the otherwise quasi-zero parameters are driven to zero.

penalized: The penalized package fits regression models with both \(L_1\) (lasso and fused lasso) and \(L_2\) (ridge) penalizations. TODO.

glmnet: The glmnet package fits lasso and elastic-net generalized linear models. TODO.

We are extremely thankful to Dustin Tingley for his advice and support.↩

We begin by using lbfgs to minimize a suite of simple test functions, and benchmarking the package against the L-BFGS-B optim method.

*The Rosenbrock function:*We define the Rosenbrock function mapping \(\mathbf{R}^2\) to \(\mathbf{R}\) as \(f(x,y) = 100 \cdot (y - x^2)^2 + (1 - x)^2\). The function has a global minimum at \((0,0)\) that lies within a a long, flat valley. We define the function and its gradient as R objects, and then run the optimization routine using both lbfgs and optim:`objective <- function(x) { 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2 } gradient <- function(x) { c(-400 * x[1] * (x[2] - x[1]^2) - 2 * (1 - x[1]), 200 * (x[2] - x[1]^2)) } out.lbfgs <- lbfgs(objective, gradient, c(-1.2, 1)) out.optim <- optim(c(-1.2, 1), objective, gradient)`

The results are the following:

`> out.lbfgs $value [1] 3.545445e-13 $par [1] 1.000001 1.000001 $convergence [1] 0 > out.optim $par [1] 0.9999997 0.9999995 $value [1] 2.267577e-13 $counts function gradient 47 47 $convergence [1] 0 $message [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"`

The results are essentially the same, but the lbfgs

## Share on Social Media