lbfgs: Efficient L-BFGS Optimization in R

Abstract

This vignette introduces the **lbfgs** package for R, which consists of a wrapper built around the libLBFGS optimization library written by Naoaki Okazaki. The lbfgs package implements both the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) and the Orthant-Wise Quasi-Newton Limited-Memory (OWL-QN) optimization algorithms. The L-BFGS algorithm solves the problem of minimizing an objective, given its gradient, by iteratively computing approximations of the inverse Hessian matrix. The OWL-QN algorithm finds the optimum of an objective plus the L1-norm of the problem’s parameters, and can be used to train log-linear models with L1-regularization. The package offers a fast and memory-efficient implementation of these optimization routines, which is particularly suited for high-dimensional problems. The lbfgs package compares favorably with other optimization packages for R in microbenchmark tests.

In this vignette we demonstrate how to use the **lbfgs** R package.[^1] The original motivation for writing the package was the lack of a general R implementation of the Orthant-Wise Quasi-Newton Limited-Memory (OWL-QN) optimization algorithm (Andrew 2007), except in limited-scope contexts. The package uses the libLBFGS C library written by Naoaki Okazaki, which is itself a port of the Fortran implementation by Jorge Nocedal (Nocedal 1980). The linkage between R and C++ is achieved using Rcpp (Eddelbuettel 2013). The package is suitable for large-scale applications with significant numbers of parameters.

Throughout this vignette, we will be using the following notation, which largely follows Andrew and Gao (2007). Let \(f: \mathbb{R}^n \mapsto \mathbb{R}\) be the objective function to be minimized. We let the \(\lVert \cdot \rVert\) operator denote the \(L_2\) norm of a vector, and \(\lVert \cdot \rVert_1\) denote the \(L_1\) norm. \(\mathbf{B_k}\) is the Hessian matrix (or its approximation) of \(f\) at \(x^k\), and \(g^k\) is the gradient of \(f\) at the same point. We also let \(\mathbf{H_k} = \mathbf{B_k}^{-1}\) be the inverse of the (approximated) Hessian matrix.

The Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm (Liu 1989) is employed for solving high-dimensional minimization problems in scenarios where both the objective function and its gradient can be computed analytically. The L-BFGS algorithm belongs to the class of quasi-Newton optimization routines, which solve the given minimization problem by computing approximations to the Hessian matrix of the objective function. At each iteration, quasi-Newton algorithms locally model \(f\) at the point \(x^k\) using a quadratic approximation: \[Q(x) = f(x^k) + (x - x^k)^T g^k + \frac{1}{2} (x - x^k)^T \mathbf{B_k} (x - x^k)\]

A search direction can then be found by computing the vector \(x^*\) that minimizes \(Q(x)\). Assuming that the Hessian is positive definite, this is \(x^* = x^k - \mathbf{H_k} g^k\). The next search point is then found along the ray defined by \(x^k - \alpha \mathbf{H_k} g^k\). The procedure is iterated until the gradient vanishes, up to a convergence tolerance. In order to economize on memory, the L-BFGS algorithm avoids storing the sequential approximations of the Hessian matrix. Instead, L-BFGS stores curvature information from the last \(m\) iterations of the algorithm, and uses it to find the new search direction. More specifically, the algorithm stores the spatial displacements and the changes in gradient, and uses them to estimate a search direction without storing or computing the Hessian explicitly.
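To make the curvature-pair bookkeeping concrete, the following base-R sketch implements the standard two-loop recursion on a small quadratic problem. It is an illustration of the technique, not the package's internal code; the fixed step sizes stand in for a proper line search.

```
# Sketch of the two-loop recursion: computes the search direction -H_k g^k
# from the m most recent displacement / gradient-change pairs, without ever
# forming the Hessian or its inverse.
lbfgs_direction <- function(g, S, Y) {
  # S[[i]]: spatial displacement s_i; Y[[i]]: change in gradient y_i;
  # index m holds the most recent pair
  m <- length(S)
  q <- g
  alpha <- numeric(m)
  rho <- numeric(m)
  for (i in m:1) {                       # first loop: newest pair to oldest
    rho[i]   <- 1 / sum(Y[[i]] * S[[i]])
    alpha[i] <- rho[i] * sum(S[[i]] * q)
    q <- q - alpha[i] * Y[[i]]
  }
  gamma <- sum(S[[m]] * Y[[m]]) / sum(Y[[m]] * Y[[m]])  # initial scaling H_0 = gamma * I
  r <- gamma * q
  for (i in 1:m) {                       # second loop: oldest pair to newest
    beta <- rho[i] * sum(Y[[i]] * r)
    r <- r + S[[i]] * (alpha[i] - beta)
  }
  -r                                     # search direction -H_k g^k
}

# Tiny driver on the quadratic f(x) = x'Ax / 2, whose gradient is A x
A <- diag(c(1, 10))
grad <- function(x) as.vector(A %*% x)
x <- c(5, 5); g <- grad(x)
S <- list(); Y <- list()
for (k in 1:20) {
  if (sqrt(sum(g^2)) < 1e-10) break      # converged to tolerance
  d <- if (length(S) == 0) -0.1 * g else lbfgs_direction(g, S, Y)
  x.new <- x + d
  g.new <- grad(x.new)
  S <- c(S, list(x.new - x)); Y <- c(Y, list(g.new - g))
  if (length(S) > 5) { S <- S[-1]; Y <- Y[-1] }  # keep only the last m = 5 pairs
  x <- x.new; g <- g.new
}
```

On this quadratic the stored pairs quickly reproduce the action of the exact inverse Hessian, so the iterates reach the minimizer at the origin within a handful of steps.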

The L-BFGS method cannot be applied to problems whose objective contains a penalty term of the form \(r(x) = C \cdot \lVert x \rVert_1 = C \cdot \sum_i \lvert x_i \rvert\), such as LASSO regression or \(L_1\)-penalized log-linear models, given the non-differentiability of the objective at any point where at least one of the parameters is zero. The OWL-QN algorithm exploits the fact that \(L_1\)-regularized objective functions remain differentiable within any given orthant of the function space. At each iteration, the algorithm chooses an orthant within which to evaluate the function by estimating the sign of each of its parameters. The algorithm then constructs a quadratic approximation to the function in the given orthant using a regular L-BFGS procedure, and searches in the direction of the minimum of the approximation, constrained to the same orthant. For further details regarding OWL-QN, we refer the interested reader to the original article (Andrew 2007).
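The orthant choice rests on the pseudo-gradient of the regularized objective, which replaces the ordinary gradient at the non-differentiable points. A base-R sketch of this quantity, following Andrew and Gao (2007) (an illustration, not the package's internal code):

```
# Pseudo-gradient of f(x) + C * ||x||_1: at a zero component, the one-sided
# derivative is used if it indicates a descent direction, and zero otherwise.
pseudo_gradient <- function(x, grad_f, C) {
  pg <- numeric(length(x))
  for (i in seq_along(x)) {
    if (x[i] != 0) {
      pg[i] <- grad_f[i] + C * sign(x[i])
    } else if (grad_f[i] + C < 0) {    # moving x_i positive decreases the objective
      pg[i] <- grad_f[i] + C
    } else if (grad_f[i] - C > 0) {    # moving x_i negative decreases the objective
      pg[i] <- grad_f[i] - C
    }                                   # else: x_i = 0 is optimal along coordinate i
  }
  pg
}

pg <- pseudo_gradient(x = c(1, 0, 0, -2), grad_f = c(0.5, -3, 0.1, 1), C = 1)
# The second coordinate should move off zero; the third stays at zero,
# and the fourth is already stationary.
```

At a component where \(x_i = 0\) and the pseudo-gradient vanishes, no one-sided move can decrease the objective; this is how OWL-QN keeps parameters exactly at zero.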

The lbfgs package is intended as a general-purpose library for numerical optimization with L-BFGS and OWL-QN. As such, its syntax and usage closely mirror those of other popular packages for numerical optimization in R. The following list provides brief comparisons between lbfgs and several other packages:

optim and optimx: The lbfgs package can be used as a drop-in replacement for the L-BFGS-B method of optim and optimx, with performance improvements on particular classes of problems (see Section ...), especially if lbfgs is used in conjunction with C++ implementations of the objective and gradient functions. In addition, the possibility of introducing \(L_1\) penalization of the objective function allows for solution vectors with much higher sparsity, as most of the otherwise quasi-zero parameters are driven to zero.

penalized: The penalized package fits regression models with both \(L_1\) (lasso and fused lasso) and \(L_2\) (ridge) penalizations. TODO.

glmnet: The glmnet package fits lasso and elastic-net generalized linear models. TODO.
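The exact-zero behavior that distinguishes these \(L_1\) approaches from ridge-style penalties can be seen in one dimension: the minimizer of \((x - b)^2/2 + C \lvert x \rvert\) is the soft-thresholding of \(b\), which is exactly zero whenever \(\lvert b \rvert \le C\). A minimal illustration (not code from any of the packages above):

```
# Closed-form minimizer of (x - b)^2 / 2 + C * |x|
soft_threshold <- function(b, C) sign(b) * pmax(abs(b) - C, 0)

# Check against brute-force grid minimization for one value of b
b <- 0.7; C <- 0.25
grid <- seq(-2, 2, by = 1e-4)
x.grid <- grid[which.min((grid - b)^2 / 2 + C * abs(grid))]
```

Any \(b\) with \(\lvert b \rvert \le C\) is mapped exactly to zero, whereas a quadratic penalty would only shrink it toward zero.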

[^1]: We are extremely thankful to Dustin Tingley for his advice and support.

We begin by using lbfgs to minimize a suite of simple test functions, and benchmarking the package against the L-BFGS-B optim method.

*The Rosenbrock function:* We define the Rosenbrock function mapping \(\mathbb{R}^2\) to \(\mathbb{R}\) as \(f(x,y) = 100 \cdot (y - x^2)^2 + (1 - x)^2\). The function has a global minimum at \((1,1)\) that lies within a long, flat valley. We define the function and its gradient as R objects, and then run the optimization routine using both lbfgs and optim:

```
objective <- function(x) {
  100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
}
gradient <- function(x) {
  c(-400 * x[1] * (x[2] - x[1]^2) - 2 * (1 - x[1]),
    200 * (x[2] - x[1]^2))
}
out.lbfgs <- lbfgs(objective, gradient, c(-1.2, 1))
out.optim <- optim(c(-1.2, 1), objective, gradient, method = "L-BFGS-B")
```

The results are the following:

```
> out.lbfgs
$value
[1] 3.545445e-13

$par
[1] 1.000001 1.000001

$convergence
[1] 0

> out.optim
$par
[1] 0.9999997 0.9999995

$value
[1] 2.267577e-13

$counts
function gradient 
      47       47 

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
```

The results are essentially the same: both routines converge to the minimum at \((1,1)\). The speed of the two implementations is compared in the microbenchmark tests below.

The package supports the implementation of the objective and gradient functions in C++, which may yield significant speed improvements over the corresponding R implementations. The optimization routine's API accepts both R function objects and external pointers to compiled C++ functions. In order to be compatible with the lbfgs API, the C++ functions **must** return an object of type Rcpp::NumericVector, and take in either one or two objects of type SEXP. The first argument of type SEXP must be the pointer to an R numerical vector containing the values of the function's parameters. The second (optional) argument must be the pointer to an R environment holding all extra parameters to be fed into the objective and gradient functions. To perform optimization on the Rosenbrock function, we begin by defining the C++ implementations of the objective and of the gradient as character strings, using the Rcpp package:

```
objective.include <- 'Rcpp::NumericVector rosenbrock(SEXP xs) {
  Rcpp::NumericVector x(xs);
  double x1 = x[0];
  double x2 = x[1];
  double sum = 100 * (x2 - x1 * x1) * (x2 - x1 * x1) + (1 - x1) * (1 - x1);
  Rcpp::NumericVector out(1);
  out[0] = sum;
  return(out);
}'
gradient.include <- 'Rcpp::NumericVector rosengrad(SEXP xs) {
  Rcpp::NumericVector x(xs);
  double x1 = x[0];
  double x2 = x[1];
  double g1 = -400 * x1 * (x2 - x1 * x1) - 2 * (1 - x1);
  double g2 = 200 * (x2 - x1 * x1);
  Rcpp::NumericVector out(2);
  out[0] = g1;
  out[1] = g2;
  return(out);
}'
```

Then we assign two character strings with the bodies of two functions to generate external pointers to the objective and the gradient:

```
objective.body <- '
typedef Rcpp::NumericVector (*funcPtr)(SEXP);
return(XPtr<funcPtr>(new funcPtr(&rosenbrock)));
'
gradient.body <- '
typedef Rcpp::NumericVector (*funcPtr)(SEXP);
return(XPtr<funcPtr>(new funcPtr(&rosengrad)));
'
```

Finally, we compile this ensemble using the inline package:

```
objective <- cxxfunction(signature(), body = objective.body,
                         inc = objective.include, plugin = "Rcpp")
gradient <- cxxfunction(signature(), body = gradient.body,
                        inc = gradient.include, plugin = "Rcpp")
```

Calling the two compiled functions returns the external pointers to the objective and the gradient, which can then be supplied to the lbfgs routine:

`out.CPP <- lbfgs(objective(), gradient(), c(-1.2,1), invisible=1)`

We define the same functions in R for comparison purposes:

```
objective.R <- function(x) {
100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
}
gradient.R <- function(x) {
c(-400 * x[1] * (x[2] - x[1]^2) - 2 * (1 - x[1]),
200 * (x[2] - x[1]^2))
}
```

A microbenchmark comparison reveals significant speed improvements:

```
microbenchmark(
  out.CPP <- lbfgs(objective(), gradient(), c(-1.2, 1), invisible = 1),
  out.R <- lbfgs(objective.R, gradient.R, c(-1.2, 1), invisible = 1)
)
```

The results are the following, including also the optim routine as a benchmark:

```
Unit: microseconds
                                                      expr     min       lq   median       uq       max neval
 lbfgs(objective(), gradient(), c(-1.2, 1), invisible = 1)  79.159  83.3620  88.1875  93.8975   189.992   100
 lbfgs(objective.R, gradient.R, c(-1.2, 1), invisible = 1) 275.400 285.6185 303.1600 316.7580  1393.450   100
  optim(c(-1.2, 1), objective.R, ..., method = "L-BFGS-B") 366.450 382.2585 415.9205 485.3085 20537.390   100
```

Much like in the R case, the passing of extra parameters to the C++ objective and gradient functions is supported; as described above, these are supplied through the optional second SEXP argument, a pointer to an R environment holding the extra data.
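For R objective functions, fixed data can simply be captured in a closure. The sketch below (illustrative, assuming the plain three-argument lbfgs() call shown earlier) builds a least-squares objective over fixed X and y, so that only the parameter vector is optimized:

```
# Extra data captured lexically: X and y are fixed inside the closures,
# and only par varies during optimization
make_objective <- function(X, y) function(par) sum((y - X %*% par)^2)
make_gradient  <- function(X, y) function(par) as.vector(-2 * t(X) %*% (y - X %*% par))

X <- cbind(1, c(1, 2, 3))   # intercept column plus one regressor
y <- c(1, 2, 3)
objective.ls <- make_objective(X, y)
gradient.ls  <- make_gradient(X, y)
# objective.ls and gradient.ls can now be passed to lbfgs() as before:
# out <- lbfgs(objective.ls, gradient.ls, c(0, 0))
```

Because y equals the second column of X exactly, the parameter vector (0, 1) attains a zero residual, which makes the closures easy to sanity-check.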
