
# lbfgs: Efficient L-BFGS Optimization in R

# Abstract

This vignette introduces the lbfgs package for R, which consists of a wrapper built around the libLBFGS optimization library written by Naoaki Okazaki. The lbfgs package implements both the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) and the Orthant-Wise Quasi-Newton Limited-Memory (OWL-QN) optimization algorithms. The L-BFGS algorithm solves the problem of minimizing an objective, given its gradient, by iteratively computing approximations of the inverse Hessian matrix. The OWL-QN algorithm finds the optimum of an objective plus the L1-norm of the problem’s parameters, and can be used to train log-linear models with L1-regularization. The package offers a fast and memory-efficient implementation of these optimization routines, which is particularly suited for high-dimensional problems. The lbfgs package compares favorably with other optimization packages for R in microbenchmark tests.

# Introduction

In this vignette we demonstrate how to use the lbfgs R package.[^1] The original motivation for writing the package was the lack of a general R implementation of the Orthant-Wise Quasi-Newton Limited-Memory (OWL-QN) optimization algorithm (Andrew 2007), outside of limited-scope contexts. The package uses the libLBFGS C++ library written by Naoaki Okazaki, which is itself a port of the Fortran L-BFGS implementation by Jorge Nocedal (Nocedal 1980). The linkage between R and C++ is achieved using Rcpp (Eddelbuettel 2013). The package is suitable for large-scale applications with significant numbers of parameters.

## Notation

Throughout this vignette, we will be using the following notation, which largely follows Andrew and Gao (2007). Let $$f: \mathbb{R}^n \to \mathbb{R}$$ be the objective function to be minimized. We let the $$\lVert \cdot \rVert$$ operator denote the $$L_2$$ norm of a vector, and $$\lVert \cdot \rVert_1$$ denote the $$L_1$$ norm. $$\mathbf{B_k}$$ is the Hessian matrix (or its approximation) of $$f$$ at $$x^k$$, and $$g^k$$ is the gradient of $$f$$ at the same point. We also let $$\mathbf{H_k} = \mathbf{B_k}^{-1}$$ be the inverse of the (approximated) Hessian matrix.

## The L-BFGS Algorithm

The Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm (Liu 1989) is employed for solving high-dimensional minimization problems in scenarios where both the objective function and its gradient can be computed analytically. The L-BFGS algorithm belongs to the class of quasi-Newton optimization routines, which solve the given minimization problem by computing approximations to the Hessian matrix of the objective function. At each iteration, quasi-Newton algorithms locally model $$f$$ at the point $$x^k$$ using a quadratic approximation: $Q(x) = f(x^k) + (x - x^k)^T g^k + \frac{1}{2} (x - x^k)^T \mathbf{B_k} (x - x^k)$

A search direction can then be found by computing the vector $$x^*$$ that minimizes $$Q(x)$$. Assuming that the Hessian is positive-definite, this is $$x^* = x^k - \mathbf{H_k} g^k$$. The next search point is then found along the ray defined by $$x^k - \alpha \mathbf{H_k} g^k$$. The procedure is iterated until the gradient is zero, within some convergence tolerance. In order to optimize memory usage, the L-BFGS algorithm avoids storing the sequential approximations of the Hessian matrix. Instead, L-BFGS stores curvature information from the last $$m$$ iterations of the algorithm and uses it to find the new search direction. More specifically, the algorithm stores the spatial displacements and the changes in gradient from those iterations, and uses them to estimate a search direction without storing or computing the Hessian explicitly.
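The update just described can be made concrete. The following is an illustrative sketch (not the package's internals) of the standard two-loop recursion that computes the search direction $$-\mathbf{H_k} g^k$$ from the stored displacement and gradient-change pairs; the function name and data layout here are our own.

```r
## Illustrative sketch of the L-BFGS two-loop recursion.
## s_list: displacements x^{i+1} - x^i; y_list: gradient changes
## g^{i+1} - g^i (both oldest first); g: current gradient.
two_loop_direction <- function(g, s_list, y_list) {
  m <- length(s_list)
  q <- g
  alpha <- numeric(m)
  rho <- vapply(seq_len(m),
                function(i) 1 / sum(y_list[[i]] * s_list[[i]]),
                numeric(1))
  for (i in rev(seq_len(m))) {           # newest to oldest
    alpha[i] <- rho[i] * sum(s_list[[i]] * q)
    q <- q - alpha[i] * y_list[[i]]
  }
  # Scale by an initial Hessian estimate built from the newest pair
  gamma <- sum(s_list[[m]] * y_list[[m]]) / sum(y_list[[m]] * y_list[[m]])
  q <- gamma * q
  for (i in seq_len(m)) {                # oldest to newest
    beta <- rho[i] * sum(y_list[[i]] * q)
    q <- q + s_list[[i]] * (alpha[i] - beta)
  }
  -q                                     # search direction -H_k g^k
}
```

Note that only $$2m$$ vectors are kept, so the cost per iteration is linear in the number of parameters, which is what makes the method practical in high dimensions.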

## The OWL-QN Algorithm

The L-BFGS method cannot be applied to problems whose objective includes a regularization term of the form $$r(x) = C \cdot \lVert x \rVert_1 = C \cdot \sum_i \lvert x_i \rvert$$, such as LASSO regression or $$L_1$$-penalized log-linear models, given the non-differentiability of the objective function at any point where at least one of the parameters is zero. The OWL-QN algorithm exploits the fact that $$L_1$$-regularized objective functions remain differentiable within any given orthant of the parameter space. At each iteration, the algorithm chooses an orthant within which to evaluate the function by estimating the sign of each of its parameters. The algorithm then constructs a quadratic approximation to the function in the given orthant using a regular L-BFGS procedure, and searches in the direction of the minimum of the approximation within the same orthant. For further details regarding OWL-QN, we refer the interested reader to the original article (Andrew 2007).
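For reference, the sign estimate above is obtained from the pseudo-gradient of the regularized objective. Letting $$\ell$$ denote the unregularized loss, the standard definition of the pseudo-gradient of $$\ell(x) + C \lVert x \rVert_1$$ (following Andrew and Gao 2007) is, component-wise:

$$
\diamond_i f(x) =
\begin{cases}
\partial_i \ell(x) + C \operatorname{sign}(x_i) & \text{if } x_i \neq 0 \\
\partial_i \ell(x) + C & \text{if } x_i = 0 \text{ and } \partial_i \ell(x) + C < 0 \\
\partial_i \ell(x) - C & \text{if } x_i = 0 \text{ and } \partial_i \ell(x) - C > 0 \\
0 & \text{otherwise}
\end{cases}
$$

At parameters that are exactly zero, the pseudo-gradient is nonzero only if moving off zero in some direction decreases the regularized objective, which is what drives redundant parameters to stay at zero.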

## Features of the Package

The lbfgs package is intended as a general-purpose library for numerical optimization with L-BFGS and OWL-QN. As such, its syntax and usage closely mirror those of other popular packages for numerical optimization in R. The following list provides brief comparisons between lbfgs and several other packages:

• optim and optimx: The lbfgs package can be used as a drop-in replacement for the L-BFGS-B method of optim and optimx, with performance improvements on particular classes of problems (see Section ...), especially if lbfgs is used in conjunction with C++ implementations of the objective and gradient functions. In addition, the possibility of introducing $$L_1$$ penalization of the objective function allows for solution vectors with much higher sparsity, as most of the otherwise quasi-zero parameters are driven to zero.

• penalized: The penalized package fits regression models with both $$L_1$$ (lasso and fused lasso) and $$L_2$$ (ridge) penalizations. Unlike lbfgs, however, it is limited to a fixed family of regression models, while lbfgs can apply $$L_1$$ penalization to any user-supplied objective and gradient.

• glmnet: The glmnet package fits lasso and elastic-net generalized linear models via cyclical coordinate descent. Its scope is likewise restricted to generalized linear models, while lbfgs places no restrictions on the form of the (differentiable) objective.
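As an illustration of the $$L_1$$-penalized mode, the following sketch applies OWL-QN to a simple separable quadratic. It assumes an orthantwise_c argument on the lbfgs routine (mirroring the libLBFGS parameter of the same name); the vector b and both functions are hypothetical examples, not package code.

```r
## Hedged sketch: OWL-QN via lbfgs() with an L1 penalty.
## We minimize sum((x - b)^2) + C * ||x||_1 starting from zero;
## components of b smaller than C/2 in absolute value should be
## driven exactly to zero, yielding a sparse solution.
library(lbfgs)

b <- c(5, -0.1, 2, 0.05)
objective <- function(x) sum((x - b)^2)
gradient  <- function(x) 2 * (x - b)

out <- lbfgs(objective, gradient, rep(0, length(b)),
             orthantwise_c = 1, invisible = 1)
out$par  # the small components of b are expected to be exactly zero
```

For this objective the exact solution is the soft-thresholding of b, so the sparsity pattern of out$par can be checked by hand.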

[^1]: We are extremely thankful to Dustin Tingley for his advice and support.

# Examples

## Simple test functions

We begin by using lbfgs to minimize a suite of simple test functions, and benchmarking the package against the L-BFGS-B optim method.

• The Rosenbrock function: We define the Rosenbrock function mapping $$\mathbf{R}^2$$ to $$\mathbf{R}$$ as $$f(x,y) = 100 \cdot (y - x^2)^2 + (1 - x)^2$$. The function has a global minimum at $$(1,1)$$ that lies within a long, flat valley. We define the function and its gradient as R objects, and then run the optimization routine using both lbfgs and optim:

objective <- function(x) {
    100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
}

gradient <- function(x) {
    c(-400 * x[1] * (x[2] - x[1]^2) - 2 * (1 - x[1]),
      200 * (x[2] - x[1]^2))
}

out.lbfgs <- lbfgs(objective, gradient, c(-1.2, 1))
out.optim <- optim(c(-1.2, 1), objective, gradient, method = "L-BFGS-B")

The results are the following:

> out.lbfgs
$value
[1] 3.545445e-13

$par
[1] 1.000001 1.000001

$convergence
[1] 0

> out.optim
$par
[1] 0.9999997 0.9999995

$value
[1] 2.267577e-13

$counts
function gradient 
      47       47 

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

The two routines yield essentially the same results.

# Implementing the Objective and Gradient in C++ for Faster Performance

## The Basics

The package supports the implementation of the objective and gradient functions in C++, which may yield significant speed improvements over the corresponding R implementations. The optimization routine’s API accepts both R function objects and external pointers to compiled C++ functions. In order to be compatible with the lbfgs API, the C++ functions must return an object of type Rcpp::NumericVector and take either one or two arguments of type SEXP. The first argument must be a pointer to an R numeric vector containing the values of the function’s parameters. The second (optional) argument must be a pointer to an R environment holding any extra parameters to be fed into the objective and gradient functions. To perform optimization on the Rosenbrock function, we begin by defining the C++ implementations of the objective and of the gradient as character strings, using the Rcpp package:

objective.include <- 'Rcpp::NumericVector rosenbrock(SEXP xs) {
Rcpp::NumericVector x(xs);
double x1 = x[0];
double x2 = x[1];
double sum = 100 * (x2 - x1 * x1) * (x2 - x1 * x1)  + (1 - x1) * (1 - x1);
Rcpp::NumericVector out(1);
out[0] = sum;
return(out);
}
'

gradient.include <- 'Rcpp::NumericVector rosenbrock_gradient(SEXP xs) {
Rcpp::NumericVector x(xs);
double x1 = x[0];
double x2 = x[1];
double g1 = -400 * x1 * (x2 - x1 * x1) - 2 * (1 - x1);
double g2 = 200 * (x2 - x1 * x1);
Rcpp::NumericVector out(2);
out[0] = g1;
out[1] = g2;
return(out);
}'

Then we assign two character strings with the bodies of two functions to generate external pointers to the objective and the gradient:

objective.body <- '
typedef Rcpp::NumericVector (*funcPtr)(SEXP);
return(XPtr<funcPtr>(new funcPtr(&rosenbrock)));
'

gradient.body <- '
typedef Rcpp::NumericVector (*funcPtr)(SEXP);
return(XPtr<funcPtr>(new funcPtr(&rosenbrock_gradient)));
'

Finally, we compile this ensemble using the inline package:

objective <- cxxfunction(signature(), body=objective.body,
                         inc=objective.include, plugin="Rcpp")
gradient <- cxxfunction(signature(), body=gradient.body,
                        inc=gradient.include, plugin="Rcpp")

The external pointers to the objective and the gradient generated by the two pointer-assigners can then be supplied to the lbfgs routine:

out.CPP <- lbfgs(objective(), gradient(), c(-1.2,1), invisible=1)

We define the same functions in R for comparison purposes:

objective.R <- function(x) {
    100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
}

gradient.R <- function(x) {
    c(-400 * x[1] * (x[2] - x[1]^2) - 2 * (1 - x[1]),
      200 * (x[2] - x[1]^2))
}

A microbenchmark comparison reveals significant speed improvements:

microbenchmark(out.CPP <- lbfgs(objective(), gradient(), c(-1.2,1), invisible=1),
               out.R <- lbfgs(objective.R, gradient.R, c(-1.2,1), invisible=1))

The results are the following, including also the optim routine as a benchmark:

Unit: microseconds
                                                      expr     min       lq   median       uq       max neval
 lbfgs(objective(), gradient(), c(-1.2, 1), invisible = 1)  79.159  83.3620  88.1875  93.8975   189.992   100
 lbfgs(objective.R, gradient.R, c(-1.2, 1), invisible = 1) 275.400 285.6185 303.1600 316.7580  1393.450   100
  optim(c(-1.2, 1), objective.R, ..., method = "L-BFGS-B") 366.450 382.2585 415.9205 485.3085 20537.390   100

## Extra Parameters in C++ Implementations

Much like in the R case, the passing of ex