--- title: "Stat 8054 Lecture Notes: Slack Variables" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" output: html_document: number_sections: true md_extensions: -tex_math_single_backslash mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML css: "../bar.css" pdf_document: number_sections: true md_extensions: -tex_math_single_backslash --- # License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). # R * The version of R used to make this document is `r getRversion()`. * The version of the `glpkAPI` package used to make this document is `r packageVersion("glpkAPI")`. * The version of the `Matrix` package used to make this document is `r packageVersion("Matrix")`. * The version of the `rmarkdown` package used to make this document is `r packageVersion("rmarkdown")`. ```{r libraries} library(Matrix) library(glpkAPI) ``` # Motivation A student asked in class discussion "Are there analogs of isotonic regression to regressing on the median or other quantiles?" # Slack Variables A "slack variable" is a nonnegative variable introduced to allow functions with discontinuous derivatives to be represented in the form required by common optimization software. For example, the absolute value function can be represented by replacing one variable with two. Instead of $\lvert x \rvert$ we write $x_1 + x_2$ where $x = x_1 - x_2$ and both $x_1$ and $x_2$ are required to be nonnegative. # Isotonic L1 Regression We take for our example a simplified version of the isotonic regression problem where there are no repeated predictor values. This can be generalized to handle repeated predictor values as discussed in the isotonic regression handout. \begin{align*} \text{minimize } & \sum_{i = 1}^n \lvert y_i - \mu_i \rvert \\ \text{subject to } & \mu_1 \le \mu_2 \le \cdots \le \mu_n \end{align*} And we reformulate this using slack variables in the obvious way (there may be a nonobvious way to reformulate it using only half as many slack variables but I cannot be bothered to think it up). \begin{alignat*}{2} \text{minimize } & \sum_{i = 1}^n (u_i + v_i) \\ \text{subject to } & u_i \ge 0, & \qquad & i = 1, \ldots, n \\ & v_i \ge 0, & & i = 1, \ldots, n \\ & y_i - \mu_i = u_i - v_i, & & i = 1, \ldots, n \\ & \mu_i \le \mu_{i + 1} & & i = 1, \ldots, n - 1 \end{alignat*} # Try It ## Data Make up some data ```{r data} set.seed(42) n <- 30 x <- seq(1, 3, length = 30) y <- rexp(n, 1 / x) ``` ## Linear Programming Now we set up our problem as linear programming, as discussed in the [preceding section](#isotonic-regression). Our variables are $u_i$, $v_i$, and $\mu_i$ for $i = 1$, $\ldots,$ $n$ but linear programming treats them as components of one state vector. So we have $3 n$ variables $n$ equality constraints and $3 n - 1$ inequality constraints. The linear programming software we are going to use allows solves problems of the form \begin{align*} \text{minimize } & g^T x \\ \text{subject to } & a \le x \le b \\ & c \le M x \le d \end{align*} where $x$ is the variable which we are optimizing and everything else is a constant ($a$, $b$, $c$, and $d$ are vectors and $M$ is a matrix). The lower bound vectors $a$ and $c$ can have components set to `-Inf` to allow for unbounded variables and the upper bound vectors $b$ and $d$ can have components set to `Inf` to allow for unbounded variables. 
We make this specific for our problem as follows
\begin{alignat*}{2}
   \text{minimize } & \sum_{i = 1}^n (u_i + v_i) \\
   \text{subject to } & -\infty \le \mu_i < \infty, & \qquad & i = 1, \ldots, n \\
   & 0 \le u_i \le \infty, & & i = 1, \ldots, n \\
   & 0 \le v_i \le \infty, & & i = 1, \ldots, n \\
   & y_i = \mu_i + u_i - v_i, & & i = 1, \ldots, n \\
   & 0 \le \mu_{i + 1} - \mu_i \le \infty, & & i = 1, \ldots, n - 1
\end{alignat*}
So among the $3 n - 1$ inequality constraints we have $2 n$ bound constraints
and $n - 1$ general inequality constraints.

We will have the variables (components of $x$) be the $\mu_i$, the $u_i$,
and the $v_i$, in that order.
```{r setup}
gg <- c(rep(0, n), rep(1, 2 * n))
aa <- c(rep(-Inf, n), rep(0, 2 * n))
bb <- rep(Inf, 3 * n)
cc <- c(y, rep(0, n - 1))
dd <- c(y, rep(Inf, n - 1))
idmat <- diag(n)
mm <- cbind(idmat, idmat, - idmat)
mm2 <- matrix(0, n - 1, 3 * n)
mm2[row(mm2) == col(mm2)] <- -1
mm2[row(mm2) + 1 == col(mm2)] <- 1
mm <- rbind(mm, mm2)
```
Now we are set up: the math objects $g$, $a$, $b$, $c$, $d$, and $M$ are the
R objects `gg`, `aa`, `bb`, `cc`, `dd`, and `mm` (we doubled the letters to
avoid masking the R function `c`).
```{r load.lp}
lp <- initProbGLPK()
addRowsGLPK(lp, nrow(mm))
addColsGLPK(lp, ncol(mm))
# STFU (turn off solver messages)
setSimplexParmGLPK(MSG_LEV, GLP_MSG_OFF)
# column bounds: free variables
idx <- which(aa == -Inf & bb == Inf)
setColsBndsGLPK(lp, idx, aa[idx], bb[idx], rep(GLP_FR, length(idx)))
# column bounds: other variables
idx <- which(aa == 0 & bb == Inf)
setColsBndsGLPK(lp, idx, aa[idx], bb[idx], rep(GLP_LO, length(idx)))
# row bounds: equality constraints
idx <- which(cc == dd)
setRowsBndsGLPK(lp, idx, cc[idx], dd[idx], rep(GLP_FX, length(idx)))
# row bounds: other constraints
idx <- which(cc < dd)
setRowsBndsGLPK(lp, idx, cc[idx], dd[idx], rep(GLP_LO, length(idx)))
# objective function
setObjDirGLPK(lp, GLP_MIN)
idx <- seq_along(gg)
setObjCoefsGLPK(lp, idx, gg)
# constraint matrix
mm <- Matrix(mm)
qux <- mat2triplet(mm)
idx <- qux$x != 0
loadMatrixGLPK(lp, sum(idx), qux$i[idx], qux$j[idx], qux$x[idx])
```
The types of variables are

 * `GLP_FR` free, unbounded in both directions,
 * `GLP_FX` fixed, equal to the lower bound (upper bound ignored),
 * `GLP_LO` lower bound only (no upper bound),
 * `GLP_UP` upper bound only (no lower bound), and
 * `GLP_DB` double bound (lower bound and upper bound both used).

Also note that this package uses sparse arithmetic for the constraint matrix.
If we had used sparse arithmetic all the way through (using R package
`Matrix`), then we could have done a very large problem, with millions of
variables, without using more memory than a laptop has. The fact that the
constraint matrix has $3 n (2 n - 1)$ components would not matter. It has
only $3 n + 2 (n - 1)$ nonzero components.

Now we have loaded the data into the format required by this package, and we
are ready to solve.
```{r solve.lp,error=TRUE}
solveSimplexGLPK(lp)
getPrimStatGLPK(lp) == GLP_FEAS
```
`TRUE` means the primal solution is feasible; since the simplex solver ran to
completion, the reported solution is optimal.
```{r solution}
x <- getColsPrimGLPK(lp)
```
Only the first $n$ components of `x` are interesting.
```{r solution.mu}
mu <- x[1:n]
```
Let's take a look.
```{r plot.solution,fig.align="center",fig.cap="Scatterplot with Isotonic L1 Regression. Hollow dots observed, solid dots fitted."}
plot(y)
points(mu, pch = 19)
```
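As an aside (not in the original notes), we can compare this $L_1$,
median-type, isotonic fit with the usual least squares isotonic fit computed
by the R function `isoreg`. Isotonic regression with distinct predictor
values depends only on the order of those values, so regressing on
$1, \ldots, n$ gives the same fit as regressing on the original predictor.
```{r compare-isoreg,fig.align="center",fig.cap="Isotonic L1 fit (solid dots) and least squares isotonic fit (dashed line). Hollow dots observed."}
lsq <- isoreg(seq_len(n), y)   # least squares isotonic regression
plot(y)
points(mu, pch = 19)           # the L1 fit from the linear program
lines(lsq$yf, lty = 2)         # the least squares fit
```

## Kuhn-Tucker Conditions

### Primal Feasibility

First get the rest of the variables.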
```{r u.and.v}
u <- x[(n + 1):(2 * n)]
v <- x[(2 * n + 1):(3 * n)]
```
And check the constraints on them.
```{r primal-feasibility}
all(u >= 0)
all(v >= 0)
all(diff(mu) >= 0)
all(y == mu + u - v)
max(abs(y - (mu + u - v)))
```
OK. Primal feasibility checks, to within the accuracy of computer arithmetic.
Note that the exact equality test failed because we naively expected the
computer to have *real* real numbers. But it is clear that
$`r max(abs(y - (mu + u - v)))`$ is just rounding error (inexactness of
computer arithmetic). We might have had to make similar modifications to the
other tests, but the naive approach just happened to work.

### Dual Feasibility

Now get the Lagrange multipliers. These are called dual variables by the
linear programming software.
```{r lagrange}
lambda.row <- getRowsDualGLPK(lp)
lambda.col <- getColsDualGLPK(lp)
length(lambda.row) == nrow(mm)
length(lambda.col) == ncol(mm)
```
We are not sure what sign conventions the linear programming software is
using, so we just look at what we got.
```{r lagrange.check}
lambda.col.mu <- lambda.col[1:n]
lambda.col.u <- lambda.col[(n + 1):(2 * n)]
lambda.col.v <- lambda.col[(2 * n + 1):(3 * n)]
range(lambda.col.mu)
range(lambda.col.u)
range(lambda.col.v)
```
Actually the `lambda.col.mu` are not Lagrange multipliers because they do not
correspond to constraints (the $\mu_i$ are unbounded). The other two ranges
check out if the rule is that Lagrange multipliers for lower bound
constraints are nonnegative. We have no upper bound constraints in this part.
```{r lagrange.check.too}
nrow(mm) == 2 * n - 1
lambda.row.y <- lambda.row[1:n]
lambda.row.mu <- lambda.row[- (1:n)]
range(lambda.row.mu)
range(lambda.row.y)
```
The check for `lambda.row.y` is OK. These are for the equality constraints
$y = \mu + u - v$, and Lagrange multipliers for equality constraints can be
of either sign. The way we wrote the $\mu$ constraints,
$\mu_{i + 1} - \mu_i \ge 0$, these are again lower bound constraints, so if
the rule we guessed above is correct, then these are also correct (all
nonnegative). Since we have no upper bound constraints in this part either,
we are not going to learn the rule for upper bound constraints unless we
rewrite the problem.

### Complementary Slackness

```{r complementary}
range(lambda.col.mu * mu)
range(lambda.col.u * u)
range(lambda.col.v * v)
range(lambda.row.y * (y - (mu + u - v)))
range(lambda.row.mu * diff(mu))
```
Everything checks. All zero up to the accuracy of computer arithmetic.

### Lagrangian Derivative Zero

The gradient vector of the objective function is `gg`, so the value of the
(linear) objective function at the solution is
```{r deriv}
sum(gg * x)
```
After some trial and error, we decide the Lagrangian is
$$
   \mathcal{L} = \left[ \sum_{i = 1}^n (u_i + v_i) \right]
   - \lambda_u^T u - \lambda_v^T v - \lambda_y^T (\mu + u - v - y)
   - \sum_{i = 1}^{n - 1} \lambda_\mu(i) (\mu_{i + 1} - \mu_i)
$$
and the derivatives are
\begin{align*}
   \partial \mathcal{L} / \partial \mu_i & = - \lambda_y(i) - \lambda_\mu(i - 1) + \lambda_\mu(i) \\
   \partial \mathcal{L} / \partial u_i & = 1 - \lambda_u(i) - \lambda_y(i) \\
   \partial \mathcal{L} / \partial v_i & = 1 - \lambda_v(i) + \lambda_y(i)
\end{align*}
where we define $\lambda_\mu(0) = \lambda_\mu(n) = 0$ in order not to have to
special case the first equation for $i = 1$ and $i = n$.

So the checks are
```{r lagrangian}
range(- lambda.row.y - c(0, lambda.row.mu) + c(lambda.row.mu, 0))
range(1 - lambda.col.u - lambda.row.y)
range(1 - lambda.col.v + lambda.row.y)
```
So everything checks.
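As a final check, not in the original notes, here is a sketch that bundles
all four Kuhn-Tucker conditions into one tolerance-based test, so that none
of the checks rely on floating point arithmetic producing exact zeros. It
uses only objects already computed above.
```{r kkt-check}
kkt.ok <- function(tol = sqrt(.Machine$double.eps)) {
    # derivatives of the Lagrangian, as derived above
    grad.mu <- - lambda.row.y - c(0, lambda.row.mu) + c(lambda.row.mu, 0)
    grad.u <- 1 - lambda.col.u - lambda.row.y
    grad.v <- 1 - lambda.col.v + lambda.row.y
    all(u > -tol, v > -tol, diff(mu) > -tol,          # primal feasibility
        abs(y - (mu + u - v)) < tol,
        lambda.col.u > -tol, lambda.col.v > -tol,     # dual feasibility
        lambda.row.mu > -tol,
        abs(lambda.col.u * u) < tol,                  # complementary slackness
        abs(lambda.col.v * v) < tol,
        abs(lambda.row.mu * diff(mu)) < tol,
        abs(grad.mu) < tol,                           # Lagrangian derivative zero
        abs(grad.u) < tol, abs(grad.v) < tol)
}
kkt.ok()
```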
We have proved this is a global optimum (because this is a convex problem,
the Kuhn-Tucker conditions are sufficient for a global minimum). The optimum
need not be unique, though: $L_1$ problems can have more than one solution
attaining the same minimal value.
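To spell out that last step (a compressed argument, not part of the original
notes): collect all the constraints, including the bound constraints and each
equality constraint split into two opposite inequalities, as $M x \ge c$, so
that all the Lagrange multipliers $\lambda$ are nonnegative. Dual
feasibility, zero derivative of the Lagrangian ($g = M^T \lambda$), and
complementary slackness ($\lambda^T (M x^* - c) = 0$) then give, for any
feasible $x$,
$$
   g^T x = \lambda^T M x \ge \lambda^T c = \lambda^T M x^* = g^T x^*
$$
so no feasible point has a smaller objective value than the solution $x^*$
we found.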