# 2 R

• The version of R used to make this document is 4.1.2.

• The version of the Iso package used to make this document is 0.0.18.1.

• The version of the rmarkdown package used to make this document is 2.11.

# 3 Lagrange Multipliers

## 3.1 A Theorem

The following theorem originally comes from Shapiro (1979).

Theorem 3.1 Consider the following constrained optimization problem $$$\begin{split} \text{minimize } & f(x) \\ \text{subject to } & g_i(x) = 0, \qquad i \in E, \\ & g_i(x) \le 0, \qquad i \in I, \end{split} \tag{3.1}$$$ where $$E$$ and $$I$$ are disjoint finite sets, and its Lagrangian function $$$L(x, \lambda) = f(x) + \sum_{i \in E \cup I} \lambda_i g_i(x). \tag{3.2}$$$ If there exist $$x^*$$ and $$\lambda$$ such that

(a) $$x^*$$ minimizes $$L(\,\cdot\,, \lambda)$$,

(b) $$g_i(x^*) = 0$$, $$i \in E$$ and $$g_i(x^*) \le 0$$, $$i \in I$$,

(c) $$\lambda_i \ge 0$$, $$i \in I$$, and

(d) $$\lambda_i g_i(x^*) = 0$$, $$i \in I$$.

Then $$x^*$$ solves the constrained problem (3.1).

The conditions in the theorem are called

(a) Lagrangian minimization,
(b) primal feasibility,
(c) dual feasibility, and
(d) complementary slackness.

We say $$x$$ is feasible if the constraints in (3.1) are satisfied, and we say $$\lambda$$ is feasible if the constraints in condition (c) hold. That is, primal feasibility is feasibility of $$x$$, and dual feasibility is feasibility of $$\lambda$$.

The components of $$\lambda$$ are called Lagrange multipliers. As the eponym suggests, this theorem is old. In the case where there are only equality constraints, it was invented by Lagrange. Handling inequality constraints is a twentieth-century innovation.

Proof. By condition (a), $$$L(x, \lambda) \ge L(x^*, \lambda)$$$ for all $$x$$, which is equivalent to $$$f(x) \ge f(x^*) + \sum_{i \in E \cup I} \lambda_i g_i(x^*) - \sum_{i \in E \cup I} \lambda_i g_i(x).$$$ So for feasible $$x$$ $$$f(x) \ge f(x^*) - \sum_{i \in E \cup I} \lambda_i g_i(x) \ge f(x^*).$$$ The first inequality uses conditions (b) and (d), which imply $$\sum_{i \in E \cup I} \lambda_i g_i(x^*) = 0$$, and the second inequality uses condition (c) and feasibility of $$x$$.

## 3.2 Kuhn-Tucker Conditions

When condition (a) is replaced by derivative equal to zero, that is $$$\nabla f(x^*) + \sum_{i \in E \cup I} \lambda_i \nabla g_i(x^*) = 0,$$$ then we no longer have a theorem because we know that derivative equal to zero is neither necessary nor sufficient for even a local minimum, much less a global one. Nevertheless, it is the conditions in this form that have an eponym. They are called the Kuhn-Tucker conditions or the Karush-Kuhn-Tucker conditions.

For convex problems the Kuhn-Tucker conditions often guarantee a global minimum. See the section about convex programming below.

# 4 Isotonic Regression

Isotonic regression is the competitor of simple linear regression (univariate response, univariate predictor) when it is assumed that the regression function is monotone rather than linear (of course linear is also monotone, but not vice versa; we are making a more general, weaker assumption).

## 4.1 Homoscedastic Normal Errors

We start by assuming homoscedastic normal errors (as in the usual theory of linear models). It will turn out that exactly the same algorithm that solves this problem also does logistic regression or Poisson regression if link functions are chosen to make the problem exponential family (logit link for binomial response, log link for Poisson response).

As usual, we assume the components $$Y_i$$ of the response vector are random and the components $$x_i$$ of the predictor vector are fixed (if they are random, then we condition on their observed values).

Also, until further notice we assume the distribution of $$Y_i$$ is $$\text{Normal}(\mu_i, \sigma^2)$$, which is the usual linear models assumption. Now the monotonicity constraints are $$$x_i \le x_j \quad \text{implies} \quad \mu_i \le \mu_j.$$$ The unknown parameters are the vector $$\mu$$ having components $$\mu_i$$ and the scalar $$\sigma^2$$.

What are the maximum likelihood estimates?

## 4.2 Rewriting to Deal with Duplicate Predictor Values

We see that $$x_i = x_j$$ implies $$\mu_i = \mu_j$$ (apply the monotonicity constraint in both directions). So rewrite the problem so we only have one mean parameter for each unique predictor value.

Let $$z_1$$, $$\ldots,$$ $$z_k$$ be the unique predictor values in sorted order.

Define \begin{align*} w_j & = \sum_{\substack{i = 1 \\ x_i = z_j }}^n 1 \\ V_j & = \frac{1}{w_j} \sum_{\substack{i = 1 \\ x_i = z_j }}^n Y_i \end{align*} Then $$$V_j \sim \text{Normal}(\nu_j, \sigma^2 / w_j),$$$ where $$$\nu_1 \le \nu_2 \le \cdots \le \nu_k$$$ and where $$$\nu_j = \mu_i \quad \text{whenever} \quad z_j = x_i.$$$
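In R this collapsing can be sketched as follows (made-up data; the names x, y, z, w, and v mirror the notation in the displays above):

```r
# collapse duplicate predictor values: z = sorted unique predictors,
# w = how many observations share each z, v = within-group means of y
x <- c(1, 1, 2, 3, 3, 3)                        # made-up predictors
y <- c(2.0, 4.0, 1.0, 5.0, 6.0, 7.0)            # made-up responses
z <- sort(unique(x))
w <- as.vector(table(factor(x, levels = z)))    # group counts
v <- as.vector(tapply(y, factor(x, levels = z), mean))
```

Here z is (1, 2, 3), w is (2, 1, 3), and v is (3, 1, 6).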

## 4.3 Minus Log Likelihood

$$$f(\mu, \sigma^2) = n \log(\sigma) + \frac{1}{2 \sigma^2} \sum_{i = 1}^n (y_i - \mu_i)^2$$$
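Transcribed directly into R (the function name mlogl is our own, not from any package):

```r
# minus log likelihood for the homoscedastic normal model:
# mu is the mean vector, sigma the common standard deviation
mlogl <- function(mu, sigma, y) {
    length(y) * log(sigma) + sum((y - mu)^2) / (2 * sigma^2)
}
```

For fixed $$\sigma$$, minimizing this over $$\mu$$ is a least squares problem.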

## 4.4 Lagrangian

\begin{align*} L(\mu, \sigma^2, \lambda) & = n \log(\sigma) + \frac{1}{2 \sigma^2} \sum_{i = 1}^n (y_i - \mu_i)^2 + \sum_{j = 1}^{k - 1} \lambda_j (\nu_j - \nu_{j + 1}) \end{align*}

## 4.5 Kuhn-Tucker Conditions

We find, as usual in least squares problems, that the equations for the means don’t involve the variance, so our estimates for the means don’t involve the variance.

In particular, $$$\begin{split} \frac{\partial L(\mu, \sigma^2, \lambda)}{\partial \nu_m} & = \lambda_m - \lambda_{m - 1} - \frac{1}{\sigma^2} \sum_{i = 1}^n (y_i - \mu_i) \frac{\partial \mu_i}{\partial \nu_m} \\ & = \lambda_m - \lambda_{m - 1} - \frac{w_m (v_m - \nu_m)}{\sigma^2} \end{split} \tag{4.1}$$$ Note that $$\partial \mu_i / \partial \nu_m$$ is equal to one if $$x_i = z_m$$ and zero otherwise, so the terms in the sum in the first line are nonzero only if $$x_i = z_m$$.

As it stands this formula is only valid for $$1 < m < k$$. In order to make it valid for all $$m$$, we define $$\lambda_0 = \lambda_k = 0$$.

Setting (4.1) equal to zero gives the first Kuhn-Tucker condition; multiplying through by $$\sigma^2$$ gives

(a) $$- w_m (v_m - \nu_m) + \kappa_m - \kappa_{m - 1} = 0$$, $$m = 1$$, $$\ldots,$$ $$k$$,

where we have introduced $$\kappa_m = \sigma^2 \lambda_m$$.

Since $$\kappa_m$$ is zero, positive, or negative precisely when $$\lambda_m$$ is, the rest of the Kuhn-Tucker conditions (for the mean parameters) are

(b) $$\nu_j \le \nu_{j + 1}$$, $$j = 1$$, $$\ldots,$$ $$k - 1$$,

(c) $$\kappa_j \ge 0$$, $$j = 1$$, $$\ldots,$$ $$k - 1$$, and

(d) $$\kappa_j (\nu_j - \nu_{j + 1}) = 0$$, $$j = 1$$, $$\ldots,$$ $$k - 1$$.

And, we see that, as usual, the equations for maximum likelihood estimation of the mean parameters do not involve $$\sigma^2$$ (after some rewriting).
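The four conditions are easy to check numerically. Below is a sketch of such a checker (kt.check and its tolerance are our own invention, not from any package); it takes the collapsed data $$w$$ and $$v$$ and candidate $$\nu$$ and $$\kappa$$:

```r
# report whether each Kuhn-Tucker condition (a)-(d) holds, up to a
# tolerance; kappa has length k - 1 and we pad kappa_0 = kappa_k = 0
kt.check <- function(nu, kappa, w, v, tol = 1e-8) {
    k <- length(nu)
    kap <- c(0, kappa, 0)   # kap[m + 1] is kappa_m for m = 0, ..., k
    c(a = all(abs(- w * (v - nu) + kap[2:(k + 1)] - kap[1:k]) < tol),
      b = all(diff(nu) >= -tol),
      c = all(kappa >= -tol),
      d = all(abs(kappa * diff(nu)) < tol))
}
```

For example, with unit weights and $$v = (2, 1)$$, the pooled solution $$\nu = (1.5, 1.5)$$ with $$\kappa_1 = 1/2$$ passes all four checks.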

## 4.6 Blocks

We first consider applying only conditions (a) (derivative of Lagrangian equal to zero) and (d) (complementary slackness).

At any vector $$\nu$$, we divide the components into blocks of equal consecutive components. A block has length one if its $$\nu_j$$ differs from the components on either side.

So consider such a block. Suppose $$$\nu_{j^* - 1} \neq \nu_{j^*} = \cdots = \nu_{j^{{*}{*}} - 1} \neq \nu_{j^{{*}{*}}},$$$ where, to make this work for the edge cases, we define $$\nu_0 = - \infty$$ and $$\nu_{k + 1} = + \infty$$.

Complementary slackness implies $$\kappa_{j^* - 1} = \kappa_{j^{{*}{*}} - 1} = 0$$. Hence, now using condition (a), \begin{align*} 0 & = \sum_{j = j^*}^{j^{{*}{*}} - 1} \bigl[ - w_j (v_j - \nu_j) + \kappa_j - \kappa_{j - 1} \bigr] \\ & = - \sum_{j = j^*}^{j^{{*}{*}} - 1} w_j (v_j - \nu) \end{align*} where $$\nu = \nu_{j^*} = \cdots = \nu_{j^{{*}{*}} - 1}$$ (the $$\kappa$$ terms telescope, and the two end terms vanish).

Solving we get $$$\nu_{j^*} = \cdots = \nu_{j^{{*}{*}} - 1} = \frac{ \sum_{j = j^*}^{j^{{*}{*}} - 1} w_j v_j }{ \sum_{j = j^*}^{j^{{*}{*}} - 1} w_j }.$$$ In summary, applying conditions (a) and (d) only (so far), the common mean value in a block of equal means is the weighted average of the $$v_j$$ values for the block, which is the unweighted average of the $$y_i$$ values for the block.
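A quick numeric illustration (made-up numbers): the weighted average of the group means for a block agrees with the plain average of the underlying responses.

```r
# one block pooling two groups of sizes 2 and 3 (made-up data)
y <- c(2, 4, 1, 5, 6)               # responses in the block
w <- c(2, 3)                        # group sizes
v <- c(mean(y[1:2]), mean(y[3:5]))  # group means
nu <- sum(w * v) / sum(w)           # block value from the display above
stopifnot(isTRUE(all.equal(nu, mean(y))))
```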

## 4.7 The Pool Adjacent Violators Algorithm (PAVA)

PAVA does the following

1. [Initialize] Set $$\nu$$ and $$\kappa$$ to any values satisfying conditions (a), (c), and (d).

2. [Terminate] If condition (b) is satisfied, stop. [Kuhn-Tucker conditions are satisfied.]

3. [PAV] Choose any $$j$$ such that $$\nu_j > \nu_{j + 1}$$, and “pool” the blocks containing $$j$$ and $$j + 1$$, that is, make them one block (the $$\nu$$ value for this pooled block will again be the weighted average of the $$v$$ values for this pooled block). [This step maintains conditions (a), (c), and (d).]

4. Go to step 2.

The initialization step is easy. One way to do it is to set $$\nu = v$$ and $$\kappa = 0$$.
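To see that this initialization works, note that with $$\nu = v$$ the residual term in condition (a) vanishes, and with $$\kappa = 0$$ conditions (c) and (d) hold trivially. A short numeric check (made-up collapsed data):

```r
# initialization nu = v, kappa = 0 satisfies condition (a):
# - w * (v - nu) is zero and the kappa differences are zero
v <- c(3, 1, 6); w <- c(2, 1, 3)   # made-up collapsed data
nu <- v
kappa <- rep(0, length(v) - 1)
lhs <- - w * (v - nu) + c(kappa, 0) - c(0, kappa)
stopifnot(all(lhs == 0))
```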

## 4.8 Non-Determinism

PAVA is a non-deterministic algorithm. In step 3, if there is more than one pair of adjacent violators, then any one of them can be chosen to be pooled in that step. The choice can be made in any way: leftmost, rightmost, random, whatever.

## 4.9 Example

To keep things simple we will assume the predictor values are the indices of the response vector (so there are no repeated predictor values and they are in sorted order).

```r
pava <- function(y) {
    blocks <- as.list(y)
    repeat {
        block.means <- sapply(blocks, mean)
        block.diffs <- diff(block.means)
        if (all(block.diffs >= 0)) return(unlist(blocks))
        j <- which(block.diffs < 0)
        if (length(j) > 1) {
            # non-determinism !!!
            # never call R function sample with length of first arg equal to one
            # its behavior is one of the bad parts of R
            j <- sample(j, 1)
        }
        violators <- blocks[c(j, j + 1)]
        pool.length <- length(unlist(violators))
        pool.mean <- mean(unlist(violators))
        i <- seq_along(blocks)
        blocks <- c(blocks[i < j], list(rep(pool.mean, pool.length)),
            blocks[i > j + 1])
    }
}
```

So let’s try it out.

```r
y <- rnorm(20, mean = 1:20, sd = 3)
plot(y)
points(1:20, pava(y), pch = 19)
```
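As a sanity check, base R’s stats::isoreg solves the same unweighted isotonic least-squares problem, so on data with no repeated predictor values its fitted values should agree with pava(y) (by uniqueness of the solution). A standalone sketch using only isoreg:

```r
# fitted values from stats::isoreg are nondecreasing, and since each
# pooled block is fitted by its mean, the fit preserves the grand total
set.seed(42)
y <- rnorm(20, mean = 1:20, sd = 3)
fit <- isoreg(y)
stopifnot(all(diff(fit$yf) >= 0))                  # primal feasibility
stopifnot(isTRUE(all.equal(sum(fit$yf), sum(y))))  # blocks are means
```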

## 4.16 Philosophy

Note that the isotonic regression assumption is indisputable. The response is Bernoulli, and we know the success probability decreases with distance. Note that we are using mean value parameters to model the Bernoullis, so no assumption is made about link functions. Now consider that any parametric assumptions about the mean function would be highly disputable. And any nonparametric methods (smoothing splines or whatever) would also be disputable. You can do something else, but only isotonic regression is philosophically clean.

# 5 Convex Programming

We say the constrained optimization problem (3.1) above (the problem the Lagrange multiplier methodology solves) is a convex programming problem if

• the objective function $$f$$ is convex,

• the constraint functions $$g_i$$ are convex for the inequality constraints (for $$i \in I$$), and

• the constraint functions $$g_i$$ are affine for the equality constraints (for $$i \in E$$).

For a convex programming problem, there is no difference between the conditions in our Theorem 3.1 and the Kuhn-Tucker conditions. This is because of the subgradient inequality: for any convex function $$f$$ and any point $$x$$ where its derivative exists, $$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle, \qquad \text{for all } y.$$$ The right-hand side of this inequality is the best affine approximation to the function near $$x$$, the Taylor series for $$f$$ expanding around $$x$$ with only constant and linear terms kept. So if $$\nabla f(x) = 0$$, then $$f(y) \ge f(x)$$ for all $$y$$.
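A small numeric spot check of the subgradient inequality, using the convex function $$f(x) = x^2$$ (our choice of example):

```r
# the affine approximation at x0 lies below f everywhere on a grid
f <- function(x) x^2
f.prime <- function(x) 2 * x
x0 <- 1.5
y <- seq(-3, 3, by = 0.25)
stopifnot(all(f(y) >= f(x0) + f.prime(x0) * (y - x0)))
```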

For a convex programming problem, the Lagrangian is a convex function of $$x$$. Hence any point where the gradient of the Lagrangian is zero is a global minimizer of the Lagrangian.

A second issue about convex programming problems is whether we are guaranteed that the method of Lagrange multipliers can always find a solution.

Rockafellar (1970) is the authoritative source for the theory of convex optimization. It has multiple theorems about the existence of Lagrange multipliers. One which tells us about isotonic regression is Corollary 28.2.2, which says that if a problem has a solution (the minimum exists) and all of the constraints are affine, then the method of Lagrange multipliers works.

So this tells us that PAVA is guaranteed to find a global maximizer of the log likelihood. Moreover, since the log likelihood is strictly concave in the mean parameters, this global maximizer is unique.

# Bibliography

Barlow, R. E., Bartholomew, D. J., Bremner, J. M., et al. (1972) Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression. London: Wiley.

Dijkstra, E. W. (1976) A Discipline of Programming. Englewood Cliffs NJ: Prentice-Hall.

Feijen, W. H. J. and van Gasteren, A. J. M. (1999) On a Method of Multiprogramming. New York: Springer.

Misra, J. (2001) A Discipline of Multiprogramming: Programming Theory for Distributed Applications. New York: Springer.

Robertson, T., Wright, F. T. and Dykstra, R. (1988) Order Restricted Statistical Inference. Chichester: Wiley.

Rockafellar, R. T. (1970) Convex Analysis. Princeton: Princeton University Press.

Shapiro, J. F. (1979) Mathematical Programming: Structures and Algorithms. New York: Wiley.