# Least Squares

Least squares is a widely used class of convex optimization that can be applied to solve a variety of problems, for instance: in statistics for curve fitting, in machine learning to compute support vector machines, in robotics to solve inverse kinematics, etc. They are related to quadratic programming (QP), having a slightly more intuitive objective function, and the first step beyond linear programming (LP) in convex optimization.

## Standard form

A constrained weighted linear least squares (LS) problem is written as:

This problem seeks the vector \(x\) of optimization variables such that
\(R x\) is as "close" as possible to a prescribed vector \(s\),
meanwhile satisfying a set of linear inequality and equality constraints
defined by the matrix-vector couples \((G, h)\) and \((A, b)\),
respectively. Vector inequalities apply coordinate by coordinate. How close the
two vectors \(R x\) and \(s\) are is measured via the weighted norm
\(\| y \|_W = \sqrt{y^T W y}\), where the weight matrix \(W\) is
positive semi-definite and usually
diagonal. The overall function \(x \mapsto (1/2) \| R x - s \|^2_W\) is
called the *objective function* of the problem.

## Graphical interpretation

Let us illustrate the intuition behind this in 2D:

In the figure above, the level sets of the weighted norm are represented by ellipses centered on the target \(s\), while the linear set of inequality constraints corresponds to the orange polygon. (There is no equality constraint in this example.) Since the target \(s\) of the objective function is outside of the polygon, the optimum \(x^*\) of the LS lies at the boundary of the linear set.

In this example, the matrix \(R\) of the objective function is the
identity, *i.e.* \(x\) tries to be as "close" as possible to \(s\), and
the norm to measure how close the two vectors are is defined by the weight
matrix \(W = \mathrm{diag}(w_1, w_2)\):

The weights \(w_1\) and \(w_2\) shape the relative importance of proximity along the first coordinate \((x_1 - s_1)^2\) compared to proximity along the second coordinate \((x_2 - s_2)^2\). Note that the absolute value of these weights has no effect on the optimum of our problem: the only thing that matters is their relative value \(w_1 / w_2\). The higher this ration, the more the optimum will trade a larger \((x_2 - s_2)^2\) for a smaller \((x_1 - s_1)^2\).

*Question*: in the figure above, what is the largest weight, \(w_1\) or
\(w_2\)?

## Python example

Suppose we have access to a `solve_ls` function that solves least squares
problems in standard form. For instance, the qpsolvers module provides one directly if
you are using Python. Let us then see how to solve the following LS:

First, we write our LS matrices in proper format:

R = array([[1., 2., 0.], [-8., 3., 2.], [0., 1., 1.]]) s = array([3., 2., 3.]) G = array([[1., 2., 1.], [2., 0., 1.], [-1., 2., -1.]]) h = array([3., 2., -2.]) W = eye(3)

Finally, we compute the solution using the `solve_ls` function:

In [1]: solve_ls(R, s, G, h, W=W) Out[1]: array([ 0.12997347, -0.06498674, 1.74005305])

## Problem shaping

In fields such as optimal control, a lot of human work lies in solving the inverse problem: think of a desired behavior, formulate it as a numerical optimization problem, run the system, realize that the optimal solution to the problem is not what one wanted in the first place, update the problem accordingly, iterate. This process challenges us to get better at problem shaping: how to modify the problem to steer the behavior of its optimum towards what we want?

### Diagonal weight matrix

The first trick in the book is to use a diagonal weight matrix, and play with its weight to adjust the relative importance of one sub-objective, that is, one term \((R_i x - s_i)^2\) of the objective function, compared to the others. Consider a diagonal weight matrix:

This weighting produces the objective function:

If the behavior of the solution is too poor on one sub-objective \(i\) among our \(k\) sub-objectives, meaning \((R_i x^* - s_i)^2\) is larger than what we want at the optimum \(x^*\), we can improve it by increasing \(w_i\). However, this will come at the price of decreasing the performance on some other sub-objectives \(j \neq i\).

To explain this informally, let us get rid of the scale invariance of these weights. The facts that all weights are positive makes the expression \(\| R x - s \|^2_W\) a conical combination of squared residuals \((R_i x - s_i)^2\). We can account for this invariance by choosing weights as a convex combination:

Given any weighted LS problem, we can always renormalize the weights (dividing them by their collective sum) to get to an equivalent problem where weights form a convex combination. We can then focus without loss of generality on problems whose weights form a convex combination.

Intuitively, increasing a weight \(w_i\) results in improved performance of the corresponding objective, that is, a decrease in its corresponding squared residual \((R_i x^* - s_i)^2\) at the optimum. But now, the sum of weights being equal to 1, we see how an increase in \(w_i\) must be balanced by decreasing other weights \(w_j\) \((j \neq i)\), and thus lower the performance on other sub-objectives. Overall, there is no free lunch: when it is not possible to solve all sub-objectives perfectly (either because they contradict each other or because of inequality and equality constraints), the next best thing we can do is trade some sub-objectives for others.

## To go further

Least squares is covered in Chapter 10 of Nocedal and Wright's Numerical Optimization, as well as in Section 1.2 of Boyd and Vandenberghe's Convex Optimization. It is closely related to quadratic programming, and we can cast LS problems to QP in order to use readily available QP solvers.

Using relative weights in a linearly constrained optimization is a particular instance of lexicographic optimization with two levels of priority. Thinking in terms of levels of priority sheds a new light on problem shaping. For instance, when one weight gets predominant over the others, our LS problem becomes (increasingly difficult to solve and) closer to a three-level optimization. In that case, we can get better performance by using a proper lexicographic solver. See Section 2 of Geometric and numerical aspects of redundancy for a concrete discussion on this.