Introduction

Least squares methods are widely used and underpin many concepts in statistics, machine learning, and computational neuroscience. This tutorial introduces ordinary and weighted least squares.

Ordinary Least Squares

The brain must continuously infer the causes of its sensory experiences. It has no direct access to the world around it and instead relies on sequences of action potentials (spike trains) sent from sensory receptors, which carry only correlative information about their causes. We denote these observed data by the vector $\mathbf{x}$ and assume they are generated by unobservable (“hidden”) causes in the world, which we denote by the vector $\boldsymbol\theta$. The generative model is linear:

$\begin{equation} \mathbf{x} = \boldsymbol\beta \boldsymbol\theta \end{equation}$

where $\boldsymbol\beta$ is a matrix of parameters that transforms $\boldsymbol\theta$ into $\mathbf{x}$. We can define the error in this model as

$\begin{equation} \mathbb{\xi} = \boldsymbol\beta\boldsymbol\theta - \mathbf{x}, \end{equation}$

and the loss function as

$\begin{equation} || \mathbb{\xi} ||^{2} = [\boldsymbol\beta\boldsymbol\theta - \mathbf{x}]^\top [\boldsymbol\beta\boldsymbol\theta - \mathbf{x}]. \end{equation}$

Expanding the product on the right-hand side gives:

$\begin{aligned} || \mathbb{\xi} ||^{2} & = [\boldsymbol\beta\boldsymbol\theta - \mathbf{x}]^\top [\boldsymbol\beta\boldsymbol\theta - \mathbf{x}] \\ & = [\boldsymbol\theta^\top \boldsymbol\beta^\top \boldsymbol\beta\boldsymbol\theta - \mathbf{x}^\top \boldsymbol\beta\boldsymbol\theta - \boldsymbol\theta^\top \boldsymbol\beta^\top \mathbf{x} + \mathbf{x}^\top \mathbf{x}] \end{aligned}$

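Taking the derivative of this function with respect to the causes $\boldsymbol\theta$, and noting that the two cross terms are scalars that are transposes of one another and therefore equal,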

$\begin{aligned} \frac{d}{d\boldsymbol\theta} || \mathbb{\xi} ||^{2} & = \frac{d}{d\boldsymbol\theta} [\boldsymbol\theta^\top \boldsymbol\beta^\top \boldsymbol\beta\boldsymbol\theta - \mathbf{x}^\top \boldsymbol\beta\boldsymbol\theta - \boldsymbol\theta^\top \boldsymbol\beta^\top \mathbf{x} + \mathbf{x}^\top \mathbf{x}] \\ & = 2\boldsymbol\beta^\top \boldsymbol\beta\boldsymbol\theta - 2 \boldsymbol\beta^\top \mathbf{x}. \end{aligned}$
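The gradient above can be sanity-checked numerically. The following is a minimal sketch in Python with NumPy, assuming small random values for $\boldsymbol\beta$, $\boldsymbol\theta$, and $\mathbf{x}$; the function names and array sizes are hypothetical and chosen only for illustration. It compares the analytic gradient with a central finite-difference approximation of the loss.

```python
import numpy as np

def loss(beta, theta, x):
    """Squared error ||beta @ theta - x||^2."""
    xi = beta @ theta - x
    return float(xi @ xi)

def analytic_grad(beta, theta, x):
    """Gradient from the derivation: 2 beta^T beta theta - 2 beta^T x."""
    return 2 * beta.T @ beta @ theta - 2 * beta.T @ x

def numeric_grad(beta, theta, x, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (loss(beta, theta + step, x)
                   - loss(beta, theta - step, x)) / (2 * eps)
    return grad

# Assumed toy problem: 6 observations, 3 hidden causes.
rng = np.random.default_rng(0)
beta = rng.normal(size=(6, 3))
theta = rng.normal(size=3)
x = rng.normal(size=6)

# Largest absolute disagreement between the two gradients.
max_diff = np.max(np.abs(analytic_grad(beta, theta, x)
                         - numeric_grad(beta, theta, x)))
print(max_diff)
```

If the derivation is right, `max_diff` should be tiny relative to the size of the gradient entries, limited only by floating-point round-off.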

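Setting this to zero, and assuming $\boldsymbol\beta^\top \boldsymbol\beta$ is invertible (i.e. $\boldsymbol\beta$ has full column rank), we can solve for $\boldsymbol\theta$: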

$\begin{equation} \boldsymbol\theta = (\boldsymbol\beta^\top \boldsymbol\beta)^{-1} \boldsymbol\beta^\top \mathbf{x}, \end{equation}$

which is the ordinary least squares equation.
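As an illustration, the ordinary least squares equation can be evaluated directly with NumPy. This is a sketch under stated assumptions rather than a definitive implementation: the dimensions, the values in `theta_true`, and the small additive Gaussian noise on $\mathbf{x}$ are all hypothetical (the model above has no explicit noise term). The result is compared against `np.linalg.lstsq`, which solves the same minimization.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_causes = 50, 3                      # assumed sizes

beta = rng.normal(size=(n_obs, n_causes))    # parameter matrix
theta_true = np.array([1.0, -2.0, 0.5])      # hypothetical hidden causes

# Observations: beta @ theta plus a little noise (an added assumption).
x = beta @ theta_true + 0.1 * rng.normal(size=n_obs)

# theta = (beta^T beta)^{-1} beta^T x, computed by solving the normal
# equations instead of forming the inverse explicitly.
theta_ols = np.linalg.solve(beta.T @ beta, beta.T @ x)

# Reference solution from NumPy's least squares routine.
theta_lstsq, *_ = np.linalg.lstsq(beta, x, rcond=None)

print(theta_ols)
print(np.allclose(theta_ols, theta_lstsq))
```

Using `np.linalg.solve` rather than `np.linalg.inv` is a common numerical choice; both correspond to the same equation when $\boldsymbol\beta^\top \boldsymbol\beta$ is invertible.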

Weighted Least Squares

Some measurements in $\mathbf{x}$ may be less important than others. To account for this, we can vary the contribution of each observation by assigning it a weight. We thus introduce a weight matrix $\mathbf{W}$ into $\mathbf{x} = \boldsymbol\beta\boldsymbol\theta$,

$\begin{equation} \mathbf{W}\mathbf{x} = \mathbf{W}\boldsymbol\beta\boldsymbol\theta, \end{equation}$

and consequently into $\mathbb{\xi} = \boldsymbol\beta\boldsymbol\theta - \mathbf{x}$:

$\begin{equation} \mathbf{W}\mathbb{\xi} = \mathbf{W}\boldsymbol\beta\boldsymbol\theta - \mathbf{W}\mathbf{x}. \end{equation}$

Consider the case of $n$ independent and identically distributed (iid) measurements, $\mathbf{x} = [x_{1}, x_{2}, \ldots, x_{n}]^\top$. Because the measurements are iid, they share a common variance, which we denote by $\sigma^2$ without a subscript. Let us suppose, for now, that

$\mathbf{W} = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & & \\ \vdots & & \ddots & \vdots \\ 0 & & \cdots & \sigma^2 \\ \end{bmatrix}.$

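This is one particular instance of a weight matrix you might choose. Because it is a scalar multiple of the identity, it rescales every measurement equally and cancels in the solution for $\boldsymbol\theta$ derived below. Typically, however, one would want to assign higher weights to measurements with greater precision, i.e. $1/\sigma^2$, which changes the solution only when the variances differ across measurements.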

Just as in the OLS case, we take the derivative of the (now weighted) loss function with respect to $\boldsymbol\theta$ and set it to zero to solve for $\boldsymbol\theta$.

$\begin{aligned} || \mathbf{W}\mathbb{\xi} ||^{2} & = [\mathbf{W}\boldsymbol\beta\boldsymbol\theta - \mathbf{W}\mathbf{x}]^\top[\mathbf{W}\boldsymbol\beta\boldsymbol\theta - \mathbf{W}\mathbf{x}]\\ \frac{d}{d\boldsymbol\theta} || \mathbf{W}\mathbb{\xi} ||^{2} & = \frac{d}{d\boldsymbol\theta}[\boldsymbol\theta^\top \boldsymbol\beta^\top \mathbf{W}^\top \mathbf{W} \boldsymbol\beta \boldsymbol\theta - \boldsymbol\theta^\top \boldsymbol\beta^\top \mathbf{W}^\top \mathbf{W}\mathbf{x} - \mathbf{x}^\top \mathbf{W}^\top \mathbf{W} \boldsymbol\beta \boldsymbol\theta + \mathbf{x}^\top \mathbf{W}^\top \mathbf{W} \mathbf{x}] \\ 0 & = 2 \boldsymbol\beta^\top \mathbf{W}^\top \mathbf{W} \boldsymbol\beta \boldsymbol\theta - 2\boldsymbol\beta^\top \mathbf{W}^\top\mathbf{W}\mathbf{x} \\ \boldsymbol\theta & = (\boldsymbol\beta^\top \mathbf{W}^\top \mathbf{W} \boldsymbol\beta)^{-1}\boldsymbol\beta^\top \mathbf{W}^\top\mathbf{W}\mathbf{x} \end{aligned}$
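The weighted solution can be sketched in NumPy as follows. Several things here are assumptions made for illustration and do not come from the derivation itself: the measurements are given unequal noise levels (`sigma` differs between the first and second half), the noise is additive Gaussian, and the weight matrix is chosen as $\mathbf{W} = \mathrm{diag}(1/\sigma_i)$ so that $\mathbf{W}^\top\mathbf{W}$ carries the precisions $1/\sigma_i^2$. The helper name `weighted_least_squares` is hypothetical.

```python
import numpy as np

def weighted_least_squares(beta, x, W):
    """theta = (beta^T W^T W beta)^{-1} beta^T W^T W x."""
    WtW = W.T @ W
    return np.linalg.solve(beta.T @ WtW @ beta, beta.T @ WtW @ x)

rng = np.random.default_rng(2)
n_obs = 200                                   # assumed number of measurements
beta = rng.normal(size=(n_obs, 2))
theta_true = np.array([2.0, -1.0])            # hypothetical hidden causes

# Assumption: the first half of the measurements is precise (sigma = 0.1),
# the second half is noisy (sigma = 2.0).
sigma = np.where(np.arange(n_obs) < n_obs // 2, 0.1, 2.0)
x = beta @ theta_true + sigma * rng.normal(size=n_obs)

# Assumed weight choice: W = diag(1 / sigma_i), so W^T W = diag(1 / sigma_i^2).
W = np.diag(1.0 / sigma)

theta_wls = weighted_least_squares(beta, x, W)

# Unweighted estimate on the same data, for comparison.
theta_ols = np.linalg.solve(beta.T @ beta, beta.T @ x)

print("WLS:", theta_wls)
print("OLS:", theta_ols)
```

Equivalently, the same estimate is obtained by running ordinary least squares on the rescaled problem $\mathbf{W}\mathbf{x} = \mathbf{W}\boldsymbol\beta\boldsymbol\theta$, which is how the weight matrix was introduced above.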

Note that setting $\mathbf{W} = \mathbf{I}$, where $\mathbf{I}$ is the identity matrix, results in ordinary least squares:

$(\boldsymbol\beta^\top \mathbf{I}^\top \mathbf{I} \boldsymbol\beta)^{-1}\boldsymbol\beta^\top \mathbf{I}^\top \mathbf{I} \mathbf{x} = (\boldsymbol\beta^\top \boldsymbol\beta)^{-1}\boldsymbol\beta^\top \mathbf{x}$
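This reduction can also be confirmed numerically. The sketch below, in Python with NumPy, uses hypothetical random data and a hypothetical helper `wls`. It checks that $\mathbf{W} = \mathbf{I}$ reproduces the ordinary least squares estimate, and that the same holds for a uniform weight matrix $\sigma^2\mathbf{I}$ (here with an assumed $\sigma = 0.5$), since a common scale factor cancels in the solution.

```python
import numpy as np

def wls(beta, x, W):
    """Weighted least squares: (beta^T W^T W beta)^{-1} beta^T W^T W x."""
    WtW = W.T @ W
    return np.linalg.solve(beta.T @ WtW @ beta, beta.T @ WtW @ x)

rng = np.random.default_rng(3)
n_obs = 20                                   # assumed number of measurements
beta = rng.normal(size=(n_obs, 4))
x = rng.normal(size=n_obs)

# Ordinary least squares estimate.
theta_ols = np.linalg.solve(beta.T @ beta, beta.T @ x)

# W = I, and W = sigma^2 I with an assumed sigma of 0.5.
theta_identity = wls(beta, x, np.eye(n_obs))
theta_uniform = wls(beta, x, 0.5**2 * np.eye(n_obs))

same_identity = np.allclose(theta_identity, theta_ols)
same_uniform = np.allclose(theta_uniform, theta_ols)
print(same_identity, same_uniform)
```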