Linear Regression

Ordinary Least Squares

Given a dataset of $N$ datapoints $\{x_i\}_{i=1}^{N}$ (where $x_i \in \mathbb{R}^D$) and corresponding labels $\{y_i\}_{i=1}^{N}$, we essentially learn a vector of weights $w \in \mathbb{R}^D$ and a scalar bias $b$ such that:

$$\hat{y}_i = w^\top x_i + b$$

where $\hat{y}_i$ denotes the predicted output of our linear model. Notice that the bias can be folded into the weights if the corresponding feature dimension is fixed to $1$, which essentially compresses the expression to:

$$\hat{y} = Xw$$

where $X \in \mathbb{R}^{N \times (D+1)}$ is now a matrix with the first column being a vector of ones.

We can consider the equations $Xw = y$ as a system of linear equations. As we know, a linear system may be consistent or inconsistent. In the consistent case (e.g. when $N \le D+1$), exact solutions do exist; however, the model then overfits (memorizes the training data), which is undesirable since it does not generalize well to new examples. In the inconsistent case, no exact solution exists at all. For most practical purposes $N \gg D$, leading the system to be inconsistent. So, what can we do?

We can think of a few ways to solve this problem:

  1. We can simply take $w = X^{-1} y$; however, since $X$ is not invertible (it is not even square in general, as $N \neq D+1$), this is not possible.
  2. In that case, we can take the Moore-Penrose pseudo-inverse of $X$. Let the Singular Value Decomposition (SVD) be $X = U \Sigma V^\top$, where $U$ contains the eigenvectors of $X X^\top$, $V$ contains the eigenvectors of $X^\top X$, and $\Sigma$ is the diagonal matrix of singular values, which are the square roots of the eigenvalues of $X^\top X$. Then $w = X^{+} y = V \Sigma^{+} U^\top y$, where $\Sigma^{+}$ inverts the non-zero singular values.

However, the eigenvalue and eigenvector calculation requires a lot of computations. Can we think of a way to reduce the computations?
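To make the pseudo-inverse route concrete, here is a minimal NumPy sketch (the synthetic data, sizes, and variable names are illustrative assumptions, not from the text); `np.linalg.pinv` computes the Moore-Penrose pseudo-inverse via the SVD:

```python
import numpy as np

# Synthetic data: N datapoints with D features, plus a column of ones for the bias.
rng = np.random.default_rng(0)
N, D = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])
w_true = rng.normal(size=D + 1)
y = X @ w_true + 0.1 * rng.normal(size=N)          # noisy targets

# Moore-Penrose pseudo-inverse solution w = X^+ y (np.linalg.pinv uses the SVD internally).
w_pinv = np.linalg.pinv(X) @ y
print(np.abs(w_pinv - w_true).max())               # small, since the noise is small
```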

(Figure: projection of $y$ onto the column space of $X$.)

We can think of a plane as the column space spanned by the dataset $X$, and we can assume that the ground-truth targets $y$ lie outside this plane. We can then take the projection of the ground-truth labels onto the plane, which should correspond to the predicted labels $\hat{y}$. Recall that the predicted labels were calculated as $\hat{y} = Xw$; therefore the vector $\hat{y}$ should lie in the column space of $X$. We can estimate the weights in the following two ways:

Method 1 (ERM)

Our aim is to minimize the empirical risk, which we can do by minimizing the error. We notice that the error is given by the difference between the actual label $y_i$ and the predicted label $\hat{y}_i$. We can take the squared differences (since the square is a monotonically increasing function on $[0, \infty)$) and average them across all the datapoints in the dataset to get the loss function:

$$\mathcal{L}(w) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - w^\top x_i \right)^2$$

This objective is popularly known as the "mean squared error" (MSE). However, since the loss has to be calculated over the entire dataset, we can always write it in the vectorized form:

$$\mathcal{L}(w) = \frac{1}{N} \left\| y - Xw \right\|_2^2$$

where $\|\cdot\|_2$ is the $\ell_2$ norm. Now, in order to minimize the loss, we can take the gradient with respect to $w$:

$$\nabla_w \mathcal{L} = \frac{2}{N} X^\top (Xw - y)$$

Taking the Hessian, we get:

$$\nabla_w^2 \mathcal{L} = \frac{2}{N} X^\top X$$

which is positive definite (assuming $X$ has full column rank, i.e. rank $D+1$), thus if we equate the gradient of the loss to $0$, we should obtain the minimum:

$$X^\top X w = X^\top y \quad \Longrightarrow \quad w = (X^\top X)^{-1} X^\top y$$

Note that inverting the matrix $X^\top X$ is not possible unless $X$ has full column rank, and hence our previous assumption was reasonable.
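A minimal sketch of the closed-form solution via the normal equations, under the same illustrative setup (synthetic data, hypothetical variable names):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])   # column of ones absorbs the bias
y = X @ rng.normal(size=D + 1) + 0.1 * rng.normal(size=N)

# Normal equations: (X^T X) w = X^T y.  Solving the linear system is cheaper and
# numerically safer than explicitly forming the inverse (X^T X)^{-1}.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# With full column rank this coincides with the pseudo-inverse solution.
assert np.allclose(w_ols, np.linalg.pinv(X) @ y)
```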

Method 2 (Projection)

Alternatively, we can directly calculate the projection of the ground-truth labels onto the plane. The error can be expressed as $e = y - Xw$. Since we know that this error vector is perpendicular to the column space of $X$, we get the following relation:

$$X^\top (y - Xw) = 0 \quad \Longrightarrow \quad w = (X^\top X)^{-1} X^\top y$$

Iterative Methods for Linear Regression

We notice that inverting $X^\top X$ takes $O(D^3)$ computations, which can be costly if $D$ is large. Also, forming $X^\top X$ itself requires $O(N D^2)$ FLOPs, which can be costly since $N$ is often very high. Hence it would be useful if we had a way to optimize the parameters without the requirement of inverting the matrix.

We can use Gradient Descent or Stochastic Gradient Descent (SGD) to optimize the parameters when $N$ is large. Let $\mathcal{L}(w)$ be the loss function, then:

Gradient Descent:

  1. Initialize $w^{(0)}$
  2. while not converged: $w^{(t+1)} = w^{(t)} - \eta \, \nabla_w \mathcal{L}(w^{(t)})$

In Gradient Descent, notice that the gradient is calculated over the entire dataset. However, that might also be computationally expensive. Hence, we can use Stochastic Gradient Descent.

Stochastic Gradient Descent:

  1. Initialize $w^{(0)}$
  2. while not converged: pick a random datapoint $(x_i, y_i)$ and update $w^{(t+1)} = w^{(t)} - \eta \, \nabla_w \mathcal{L}_i(w^{(t)})$

However, this uses the gradient of a single datapoint at each step, which might not be a good estimate of the full gradient. Hence, we can use Mini-Batch Gradient Descent, which is a tradeoff between the two methods above; the batch size can be chosen depending on the available memory (RAM).

Mini-Batch Gradient Descent:

  1. Initialize $w^{(0)}$
  2. while not converged: sample a mini-batch $B \subset \{1, \dots, N\}$ and update $w^{(t+1)} = w^{(t)} - \frac{\eta}{|B|} \sum_{i \in B} \nabla_w \mathcal{L}_i(w^{(t)})$

In our case, the loss function is $\mathcal{L}(w) = \frac{1}{N}\|y - Xw\|_2^2$ and the gradient is given by $\nabla_w \mathcal{L} = \frac{2}{N} X^\top (Xw - y)$, i.e. for a single datapoint $\nabla_w \mathcal{L}_i = 2\,(w^\top x_i - y_i)\, x_i$.
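Below is a hedged NumPy sketch of mini-batch gradient descent on the MSE loss (the learning rate, batch size, and iteration count are illustrative choices); setting the batch size to $N$ recovers full-batch Gradient Descent, and setting it to $1$ recovers SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 5
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])
w_true = rng.normal(size=D + 1)
y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(D + 1)           # 1. initialize the weights
eta, batch_size = 0.01, 32    # learning rate and mini-batch size (illustrative values)

for step in range(5000):      # 2. iterate until (approximately) converged
    idx = rng.choice(N, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)   # gradient of the mini-batch MSE
    w -= eta * grad

print(np.abs(w - w_true).max())   # should be close to zero
```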


Aside (Data pre-processing):

We saw that in linear regression, each feature is multiplied by the corresponding weight (i.e. $\hat{y} = \sum_j w_j x_j + b$). Now, if the units of the features are very different, linear regression will not give very useful results. For instance, say the height and weight of a person are used to predict age, so height and weight are the features $x_1, x_2$ respectively, and the age is the target $y$. Suppose that the features are in SI units, which means that the height is in meters and the weight is in kilograms. Say a person is 1.5 m tall and weighs 50 kg. Therefore: $\hat{y} = 1.5\,w_1 + 50\,w_2 + b$. Notice that the height of the person will be given less weightage since the weight (in SI units) dominates numerically. Although this could ideally be resolved by learning a very small $w_2$ that compensates for the large weight values, it's not a good idea (more details in regularization). We can alternatively scale the dataset (feature-wise) to ensure that the features are on the same scale, as follows:

  1. Normalization: Let $x_{\min, j}$ be the minimum value of the datapoints along the $j$-th feature column, and similarly $x_{\max, j}$ be the maximum value. Then: $x_{i,j} \leftarrow \dfrac{x_{i,j} - x_{\min, j}}{x_{\max, j} - x_{\min, j}}$
  2. Standardization: Let $\mu_j$ be the average value of the datapoints along the $j$-th feature column, and similarly let $\sigma_j$ be the square root of the sample variance. Then: $x_{i,j} \leftarrow \dfrac{x_{i,j} - \mu_j}{\sigma_j}$

Note: the nomenclature might be slightly confusing, so it's easier to remember the former as min-max scaling, while the latter can be thought of as rescaling each feature as if it were sampled from a standard Gaussian distribution.
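A small NumPy sketch of both scalings (the helper names `min_max_scale` and `standardize` are made up for illustration). In practice the min/max or mean/standard deviation should be estimated on the training split only and then reused to transform validation and test data:

```python
import numpy as np

def min_max_scale(X):
    """Normalization (min-max scaling): map each feature column to [0, 1]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def standardize(X):
    """Standardization: zero mean and unit standard deviation per feature column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Toy data matching the example above: heights in metres, weights in kilograms.
X = np.array([[1.50, 50.0],
              [1.80, 80.0],
              [1.65, 65.0]])
print(min_max_scale(X))
print(standardize(X))
```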


Probabilistic Perspective of Linear Regression

Assume that the dataset was generated using the following process:

$$y = w^\top x + \epsilon$$

where $\epsilon \sim \mathcal{N}(0, \sigma^2)$. For simplicity, we can assume a unidimensional $x$, which in turn makes $w$ a scalar.


Taking the expectation, we get:

$$\mathbb{E}[y \mid x] = w^\top x$$

Which means that if we model $\hat{y}$ as $w^\top x$, then we should be fine, because in expectation we will be making no errors. We can reparameterize by assuming that $y$ is generated from a Gaussian with mean $w^\top x$ and variance $\sigma^2$. Thus the likelihood of $y$ given $x$ is:

$$p(y \mid x; w) = \mathcal{N}\!\left(y \mid w^\top x, \sigma^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y - w^\top x)^2}{2\sigma^2} \right)$$

And consequently, the log likelihood is:

$$\log p(y \mid x; w) = -\frac{1}{2}\log\!\left(2\pi\sigma^2\right) - \frac{(y - w^\top x)^2}{2\sigma^2}$$

Therefore, the log likelihood of the dataset is:

$$\log p(\mathcal{D}; w) = \sum_{i=1}^{N} \log p(y_i \mid x_i; w) = -\frac{N}{2}\log\!\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2$$

so maximizing it is the same as minimizing the squared-error (MSE) loss from the previous section, up to constants that do not depend on $w$. This is easily extendable to the multivariate case.
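As a quick numerical sanity check of this equivalence, the following sketch (one-dimensional toy data, hypothetical names) evaluates the Gaussian negative log-likelihood and the sum of squared errors over a grid of $w$ values and confirms they are minimized at the same point:

```python
import numpy as np

# One-dimensional toy data generated as y = w*x + Gaussian noise.
rng = np.random.default_rng(0)
N, w_true, sigma = 200, 2.0, 0.5
x = rng.normal(size=N)
y = w_true * x + sigma * rng.normal(size=N)

def gaussian_nll(w):
    """Negative log likelihood of the data under y ~ N(w*x, sigma^2)."""
    resid = y - w * x
    return 0.5 * N * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

def sse(w):
    """Sum of squared errors (the un-averaged MSE objective)."""
    return np.sum((y - w * x) ** 2)

ws = np.linspace(0.0, 4.0, 401)
best_nll = ws[np.argmin([gaussian_nll(w) for w in ws])]
best_sse = ws[np.argmin([sse(w) for w in ws])]
print(best_nll, best_sse)   # both objectives are minimized at the same w
```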


Regularized Linear Regression

We can impose a prior on the weights and get the maximum a posteriori (MAP) estimate. Let's assume $w \sim \mathcal{N}(0, \tau^2)$.

$$\hat{w}_{\text{MAP}} = \arg\max_w \big[ \log p(\mathcal{D} \mid w) + \log p(w) \big] = \arg\min_w \left[ \sum_{i=1}^{N} (y_i - w x_i)^2 + \lambda w^2 \right]$$

where $\lambda = \sigma^2 / \tau^2$. In the multivariate case, we have:

$$\mathcal{L}(w) = \| y - Xw \|_2^2 + \lambda \| w \|_2^2$$

This is known as $\ell_2$-regularized regression or Ridge regression, and $\lambda$ can be considered a hyper-parameter controlling the degree of regularization.

Taking the gradient of the loss w.r.t. $w$ and equating it to zero, we get:

$$w = (X^\top X + \lambda I)^{-1} X^\top y$$

Notice that unlike $X^\top X$, the matrix $X^\top X + \lambda I$ is positive definite (for $\lambda > 0$) and therefore invertible without any additional assumptions.
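A minimal ridge-regression sketch in NumPy (synthetic data and an illustrative $\lambda$); note that no rank assumption on $X$ is needed:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 30, 10
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
lam = 1.0   # regularization strength (hyper-parameter)

# Ridge solution: w = (X^T X + lambda I)^{-1} X^T y.
# The system is solvable for any lambda > 0, even if X^T X is singular.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(w_ridge)
```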

Why regularize?

The objective of regularization is to constrain the parameters in a certain way. Regularization is often used to deal with the problem of overfitting in high capacity (over-parameterized) models.

  1. $\ell_2$ Regularization: We constrain the $\ell_2$ norm of the parameters, which results in parameters being "small" dimension-wise. In ridge regression (assuming appropriate scaling has been performed on the dataset), notice that the regularization can be interpreted as roughly giving equal importance to each feature. We can rewrite the loss as Total loss = Reconstruction loss + Regularization, where the reconstruction loss is $\|y - Xw\|_2^2$ and the regularization is $\lambda \|w\|_2^2$. Consequently, if any feature (say the $j$-th) gets more weightage, i.e. the corresponding $w_j$ is high, then it must justify itself by reducing the reconstruction loss by the corresponding increase in penalty. For instance, let $w_j$ be the original value and $w_j + \delta$ be the new value with $\delta > 0$. The penalty for the change from $w_j$ to $w_j + \delta$ is $\lambda\big((w_j + \delta)^2 - w_j^2\big) = \lambda(2 w_j \delta + \delta^2)$. If the change is small, we can say that the penalty $\approx 2\lambda w_j \delta$. Similarly, the corresponding reduction in the reconstruction loss is approximately $2\delta\, x_{:,j}^\top (y - Xw)$, where $x_{:,j}$ is the $j$-th column of the dataset $X$. Increasing $w_j$ by $\delta$ will only happen if this reconstruction term compensates for the corresponding penalty.
  2. $\ell_0$ Regularization: An alternative approach is to select a subset of features which can explain the data. Many datasets have a large number of features. Since inference takes time (due to the inner-product computation), reducing the number of features used by the linear regression model reduces the inference time. For instance, gene expression datasets have a very high number of features, and both training and inference time can benefit from modeling with a subset of features. However, subset selection is NP-hard, and hence $\ell_0$ regularization is computationally intractable.
  3. $\ell_1$ Regularization: We can relax the $\ell_0$ regularization to $\ell_1$ regularization, $\mathcal{L}(w) = \|y - Xw\|_2^2 + \lambda \|w\|_1$, which gives us an approximate solution to the subset selection problem.

Implementation details: Upon convergence, $w$ often has some dimensions where the values are extremely low (close to zero). We can remove those features from the dataset as they are not very important; removing them reduces the size of the model, thus speeding up inference when deployed.
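One possible way to implement this pruning step is sketched below (the helper `prune_features` and the tolerance `tol` are illustrative assumptions, not prescribed by the text):

```python
import numpy as np

def prune_features(X, w, tol=1e-4):
    """Drop feature columns whose learned weight magnitude is below `tol`.

    Returns the reduced design matrix, the reduced weight vector, and the
    indices of the columns that were kept.
    """
    keep = np.abs(w) >= tol
    return X[:, keep], w[keep], np.flatnonzero(keep)

# Example: a sparse weight vector of the kind typically produced by l1-regularized training.
w = np.array([0.0, 1.3, 1e-6, -0.7, 2e-5])
X = np.random.default_rng(0).normal(size=(8, 5))
X_small, w_small, kept = prune_features(X, w)
print(kept)   # indices of the retained features -> [1 3]
```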


Logistic Regression


The logit function/transform:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Logistic regression is used for classification. Say we have two classes, and each item belongs to one of these classes, i.e. $y_i \in \{0, 1\}$. We can transform the output of the linear model into a conditional probability mass corresponding to each class by applying this transform (the sigmoid $\sigma$, which is the inverse of the logit $\log\frac{p}{1-p}$):

$$P(y = 1 \mid x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$$

and

$$P(y = 0 \mid x) = 1 - P(y = 1 \mid x) = 1 - \sigma(w^\top x)$$

Therefore, the likelihood of $y$ given $x$ is:

$$P(y \mid x) = \sigma(w^\top x)^{y} \, \big(1 - \sigma(w^\top x)\big)^{1 - y}$$

The likelihood of the dataset is:

$$P(\mathcal{D}) = \prod_{i=1}^{N} \sigma(w^\top x_i)^{y_i} \, \big(1 - \sigma(w^\top x_i)\big)^{1 - y_i}$$

And the log likelihood of the dataset is:

$$\log P(\mathcal{D}) = \sum_{i=1}^{N} \Big[ y_i \log \sigma(w^\top x_i) + (1 - y_i) \log\big(1 - \sigma(w^\top x_i)\big) \Big]$$

Notice that in the log likelihood, when $y_i = 1$ the first term is non-zero and the second term is zero (due to the multiplier in front), and similarly when $y_i = 0$, the second term is non-zero and the first term is zero.

Maximizing the log likelihood is equivalent to minimizing the negative of the log likelihood. Therefore, the negative log likelihood loss (NLL loss) is:

$$\mathcal{L}(w) = -\sum_{i=1}^{N} \Big[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \Big], \qquad \hat{y}_i = \sigma(w^\top x_i)$$

This formulation of the NLL loss is also known as the Binary Cross Entropy (BCE) loss. Replacing $\hat{y}_i = \frac{1}{1 + e^{-w^\top x_i}}$, we get:

$$\mathcal{L}(w) = \sum_{i=1}^{N} \Big[ y_i \log\big(1 + e^{-w^\top x_i}\big) + (1 - y_i) \log\big(1 + e^{w^\top x_i}\big) \Big]$$
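A minimal NumPy sketch of logistic regression trained by gradient descent on the (mean) BCE loss (the synthetic data, learning rate, and iteration count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, X, y):
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)   # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic binary-classification data with a column of ones for the bias.
rng = np.random.default_rng(0)
N, D = 500, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])
w_true = rng.normal(size=D + 1)
y = (rng.uniform(size=N) < sigmoid(X @ w_true)).astype(float)   # Bernoulli labels

w = np.zeros(D + 1)
eta = 0.5
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ w) - y) / N   # gradient of the mean BCE loss
    w -= eta * grad

print(bce_loss(w, X, y), w)
```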

Regularization:

Similar to linear regression, we can regularize logistic regression by constraining the weights as follows:

$$\mathcal{L}_{\text{reg}}(w) = \mathcal{L}_{\text{BCE}}(w) + \lambda \|w\|_2^2 \qquad \text{(or } \lambda \|w\|_1 \text{ for sparsity)}$$

Generalize to multi-class classification:

Let there be $C$ classes to which a datapoint can belong. One way to extend a binary classifier to multiple classes is to simply train $C$ different parallel models, each modeling the probability of a datapoint belonging to one of the classes. For class $c$, we relabel the targets such that $\tilde{y}_i = 1$ if $y_i = c$ else $\tilde{y}_i = 0$, and then train a binary logistic regression classifier using the new $\tilde{y}$ as the ground truth. During inference, we have $C$ different models. Notice that each model is a vector $w_c$, so we can stack the linear models into a matrix $W$ whose $c$-th column is $w_c$. Therefore, upon computing $XW$, we get an $N \times C$ matrix as output. For each sample (row), we can simply predict the index corresponding to the maximum value as the class label. This method is known as the One vs All (OVA) or One vs Rest (OVR) classifier.

Additionally, for each row, we can compute the softmax of the outputs to obtain a probability distribution over the classes. Let the output for $x_i$ from the OVA model be $z_i = W^\top x_i$, which is a $C$-dimensional vector. Then the probability is:

$$P(y_i = c \mid x_i) = \frac{e^{z_{i,c}}}{\sum_{k=1}^{C} e^{z_{i,k}}}$$
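A small sketch of OVA prediction with a softmax over the per-class scores (shapes and names are illustrative; training of the $C$ binary classifiers is assumed to have already produced `W`):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract the row-wise max for numerical stability
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, D, C = 6, 4, 3
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, C))        # one weight vector (column) per one-vs-rest classifier

scores = X @ W                     # N x C matrix of per-class scores
pred = scores.argmax(axis=1)       # hard prediction: index of the maximum score per row
probs = softmax(scores)            # soft prediction: a probability distribution per row
print(pred, probs.sum(axis=1))     # every row of probs sums to 1
```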


Additional Topics

Weighted Linear Regression

Let's consider a dataset $\{(x_i, y_i, \beta_i)\}_{i=1}^{N}$, where $x_i$ are the features, $y_i$ are the labels, and $\beta_i > 0$ are the weighting factors, i.e. some datapoints are given more weight than others (similar to policy gradient algorithms in reinforcement learning). Then the ERM objective is:

$$\mathcal{L}(w) = \sum_{i=1}^{N} \beta_i \left( y_i - w^\top x_i \right)^2 = (y - Xw)^\top B \, (y - Xw)$$

where $B$ is a diagonal matrix with $\beta_i$ on the diagonal.

Upon taking the gradient of the loss, we get:

$$\nabla_w \mathcal{L} = -2\, X^\top B \, (y - Xw)$$

Equating to zero, we get:

$$w = (X^\top B X)^{-1} X^\top B y$$
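A minimal weighted-least-squares sketch in NumPy (synthetic data and weights); in practice one would avoid materializing the dense diagonal matrix $B$ and scale the rows instead, as shown in the last two lines:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 4
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
beta = rng.uniform(0.1, 2.0, size=N)   # per-datapoint importance weights

B = np.diag(beta)                      # dense diagonal matrix, fine for small N
# Weighted least squares: w = (X^T B X)^{-1} X^T B y.
w = np.linalg.solve(X.T @ B @ X, X.T @ B @ y)

# Equivalent (and cheaper for large N): scale the rows instead of building B.
w_alt = np.linalg.solve(X.T @ (beta[:, None] * X), X.T @ (beta * y))
assert np.allclose(w, w_alt)
```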

Comments on NLL Loss

Let the label for a given datapoint $x_i$ be a one-hot vector $y_i$ with the $c$-th index being $1$ if the label of $x_i$ is class $c$. So, for the given datapoint $x_i$, the (cross-entropy) loss is:

$$\mathcal{L}_i = -\sum_{c=1}^{C} y_{i,c} \log P(c \mid x_i) = -\log \frac{e^{w_c^\top x_i}}{\sum_{k=1}^{C} e^{w_k^\top x_i}} \quad \text{(for the true class } c\text{)}$$

where $w_c$ represents the $c$-th column of the matrix $W$, i.e. the linear model for class $c$. Note that entropy is defined as $H(p) = -\sum_{k} p_k \log p_k$.

Minimizing the negative log likelihood is the same as minimizing the KL divergence between the data distribution and the distribution predicted by the model.

Learned distribution: $q(c \mid x_i) = \dfrac{e^{w_c^\top x_i}}{\sum_{k} e^{w_k^\top x_i}}$. Data distribution: $p(c \mid x_i) = y_{i,c}$, the one-hot ground truth.

KL divergence:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{c} p(c \mid x_i) \log \frac{p(c \mid x_i)}{q(c \mid x_i)} = \sum_{c} p(c \mid x_i) \log p(c \mid x_i) \;-\; \sum_{c} p(c \mid x_i) \log q(c \mid x_i)$$

The former term is the negative entropy $-H(p)$, which is independent of $w$, and the latter term is the cross-entropy, which depends on $w$.

Minimizing the KL divergence over $w$ therefore amounts to minimizing the cross-entropy $-\sum_{c} p(c \mid x_i) \log q(c \mid x_i)$, which (because $p$ is one-hot) is just $-\log q$ evaluated at the true class, i.e. essentially the NLL loss. It is for this reason that the NLL loss for categorical variables is also known as the Categorical Cross Entropy loss. The data distribution $p(c \mid x_i)$ can be written as the indicator $\mathbb{1}[y_i = c]$ since in the dataset we have no uncertainty, i.e. we know the ground truth for every $x_i$.
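Finally, a short sketch showing that the categorical cross-entropy with one-hot targets reduces to the negative log-probability of the true class (toy probabilities; `categorical_cross_entropy` is an illustrative helper, not a library function):

```python
import numpy as np

def categorical_cross_entropy(Y_onehot, P):
    """Mean cross-entropy between one-hot targets and predicted class probabilities."""
    eps = 1e-12                                         # guard against log(0)
    return -np.mean(np.sum(Y_onehot * np.log(P + eps), axis=1))

# Toy example: 3 samples, 4 classes.
Y = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 0, 0]], dtype=float)
P = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.2, 0.2, 0.5, 0.1],
              [0.1, 0.6, 0.2, 0.1]])

# Because each row of Y is one-hot, the loss reduces to -mean(log P[i, true_class_i]).
print(categorical_cross_entropy(Y, P))
print(-np.mean(np.log(P[np.arange(3), Y.argmax(axis=1)])))   # (essentially) the same value
```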