Distance-based Discriminator

Given a dataset of datapoints $X = \{x_1, x_2, \ldots, x_N\}$ (where $x_i \in \mathbb{R}^d$) and corresponding labels $Y = \{y_1, y_2, \ldots, y_N\}$ (where each $y_i$ belongs to one of a fixed set of classes), we can extract some useful patterns among the datapoints and then essentially forget the dataset. This procedure is called learning, and the models which learn useful parameters from the dataset are known as parametric models. For simplicity we can assume that there are only 2 classes (the positive class can be considered $+1$ and the negative class can be considered $0$ or $-1$).

In the dataset described above, we already know the labels. So, we can learn the means of the class-conditional datapoints, i.e.

$$\mu_+ = \frac{1}{N_+} \sum_{i \,:\, y_i = +1} x_i, \qquad \mu_- = \frac{1}{N_-} \sum_{i \,:\, y_i = -1} x_i$$

where $N_+$ and $N_-$ are the numbers of positive and negative datapoints respectively.

We can create a simple discriminator function by comparing a new datapoint $x$ with the class-conditional means, with the label of the new datapoint being the same as that of the most similar (closest) mean, i.e.

$$\hat{y}(x) = \begin{cases} +1 & \text{if } \|x - \mu_+\| \le \|x - \mu_-\| \\ -1 & \text{otherwise} \end{cases}$$
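
A minimal NumPy sketch of this nearest-mean rule (the function names and the toy data below are illustrative, not from the original notes):

```python
import numpy as np

def fit_class_means(X, y):
    """Compute the class-conditional means for a binary dataset (labels +1 / -1)."""
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == -1].mean(axis=0)
    return mu_pos, mu_neg

def predict_nearest_mean(x, mu_pos, mu_neg):
    """Assign the label of the closer class mean (Euclidean distance)."""
    return 1 if np.linalg.norm(x - mu_pos) <= np.linalg.norm(x - mu_neg) else -1

# Toy usage
X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]])
y = np.array([-1, -1, 1, 1])
mu_pos, mu_neg = fit_class_means(X, y)
print(predict_nearest_mean(np.array([5.5, 5.0]), mu_pos, mu_neg))  # -> 1
```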

Fisher Discriminant Analysis

We can additionally assume the dataset to be sampled from a Gaussian. Fisher thought of trying to find a linear projection of the given dataset. The data can be projected onto the standard basis or onto other subspaces (linear projections), as shown below:

As we can see, not all choices of linear projection are equally good at preserving the class information and giving a classifier which might generalize well.

We want to identify a vector $w$ for the linear projection that

  1. maximizes the between-class spread, and
  2. minimizes the within-class spread.

Let the means after the linear projection be $\tilde{\mu}_+$ and $\tilde{\mu}_-$, and the variances be $\tilde{\sigma}_+^2$ and $\tilde{\sigma}_-^2$.

So, we would like to make the following ratio big:

$$J(w) = \frac{(\tilde{\mu}_+ - \tilde{\mu}_-)^2}{\tilde{\sigma}_+^2 + \tilde{\sigma}_-^2}$$

We can assume that $\tilde{\mu}_+ = w^T \mu_+$ and $\tilde{\mu}_- = w^T \mu_-$, where $\mu_+$ and $\mu_-$ are the actual means of the class-conditional data distributions. Therefore:

$$(\tilde{\mu}_+ - \tilde{\mu}_-)^2 = \left(w^T(\mu_+ - \mu_-)\right)^2 = w^T (\mu_+ - \mu_-)(\mu_+ - \mu_-)^T w$$

Similarly, $\tilde{\sigma}_+^2$ can be written as:

$$\tilde{\sigma}_+^2 = w^T \Sigma\, w$$

where $\Sigma$ is the covariance matrix of the entire dataset.

Again,

$$\tilde{\sigma}_-^2 = w^T \Sigma\, w$$

Therefore, the objective is:

$$J(w) = \frac{w^T (\mu_+ - \mu_-)(\mu_+ - \mu_-)^T w}{2\, w^T \Sigma\, w}$$

We can take the gradient with respect to $w$ and set it to zero:

$$\nabla_w J(w) = 0 \;\Rightarrow\; (\mu_+ - \mu_-)(\mu_+ - \mu_-)^T w \;\propto\; \Sigma\, w \;\Rightarrow\; \Sigma\, w \;\propto\; (\mu_+ - \mu_-)$$

since $(\mu_+ - \mu_-)(\mu_+ - \mu_-)^T w$ is always a scalar multiple of $(\mu_+ - \mu_-)$.

Thus, the projection direction is $w \propto \Sigma^{-1}(\mu_+ - \mu_-)$.
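
A sketch of computing this direction, assuming the shared dataset covariance $\Sigma$ described above (the helper name `fisher_direction` is hypothetical):

```python
import numpy as np

def fisher_direction(X, y):
    """Fisher projection direction w proportional to Sigma^{-1}(mu_+ - mu_-)."""
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == -1].mean(axis=0)
    Sigma = np.cov(X, rowvar=False)              # covariance of the entire dataset
    w = np.linalg.solve(Sigma, mu_pos - mu_neg)  # solve Sigma w = (mu_+ - mu_-)
    return w / np.linalg.norm(w)                 # only the direction matters

# Projecting the data onto w gives the 1-D representation scored by the ratio J(w):
# z = X @ fisher_direction(X, y)
```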


Probabilistic view of Discriminant Analysis

We can learn a generative model and estimate the distribution that generated the dataset for each class separately, i.e. $p(x \mid y = +1)$ and $p(x \mid y = -1)$. Given the generative model, we can use Bayes rule to design an optimal classifier.

Since we are in the binary classification domain ($y \in \{+1, -1\}$):

$$p(y = +1 \mid x) = \frac{p(x \mid y = +1)\, p(y = +1)}{p(x \mid y = +1)\, p(y = +1) + p(x \mid y = -1)\, p(y = -1)}$$

We can denote the prior probabilities as $\pi_c = p(y = c)$, where $c \in \{+1, -1\}$, and they can be estimated from the dataset as:

$$\pi_+ = \frac{N_+}{N}, \qquad \pi_- = \frac{N_-}{N}$$

Therefore:

$$p(y = +1 \mid x) = \frac{p(x \mid y = +1)\, \pi_+}{p(x \mid y = +1)\, \pi_+ + p(x \mid y = -1)\, \pi_-}$$
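
In code, this posterior is just the normalised product of likelihood and prior; the sketch below assumes the class-conditional densities are supplied as numbers (how they are modelled is the subject of the next paragraphs):

```python
def posterior_positive(likelihood_pos, likelihood_neg, prior_pos):
    """P(y=+1 | x) by Bayes rule for two classes."""
    prior_neg = 1.0 - prior_pos
    evidence = likelihood_pos * prior_pos + likelihood_neg * prior_neg
    return likelihood_pos * prior_pos / evidence
```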

In the context of Discriminant Analysis, we assume that the class-conditional datapoints were sampled from Gaussian distributions. Since we are considering the binary classification problem, the parameters are $(\mu_+, \Sigma_+)$ and $(\mu_-, \Sigma_-)$ for the positive and negative class respectively.

Given a class label, it is easy to estimate the parameters.

Class-conditional sample mean: $\mu_c = \frac{1}{N_c} \sum_{i \,:\, y_i = c} x_i$

Class-conditional sample covariance matrix: $\Sigma_c = \frac{1}{N_c} \sum_{i \,:\, y_i = c} (x_i - \mu_c)(x_i - \mu_c)^T$

Additionally, we already know the priors: $\pi_+ = \frac{N_+}{N}$ and $\pi_- = \frac{N_-}{N}$.
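
A sketch of this estimation step (the dictionary layout and function name are illustrative choices, not fixed by the notes):

```python
import numpy as np

def estimate_gaussian_params(X, y):
    """Class-conditional sample means/covariances and class priors (labels +1 / -1)."""
    params = {}
    for label in (+1, -1):
        Xc = X[y == label]
        params[label] = {
            "mean": Xc.mean(axis=0),
            "cov": np.cov(Xc, rowvar=False),
            "prior": len(Xc) / len(X),
        }
    return params
```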

Now all we need to do is find a decision rule. The decision boundary is given by the region where $p(y = +1 \mid x) = p(y = -1 \mid x)$.

Quadratic Discriminant Analysis

For a Gaussian:

$$p(x \mid y = c) = \frac{1}{(2\pi)^{d/2} |\Sigma_c|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c)\right)$$

Classification rule (for QDA): Let the ratio be

$$r(x) = \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)} = \frac{p(x \mid y = +1)\, \pi_+}{p(x \mid y = -1)\, \pi_-}$$

Then $\hat{y} = +1$ if $r(x) > 1$, else $\hat{y} = -1$.
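
Putting the pieces together, a QDA sketch using SciPy's multivariate normal density and the `estimate_gaussian_params` helper sketched earlier (both names are illustrative):

```python
from scipy.stats import multivariate_normal

def qda_predict(x, params):
    """Predict +1 if the posterior ratio P(y=+1|x) / P(y=-1|x) exceeds 1."""
    score_pos = multivariate_normal.pdf(x, params[+1]["mean"], params[+1]["cov"]) * params[+1]["prior"]
    score_neg = multivariate_normal.pdf(x, params[-1]["mean"], params[-1]["cov"]) * params[-1]["prior"]
    return +1 if score_pos > score_neg else -1
```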

Linear Discriminant Analysis

We can further assume that we have the same covariance matrix for both the classes, i.e. $\Sigma_+ = \Sigma_- = \Sigma$.

We can expand the log-ratio using the Gaussian density with the shared covariance $\Sigma$:

$$\log r(x) = -\frac{1}{2}(x - \mu_+)^T \Sigma^{-1}(x - \mu_+) + \frac{1}{2}(x - \mu_-)^T \Sigma^{-1}(x - \mu_-) + \log\frac{\pi_+}{\pi_-}$$

The quadratic term $x^T \Sigma^{-1} x$ cancels between the two classes.

Therefore,

$$\log r(x) = x^T \Sigma^{-1}(\mu_+ - \mu_-) + c$$

where $c = -\frac{1}{2}(\mu_+ + \mu_-)^T \Sigma^{-1}(\mu_+ - \mu_-) + \log\frac{\pi_+}{\pi_-}$ is a constant (independent of $x$).

Notice that the term $x^T \Sigma^{-1}(\mu_+ - \mu_-)$ is just the linear projection of $x$ onto the direction $\Sigma^{-1}(\mu_+ - \mu_-)$, which is the same direction found by Fisher Discriminant Analysis. Actually, LDA is exactly the same as FDA, just derived from the probabilistic perspective.

Updated decision rule (for LDA): $\hat{y} = +1$ if $\log r(x) > 0$ (equivalently $r(x) > 1$), else $\hat{y} = -1$.
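
A corresponding LDA sketch; here the shared covariance is taken to be the average of the two class covariances, which is one reasonable pooled estimate but an assumption on top of the notes:

```python
import numpy as np

def lda_predict(x, params):
    """Linear rule: project x onto w = Sigma^{-1}(mu_+ - mu_-) and compare with a threshold."""
    Sigma = 0.5 * (params[+1]["cov"] + params[-1]["cov"])   # shared-covariance assumption
    mu_pos, mu_neg = params[+1]["mean"], params[-1]["mean"]
    w = np.linalg.solve(Sigma, mu_pos - mu_neg)
    # Threshold comes from the constant c in the log-ratio above, moved to the right-hand side.
    c = 0.5 * (mu_pos + mu_neg) @ w - np.log(params[+1]["prior"] / params[-1]["prior"])
    return +1 if x @ w > c else -1
```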