# A Deeper Dive into PCA

PCA finds a low-dimensional representation of a dataset that contains as much of the variation as possible. Each of the $n$ observations lives in a $p$-dimensional space, but not all dimensions are equally interesting.

# Linear Algebra Review

Let $A$ be an $n \times n$ matrix. With $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$, the `determinant` of $A$ can be calculated as follows:

$$\det(A) = ad - bc$$

Properties of determinant:

- $\det(AB) = \det(A)\det(B)$
- $\det(A^T) = \det(A)$
- $\det(A^{-1}) = 1/\det(A)$ for invertible $A$
- the determinant of a triangular matrix is the product of its diagonal entries
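
A quick numerical check of these properties, sketched with NumPy (the matrices here are arbitrary examples):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [2.0, 5.0]])

print(np.linalg.det(A))  # ad - bc = 1*4 - 2*3 = -2
# det(AB) = det(A) det(B)
print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))
# det(A^T) = det(A)
print(np.isclose(np.linalg.det(A.T), np.linalg.det(A)))
# det(A^{-1}) = 1 / det(A)
print(np.isclose(np.linalg.det(np.linalg.inv(A)), 1 / np.linalg.det(A)))
```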

A real number $\lambda$ is an `eigenvalue` of $A$ if there exists a non-zero vector $v$ (an `eigenvector`) in $\mathbb{R}^n$ such that:

$$Av = \lambda v$$

The determinant of the matrix $A - \lambda I$ is called the `characteristic polynomial` of $A$. The equation $\det(A - \lambda I) = 0$ is called the `characteristic equation` of $A$, where the eigenvalues are the real roots of the equation. It can be shown that:

$$\det(A) = \prod_{i=1}^{n} \lambda_i, \qquad \operatorname{tr}(A) = \sum_{i=1}^{n} \lambda_i$$
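As a sketch, NumPy can recover the coefficients of the characteristic polynomial and confirm that its roots are the eigenvalues; the matrix below is an arbitrary symmetric example:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

coeffs = np.poly(A)      # coefficients of det(lambda*I - A)
print(np.roots(coeffs))  # its roots are the eigenvalues: 3 and 1
print(np.linalg.eigvals(A))

# det(A) is the product of the eigenvalues; tr(A) is their sum
w = np.linalg.eigvals(A)
print(np.isclose(np.linalg.det(A), np.prod(w)))
print(np.isclose(np.trace(A), np.sum(w)))
```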

Matrix $A$ is `invertible` if there exists a matrix $B$ such that $AB = BA = I$. A square matrix is invertible if and only if its determinant is non-zero. A non-square matrix does not have an inverse.

Matrix $A$ is called `diagonalizable` if and only if it has $n$ linearly independent eigenvectors. Let $S$ denote the matrix whose columns are the eigenvectors of $A$, and let $\Lambda$ denote the diagonal matrix of its eigenvalues. Then:

$$A = S \Lambda S^{-1}$$
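A minimal sketch of this factorization with NumPy, using an arbitrary example matrix with distinct (hence independent) eigenvectors:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

w, S = np.linalg.eig(A)  # columns of S are the eigenvectors of A
Lambda = np.diag(w)      # eigenvalues on the diagonal

# A = S Lambda S^{-1} holds because the eigenvectors are linearly independent
print(np.allclose(A, S @ Lambda @ np.linalg.inv(S)))
```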

If matrix $A$ is `symmetric` ($A = A^T$), then:

- all eigenvalues of $A$ are real numbers
- all eigenvectors of $A$ from distinct eigenvalues are orthogonal

Matrix $A$ is `positive semi-definite` if and only if any of the following holds:

- $x^T A x \ge 0$ for any vector $x$
- all eigenvalues of $A$ are non-negative
- all the principal submatrices have non-negative determinants

Matrix $A$ is `positive definite` if and only if any of the following holds:

- $x^T A x > 0$ for any non-zero vector $x$
- all eigenvalues of $A$ are positive
- all the upper-left submatrices have positive determinants

All `covariance` and `correlation` matrices must be `symmetric` and `positive semi-definite`. If there is no perfect linear dependence between the random variables, then the matrix must be `positive definite`.
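A quick way to sanity-check this claim is to form sample covariance and correlation matrices from random data and inspect their eigenvalues; a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((200, 4))    # 200 observations of 4 variables

cov = np.cov(data, rowvar=False)
corr = np.corrcoef(data, rowvar=False)

for M in (cov, corr):
    print(np.allclose(M, M.T))          # symmetric
    print(np.linalg.eigvalsh(M).min())  # smallest eigenvalue >= 0 (up to rounding)
```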

Let $A$ be an invertible matrix; the `LU decomposition` breaks down $A = LU$ as the product of a lower triangular matrix $L$ and an upper triangular matrix $U$. Some applications are:

- solving $Ax = b$: first solve $Ly = b$ by forward substitution, then solve $Ux = y$ by backward substitution
- computing $\det(A)$: $\det(A) = \det(L)\det(U)$, the product of the diagonal entries of $L$ and $U$
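A sketch of both applications using SciPy's LU routines; the matrix and right-hand side are arbitrary examples (note SciPy factors with a permutation, $A = PLU$):

```python
import numpy as np
from scipy.linalg import lu, lu_factor, lu_solve

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])
b = np.array([10.0, 12.0])

# solve Ax = b: factor once, then forward/backward substitution
lu_piv = lu_factor(A)
x = lu_solve(lu_piv, b)
print(np.allclose(A @ x, b))

# det(A) = det(P) det(L) det(U); L has a unit diagonal, so this is
# (+/-1) times the product of U's diagonal entries
P, L, U = lu(A)
print(np.linalg.det(P) * np.prod(np.diag(U)), np.linalg.det(A))
```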

Let $A$ be a symmetric positive definite matrix; the `Cholesky decomposition` builds on the `LU decomposition` and breaks down $A = R^T R$, where $R$ is a `unique` upper triangular matrix with positive diagonal entries. Cholesky decomposition can be used to generate correlated random variables in Monte Carlo simulation.
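A minimal Monte Carlo sketch, assuming a hypothetical 2×2 target covariance matrix: draw independent standard normals, then multiply by the Cholesky factor to obtain correlated draws.

```python
import numpy as np

# hypothetical target covariance matrix (symmetric positive definite)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

# NumPy returns the lower-triangular factor L with Sigma = L L^T,
# so R = L.T is the upper-triangular factor with Sigma = R^T R
L = np.linalg.cholesky(Sigma)

rng = np.random.default_rng(7)
z = rng.standard_normal((100_000, 2))    # independent standard normal draws
x = z @ L.T                              # correlated draws with covariance Sigma

print(np.cov(x, rowvar=False).round(3))  # approximately Sigma
```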

# Matrix Interpretation

Consider an $n \times p$ data matrix $X$:

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}$$

To find the first principal component $Z_1$, we define it as the normalized linear combination of the features $X_1, \ldots, X_p$ that has the largest variance, where its `loading`s $\phi_{11}, \ldots, \phi_{p1}$ are normalized:

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p, \qquad \sum_{j=1}^{p} \phi_{j1}^2 = 1$$

Or equivalently, for each `score` $z_{i1}$:

$$z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip}$$

In matrix form:

$$z_1 = X \phi_1$$

Finally, the first principal component loading vector $\phi_1 = (\phi_{11}, \ldots, \phi_{p1})^T$ solves the optimization problem that maximizes the sample variance of the scores $z_{11}, \ldots, z_{n1}$. An objective function can be formulated as follows and solved via an `eigen decomposition`:

$$\max_{\phi_1} \ \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{j1} x_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1$$

To find the second principal component loading $\phi_2$, use the same objective function with $\phi_2$ in place of $\phi_1$ and include an additional constraint that $\phi_2$ is orthogonal to $\phi_1$.
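A minimal sketch of solving the optimization via eigen decomposition, on randomly generated example data: the first loading vector is the top eigenvector of $X^T X$.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 3))
X = X - X.mean(axis=0)                 # center each variable

# the first loading vector is the top eigenvector of X^T X
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
phi1 = eigvecs[:, np.argmax(eigvals)]  # unit-norm by construction
z1 = X @ phi1                          # scores on the first component

# the scores' sample variance equals lambda_max / n
print(np.isclose((z1 ** 2).mean(), eigvals.max() / len(X)))
```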

# Geometric Interpretation

The `loading` matrix $\Phi$ defines a linear transformation that projects the data from the feature space $\mathbb{R}^p$ into a lower-dimensional subspace $\mathbb{R}^k$, in which the data has the most variance. The result of the projection is the `factor` matrix $F = X \Phi$, also known as the `principal components`.

In other words, the principal component vectors span a low-dimensional linear subspace that is the closest (in average squared Euclidean distance) to the observations.

# Eigen Decomposition

Given a centered $n \times p$ data matrix $X$, the objective of `PCA` is to find a lower-dimensional factor matrix $F$, from which a matrix $\hat{X}$ can be constructed such that the distance between the covariance matrices of $X$ and $\hat{X}$ is minimized.

The covariance matrix $\frac{1}{n} X^T X$ of $X$ is a symmetric positive semi-definite matrix, therefore we have the following decomposition, where the $q_i$'s are the eigenvectors of $X^T X$ and the $\lambda_i$'s are the eigenvalues. Note that some $\lambda_i$'s can be zero if the columns of $X$ are linearly dependent.

$$\frac{1}{n} X^T X = \frac{1}{n} Q \Lambda Q^T = \frac{1}{n} \sum_{i=1}^{p} \lambda_i q_i q_i^T$$

If we ignore the constant $\frac{1}{n}$, we can define the `loading` matrix $\Phi = Q$ and the `factor` matrix $F$, where $F = X \Phi$. Then:

$$X = F \Phi^T, \qquad F^T F = \Phi^T X^T X \Phi = \Lambda$$

Now comes the `PCA` idea: let's rank the $\lambda_i$'s in descending order and pick $k < p$ such that the first $k$ eigenvalues account for most of the total variance:

$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i} \approx 1$$

Now we observe that the truncated matrix $\sum_{i=1}^{k} \lambda_i q_i q_i^T$ is also a positive semi-definite matrix. Following a similar decomposition, we obtain a $p \times k$ loading matrix $\Phi_k$ (the first $k$ columns of $Q$) and an $n \times k$ factor matrix $F_k$, where:

$$F_k = X \Phi_k$$

Here we have it: a dimension-reduced factor matrix $F_k$, whose projection back to the $p$-dimensional space, $\hat{X} = F_k \Phi_k^T$, has a covariance similar to that of the original dataset $X$.
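Putting the pieces together, a minimal NumPy sketch of this procedure on randomly generated example data (the choice $k = 2$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))
X = X - X.mean(axis=0)             # center the data

# eigen decomposition of the symmetric PSD matrix X^T X
eigvals, Q = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]  # rank eigenvalues in descending order
eigvals, Q = eigvals[order], Q[:, order]

k = 2                              # arbitrary choice for illustration
Phi_k = Q[:, :k]                   # p x k loading matrix
F_k = X @ Phi_k                    # n x k factor (score) matrix
X_hat = F_k @ Phi_k.T              # projection back to the p-dim space

# the two covariance matrices should be close
print(np.cov(X, rowvar=False).round(2))
print(np.cov(X_hat, rowvar=False).round(2))
```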

# Practical Considerations

PCA excels at identifying `latent` variables from the `measurable` variables. PCA can only be applied to `numeric` data; categorical variables need to be binarized beforehand.

- `Centering`: yes, the variables should be centered to have mean zero.
- `Scaling`:
  - if the range and scale of the variables are different, the `correlation matrix` is typically used to perform PCA, i.e. each variable is scaled to have a standard deviation of one
  - otherwise, if the variables are in the same units of measure, using the `covariance matrix` (not standardizing the variables) could reveal interesting properties of the data
- `Uniqueness`: each loading vector $\phi_j$ is unique up to a sign flip, as it can take on the opposite direction in the same subspace. The same applies to the score vector $z_j$, since flipping the signs of both leaves the reconstruction $\hat{X}$ unchanged.
- `Proportion of Variance Explained` (PVE): we can compute the total variance in a data set with the first formula below (assuming centered variables). The variance explained by the $m$-th principal component is $\frac{1}{n} \sum_{i=1}^{n} z_{im}^2$. Therefore, the second formula computes the PVE of the $m$-th component:

$$\sum_{j=1}^{p} \operatorname{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2$$

$$\text{PVE}_m = \frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}$$
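A minimal sketch computing the PVE of each component on randomly generated, centered example data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 4)) @ rng.standard_normal((4, 4))
X = X - X.mean(axis=0)               # centered, unscaled (covariance PCA)

eigvals, Q = np.linalg.eigh(X.T @ X)
Q = Q[:, np.argsort(eigvals)[::-1]]  # loadings, ranked by variance explained
Z = X @ Q                            # scores, one column per component

pve = (Z ** 2).sum(axis=0) / (X ** 2).sum()
print(pve.round(3), pve.sum())       # individual PVEs; they sum to 1
```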

References:

- James, G., Witten, D., Hastie, T., and Tibshirani, R., *An Introduction to Statistical Learning with Applications in R*
- Balasanov, Y., *FINM 33601 Lecture Notes*, University of Chicago