PCA finds a low-dimensional representation of a dataset that contains as much of the variation as possible. Each of the $n$ observations lives in a $p$-dimensional space, but not all dimensions are equally interesting.

Linear Algebra Review

Let $A$ be an $n \times n$ matrix. With $A_{1j}$ denoting the $(n-1) \times (n-1)$ submatrix obtained by deleting row $1$ and column $j$ of $A$, the determinant of $A$ can be calculated as follows (cofactor expansion along the first row):

$$\det(A) = \sum_{j=1}^{n} (-1)^{1+j} \, a_{1j} \det(A_{1j})$$

Properties of the determinant:

  • $\det(AB) = \det(A)\det(B)$
  • $\det(A^T) = \det(A)$
  • $\det(cA) = c^n \det(A)$ for a scalar $c$
  • if $A$ is triangular, $\det(A)$ is the product of its diagonal entries
  • $\det(A^{-1}) = 1/\det(A)$ when $A$ is invertible

A real number $\lambda$ is an eigenvalue of $A$ if there exists a non-zero vector $v$ (an eigenvector) in $\mathbb{R}^n$ such that:

$$Av = \lambda v$$

The determinant of the matrix $A - \lambda I$ is called the characteristic polynomial of $A$. The equation $\det(A - \lambda I) = 0$ is called the characteristic equation of $A$, and the eigenvalues of $A$ are the real roots of this equation. It can be shown that:

$$\det(A) = \prod_{i=1}^{n} \lambda_i, \qquad \operatorname{tr}(A) = \sum_{i=1}^{n} \lambda_i$$

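As a quick numerical check (a minimal NumPy sketch; the matrix below is chosen purely for illustration), we can compute the eigenvalues of a small matrix and verify that their product equals the determinant and their sum equals the trace:

```python
import numpy as np

# An arbitrary 3x3 matrix chosen for illustration.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# Eigenvalues are the roots of the characteristic equation det(A - lambda*I) = 0.
eigvals = np.linalg.eigvals(A)

print(np.prod(eigvals), np.linalg.det(A))  # product of eigenvalues ~= det(A)
print(np.sum(eigvals), np.trace(A))        # sum of eigenvalues ~= trace(A)
```
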
Matrix $A$ is invertible if there exists a matrix $B$ such that $AB = BA = I$. A square matrix is invertible if and only if its determinant is non-zero. A non-square matrix does not have an inverse.

An $n \times n$ matrix $A$ is called diagonalizable if and only if it has $n$ linearly independent eigenvectors. Let $V = [v_1, \dots, v_n]$ denote the matrix whose columns are the eigenvectors of $A$ and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n)$ denote the diagonal matrix of the corresponding eigenvalues. Then:

$$A = V \Lambda V^{-1}$$

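A minimal sketch of this factorization in NumPy (reusing the illustrative matrix from above; `np.linalg.eig` returns the eigenvectors as columns):

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# Columns of V are eigenvectors; Lambda is the diagonal matrix of eigenvalues.
eigvals, V = np.linalg.eig(A)
Lambda = np.diag(eigvals)

# Reconstruct A = V Lambda V^{-1}
A_reconstructed = V @ Lambda @ np.linalg.inv(V)
print(np.allclose(A, A_reconstructed))  # True
```
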
If matrix $A$ is symmetric, then:

  • all eigenvalues of $A$ are real numbers
  • eigenvectors of $A$ corresponding to distinct eigenvalues are orthogonal

Matrix $A$ is positive semi-definite if and only if any of the following holds:

  • $x^T A x \ge 0$ for any vector $x$
  • all eigenvalues of $A$ are non-negative
  • all principal submatrices of $A$ (not just the upper-left ones) have non-negative determinants

Matrix $A$ is positive definite if and only if any of the following holds:

  • $x^T A x > 0$ for any non-zero vector $x$
  • all eigenvalues of $A$ are positive
  • all the upper-left submatrices of $A$ have positive determinants (Sylvester's criterion)

All covariance and correlation matrices must be symmetric and positive semi-definite. If there is no perfect linear dependence between the random variables, then the matrix is positive definite.
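
A small sketch of these checks on a sample covariance matrix (the data is randomly generated for illustration): the eigenvalue test confirms positive semi-definiteness, and a successful Cholesky factorization confirms positive definiteness.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))      # 500 observations, 3 variables (illustrative data)
S = np.cov(X, rowvar=False)        # sample covariance matrix, symmetric by construction

# PSD check: all eigenvalues non-negative (allow tiny negatives from round-off).
eigvals = np.linalg.eigvalsh(S)
print("positive semi-definite:", np.all(eigvals >= -1e-12))

# PD check: Cholesky succeeds only for (numerically) positive definite matrices.
try:
    np.linalg.cholesky(S)
    print("positive definite: True")
except np.linalg.LinAlgError:
    print("positive definite: False")
```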

Let $A$ be an invertible matrix. The LU decomposition breaks $A$ down as the product of a lower triangular matrix $L$ and an upper triangular matrix $U$, i.e. $A = LU$. One application is solving the linear system $Ax = b$ in two triangular steps (see the sketch after this list):

  • solve $Ly = b$ for $y$ by forward substitution
  • solve $Ux = y$ for $x$ by back substitution
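
A minimal sketch using SciPy's LU routines (`scipy.linalg.lu_factor` / `lu_solve`); the small system is arbitrary and chosen for illustration:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[3.0, 1.0, 2.0],
              [6.0, 3.0, 4.0],
              [3.0, 1.0, 5.0]])
b = np.array([1.0, 2.0, 3.0])

# Factor once (returns the combined LU matrix and pivot indices)...
lu, piv = lu_factor(A)

# ...then reuse the factorization to solve for any right-hand side.
x = lu_solve((lu, piv), b)
print(np.allclose(A @ x, b))  # True
```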

Let $A$ be a symmetric positive definite matrix. The Cholesky decomposition builds on the LU decomposition and breaks $A$ down as $A = U^T U$, where $U$ is a unique upper triangular matrix with positive diagonal entries. The Cholesky decomposition can be used to generate correlated random variables in Monte Carlo simulation.
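
A sketch of that Monte Carlo use case, with a target correlation matrix chosen for illustration: factor the matrix, then multiply i.i.d. standard normal draws by the Cholesky factor to impose the desired correlation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Target correlation matrix (illustrative, symmetric positive definite).
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])

# NumPy returns the lower triangular factor L with R = L @ L.T.
L = np.linalg.cholesky(R)

# Independent standard normals -> correlated normals with correlation ~ R.
Z = rng.standard_normal(size=(100_000, 3))
X = Z @ L.T

print(np.round(np.corrcoef(X, rowvar=False), 2))  # close to R
```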

Matrix Interpretation

Consider an $n \times p$ data matrix $X$, where each of the $n$ rows is an observation and each of the $p$ columns is a feature:

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}$$

To find the first principal component $Z_1$, we define it as the normalized linear combination of the features $X_1, \dots, X_p$ that has the largest variance, where its loadings $\phi_{11}, \dots, \phi_{p1}$ are normalized to have unit sum of squares:

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p, \qquad \sum_{j=1}^{p} \phi_{j1}^2 = 1$$

Or equivalently, for each score $z_{i1}$:

$$z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip}, \qquad i = 1, \dots, n$$

In matrix form, with $\phi_1 = (\phi_{11}, \dots, \phi_{p1})^T$:

$$z_1 = X \phi_1$$

Finally, the first principal component loading vector $\phi_1$ solves the optimization problem that maximizes the sample variance of the scores $z_{11}, \dots, z_{n1}$ (assuming the columns of $X$ are centered). The objective function can be formulated as follows and solved via an eigen decomposition:

$$\max_{\phi_1} \; \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{j1} x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1$$

To find the second principal component loading vector $\phi_2$, use the same objective function with $\phi_1$ replaced by $\phi_2$, and include the additional constraint that $\phi_2$ is orthogonal to $\phi_1$.
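
A minimal sketch of this optimization solved through an eigen decomposition, on randomly generated data used only for illustration: the top eigenvector of the sample covariance matrix is the first loading vector $\phi_1$, and the second eigenvector gives $\phi_2$, orthogonal to $\phi_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)              # center each column

S = X.T @ X / X.shape[0]            # sample covariance matrix (1/n convention)

# eigh returns eigenvalues of a symmetric matrix in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
phi1 = eigvecs[:, -1]               # loading vector of the 1st principal component
phi2 = eigvecs[:, -2]               # loading vector of the 2nd principal component

z1 = X @ phi1                       # scores of the first principal component
print(np.isclose(z1.var(), eigvals[-1]))   # variance of the scores = largest eigenvalue
print(np.isclose(phi1 @ phi2, 0.0))        # loading vectors are orthogonal
```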

Geometric Interpretation

The loading matrix $W$ defines a linear transformation that projects the data $X$ from the $p$-dimensional feature space into a $k$-dimensional subspace ($k < p$) in which the data has the most variance. The result of the projection, $F = XW$, is the factor matrix, also known as the matrix of principal component scores.

In other words, the principal component vectors span a low-dimensional linear subspace that is the closest (in average squared Euclidean distance) to the observations.
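
A sketch of this "closest subspace" property, again on illustrative random data (the `avg_sq_error` helper is defined only for this example): projecting onto the subspace spanned by the top $k$ eigenvectors gives a smaller average squared reconstruction error than projecting onto a random $k$-dimensional subspace.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))   # correlated, illustrative data
X = X - X.mean(axis=0)

k = 2
S = X.T @ X / X.shape[0]
_, eigvecs = np.linalg.eigh(S)
W = eigvecs[:, -k:]                     # top-k loading vectors (p x k)

# Random k-dimensional subspace for comparison (orthonormal basis via QR).
Q, _ = np.linalg.qr(rng.normal(size=(X.shape[1], k)))

def avg_sq_error(X, B):
    """Average squared Euclidean distance from rows of X to their projection onto span(B)."""
    X_hat = X @ B @ B.T
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

print(avg_sq_error(X, W) <= avg_sq_error(X, Q))  # True: the PCA subspace is closest
```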

Eigen Decomposition

Given an $n \times p$ data matrix $X$ (assumed centered), the objective of PCA is to find a lower-dimensional factor matrix $F$, from which a matrix $\hat{X}$ can be constructed such that the distance between the covariance matrices of $X$ and $\hat{X}$ is minimized.

The covariance matrix $\Sigma = \frac{1}{n} X^T X$ of $X$ is a symmetric positive semi-definite matrix, therefore we have the following decomposition, where the $v_i$'s are eigenvectors of $\Sigma$ and the $\lambda_i$'s are the eigenvalues:

$$\Sigma = V \Lambda V^T = \sum_{i=1}^{p} \lambda_i v_i v_i^T$$

Note that an eigenvalue $\lambda_i$ can be zero if the columns of $X$ are linearly dependent.

If we ignore the constant $\frac{1}{n}$ and define the loading matrix $W = V$ and the factor matrix $F = XW$, where $X = FW^T$ (since $V$ is orthogonal), then:

$$F^T F = W^T X^T X W = V^T (V \Lambda V^T) V = \Lambda$$

That is, the factors are uncorrelated and the variance of the $i$-th factor equals $\lambda_i$.

Now comes the PCA idea: let's rank the $\lambda_i$'s in descending order and pick $k$ such that the leading eigenvalues account for most of the total variance:

$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i} \approx 1$$

Now we observe that the truncated matrix $\sum_{i=1}^{k} \lambda_i v_i v_i^T$ is also a positive semi-definite matrix. Following a similar decomposition, we obtain a $p \times k$ loading matrix $W_k = [v_1, \dots, v_k]$ and an $n \times k$ factor matrix $F_k = X W_k$, where:

$$\hat{X} = F_k W_k^T, \qquad \Sigma_{\hat{X}} = \frac{1}{n} \hat{X}^T \hat{X} = \sum_{i=1}^{k} \lambda_i v_i v_i^T \approx \Sigma$$

There we have it: a dimension-reduced factor matrix $F_k$, whose projection back to the original $p$-dimensional space, $\hat{X} = F_k W_k^T$, has approximately the same covariance as the original dataset $X$.
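
A sketch of this construction end to end, on illustrative random data: eigen-decompose the covariance of $X$, keep the top $k$ eigenvectors, build $F_k$ and $\hat{X} = F_k W_k^T$, and compare the two covariance matrices.

```python
import numpy as np

rng = np.random.default_rng(3)
# Illustrative data: 6 observed variables driven by 2 latent factors plus noise.
latent = rng.normal(size=(1000, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(1000, 6))
X = X - X.mean(axis=0)

Sigma = X.T @ X / X.shape[0]

# Eigen decomposition; reorder so the eigenvalues are descending.
eigvals, V = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

k = 2
W_k = V[:, :k]          # p x k loading matrix
F_k = X @ W_k           # n x k factor matrix (dimension-reduced representation)
X_hat = F_k @ W_k.T     # projection back to the original p-dimensional space

Sigma_hat = X_hat.T @ X_hat / X_hat.shape[0]
print(np.max(np.abs(Sigma - Sigma_hat)))   # small: the covariances nearly match
```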

Practical Considerations

PCA excels at identifying latent variables underlying the measurable variables. PCA can only be applied to numeric data; categorical variables need to be binarized (e.g. one-hot encoded) beforehand.

  • Centering: yes, each variable should be centered to have mean zero before performing PCA.

  • Scaling:

    • if the range and scale of the variables are different, the correlation matrix is typically used to perform PCA, i.e. each variable is scaled to have a standard deviation of $1$
    • otherwise, if the variables are in the same units of measure, using the covariance matrix (i.e. not standardizing the variables) could reveal interesting properties of the data
  • Uniqueness: each loading vector $\phi_j$ is unique up to a sign flip, as it can point in either of two opposite directions along the same axis. The same applies to the score vector $z_j$, as $\operatorname{Var}(z_j) = \operatorname{Var}(-z_j)$.

  • Proportion of Variance Explained: assuming centered variables, the total variance in the data set is given by the first formula below. The variance explained by the $m$-th principal component is $\frac{1}{n} \sum_{i=1}^{n} z_{im}^2$. Therefore, the second formula gives the $PVE_m$ of the $m$-th principal component:

    $$\sum_{j=1}^{p} \operatorname{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2$$

    $$PVE_m = \frac{\sum_{i=1}^{n} \big( \sum_{j=1}^{p} \phi_{jm} x_{ij} \big)^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}$$

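A sketch of the PVE computation on standardized data (correlation-matrix PCA), using illustrative random data; the PVEs are equivalent to the eigenvalue ratios and sum to one.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # illustrative data

# Standardize: center and scale to unit standard deviation (correlation-matrix PCA).
X = (X - X.mean(axis=0)) / X.std(axis=0)

eigvals, V = np.linalg.eigh(X.T @ X / X.shape[0])
eigvals = eigvals[::-1]                  # descending order

Z = X @ V[:, ::-1]                       # scores for each principal component
pve = (Z ** 2).sum(axis=0) / (X ** 2).sum()
print(np.round(pve, 3), pve.sum())                  # PVE per component; sums to 1.0
print(np.allclose(pve, eigvals / eigvals.sum()))    # equivalently, eigenvalue ratios
```
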
Reference:

  • An Introduction to Statistical Learning with Applications in R, James, Witten, Hastie and Tibshirani
  • FINM 33601 Lecture Note, Y. Balasanov, University of Chicago