PCA finds low dimensional representation of a dataset that contains as much as possible of the variation. As each of the $n$ observations lives on a $p$-dimensional space, and not all dimensions are equally interesting.

# Linear Algebra Review

Let $A$ be a $n\times n$ matrix. With $n=2, \ 3$, the determinant of $A$ can be calculated as follow.

Properties of determinant:

A real number $\lambda$ is an eigenvalue of $A$ if there exists a non-zero vector $x$ (eigenvector) in $\mathbb{R}^n$ such that:

The determinant of matrix $A - \lambda I$ is called the characteristic polynomial of $A$. The equation $det(A - \lambda I)$ is called the characteristic equation of $A$, where the eigenvalues $\lambda$ are the real roots of the equation. It can be shown that:

Matrix $A$ is invertible if there exists a $n\times n$ matrix $B$ such that $AB = BA = I$. A square matrix is invertible if and only if its determinant is non-zero. A non-square matrix do not have an inverse.

Matrix $A$ is called diagonalizable if and only if it has linearly independent eigenvectors. Let $\textbf{U}$ denote the eigen vectors of A and $\textbf{D}$ denote the diagonal $\lambda$ vector. Then:

If matrix $A$ is symmetric, then:

• all eigenvalues of $A$ are real numbers
• all eigenvectors of $A$ from distinct eigenvalues are orthogonal

Matrix $A$ is positive semi-definite if and only if any of the following:

• for any $n\times 1$ matrix $x$, $x^TAx \geq 0$
• all eigenvalues of $A$ are non-negative
• all the upper left submatrices $A_K$ have non-negative determinants.

Matrix $A$ is positive definite if and only if any of the following:

• for any $n\times 1$ matrix $x$, $x^TAx > 0$
• all eigenvalues of $A$ are positive
• all the upper left submatrices $A_K$ have positive determinants.

All covariance, correlation matrices must be symmetric and positive semi-definite. If there is no perfect linear dependence between random variables, then it must be positive definite.

Let $A$ be an invertible matrix, the LU decomposition breaks down $A$ as the product of a lower triangle matrix $L$ and upper triangle matrix $U$. Some applications are:

• solve $Ax=b$: $LUx=b \rightarrow Ly=b \text{ ; } Ux=y$
• solve $det(A)$: $det(A) = det(L)\ det(U)=\prod L_{i, \ i}\prod U_{j, \ j}$

Let $A$ be a symmetric positive definite matrix, the Cholesky decomponsition expand on the LU decomposition and breaks down $A=U^TU$, where $U$ is a unique upper triangular matrix with positive diagonal entries. Cholesky decomposition can be used to generate correltaed random variables in Monte Carlo simulation

# Matrix Interpretation

Consider a $n\times p$ matrix:

To find the first principal component $F^1$, we define it as the normalized linear combination of $X$ that has the largest variance, where its loading $\phi^1_j$ are normalized: $\sum^p_{j=1} (\phi^1_j)^2 = 1$

Or equivalently, for each score: $F^1_i = \sum_{j=1}^{p} \phi^1_jx_{ij}$

In matrix form:

Finally, the first principal component loading vector $\phi^1$ solves the optimization problem that maximize the sample variance of the scores $f^1$. An objective function can be formulated as follow and solved via an eigen decomposition:

To find the second principal component loading $\phi^2$, use the same objective function with $\phi^2$ replacement and include an additional constraint that $\phi^2$ is orthogonal to $\phi^1$.

# Geometric Interpretation

The $p\times k$ loading matrix $L = [\phi^1 \dots \phi^k]$ defines a linear transformation that projects the data from the feature space $\mathbb{R}^p$ into a subspace $\mathbb{R}^k$, in which the data has the most variance. The result of the projection is the factor matrix $F = [F^1 \dots F^k]$, also known as the principal components.

In other words, the principal components vectors $F^1 ... F^k$ forms a low-dimensional linear subspace that are the closest (shortest average squared Euclidean distance) to the observations.

# Eigen Decomposition

Given $n\times p$ data matrix $X$, the objective of PCA is to find a lower dimension representation factor matrix $F$, from which a $n\times p$ matrix $\tilde{X}$ can be constructed where distance between the covariance matrices $cov(X)$ and $cov(\tilde{X})$ are minimized.

The covariance matrix of $X$ is a $p\times p$ symmetric positive semi-definite matrix, therefore we have the following decomposition where $\textbf{u}$‘s’ are $p\times 1$ eigenvectors of $cov(X)$ and $\lambda$‘s are the eigenvalues. Note that $\textbf{u}$ can be a zero vector if the columns of $cov(X)$ are linearly dependent.

If we ignore the constant $1/(n-1)$, and define the $p\times p$ loading matrix $L_0=[\textbf{u}_1, \dots, \textbf{u}_p]$ and $n\times p$ factor matrix $F_0$ where $F_0^TF_0=\Lambda$. Then:

Now comes the PCA idea: Let’s rank the $\lambda_i$‘s in descending order, pick $k < p$ such that:

Now we observe that the matrix $cov(\tilde{X})$ is also a $p\times p$ positive semi-definite matrix. Following similar decomposition, we obtain a $p\times k$ matrix $L$ and $n\times k$ matrix $F$, where:

Here we have it, a dimension-reduced $n\times k$ factor matrix $F$, where its projection back to $n\times p$ space, $\tilde{X}$, has similar covariance as the original $n\times p$ dataset $X$.

# Practical Considerations

PCA excels at identifying latent variables from the measurable variables. PCA can only be applied to numeric data, while categorical variables need to be binarized beforehand.

• Centering: yes.

• Scaling:

• if the range and scale of the variables are different, correlation matrix is typically used to perform PCA, i.e. each variables are scaled to have standard deviation of $1$
• otherwise if the variables are in the same units of measure, using the covariance matrix (not standardizing) the variables could reveal interesting properties of the data
• Uniqueness: each loading vector $\phi^1$ is unique up to a sign flip, as the it can take on opposite direction in the same subspace. Same applies to the score vector $Z^1$, as $var(Z^1) = var(-Z^1)$

• Propotional of Variance Explained: we can compute the total variance in a data set in the first formula below. The variance explained by the $m$-th principal component is: $\dfrac{1}{n} \sum_{i=1}^n (z^m_i)^2$. Therefore, the second formula can be computed for the $PVE$:

Reference:

• An Introduction to Statistical Learning with Applications in R, James, Witten, Hastie and Tibshirani
• FINM 33601 Lecture Note, Y. Balasanov, University of Chicago