## Background

Many proximity-based algorithms can be used to detect potential outliers. For example, KNN (K-Nearest Neighbors) is a distance-based algorithm that works well at capturing outliers with extreme values. In particular, when KNN uses the Mahalanobis distance as its distance metric, it tends to capture more meaningful and subtle outliers because it accounts for the correlations between features.

More specifically, Euclidean distance is most appropriate when the features are 1) uncorrelated and 2) have variances equal to 1. The second condition can usually be satisfied by standardizing the features, while the first usually remains an issue.

The Mahalanobis distance, on the other hand, first transforms the features into uncorrelated features with variances equal to 1. Loosely speaking, it "divides" the features by the square root of their covariance matrix (in practice, it multiplies them by the inverse square root, $\Sigma^{-1/2}$), and then applies the Euclidean distance to the transformed features.
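
As a small sanity check of this equivalence, the sketch below whitens two correlated features and compares the Euclidean distance in the whitened space with the Mahalanobis distance computed directly. The synthetic data and all variable names are illustrative assumptions, not part of the original setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, strongly correlated 2-D data (illustrative only).
true_cov = np.array([[4.0, 1.8],
                     [1.8, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=1000)

# Sample covariance and its inverse.
sigma = np.cov(X, rowvar=False)
sigma_inv = np.linalg.inv(sigma)

# Whitening matrix W = Sigma^{-1/2}, built from the eigendecomposition.
eigvals, eigvecs = np.linalg.eigh(sigma)
W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

# Whitened features are uncorrelated with unit variance.
X_white = X @ W.T
white_cov = np.cov(X_white, rowvar=False)

# Mahalanobis distance computed directly vs. Euclidean distance after whitening.
diff = X[0] - X[1]
d_mahalanobis = np.sqrt(diff @ sigma_inv @ diff)
d_euclid_white = np.linalg.norm(X_white[0] - X_white[1])

print(white_cov.round(6))              # approximately the identity matrix
print(d_mahalanobis, d_euclid_white)   # the two distances agree up to rounding
```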

## Problem

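Given two observations $x$ and $y$, the Mahalanobis distance between them is:

$$d_M(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$$

where $\Sigma$ is the covariance matrix of the features.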

While KNN with Mahalanobis distance is available in some existing modules (e.g. in pyod), runtime is usually a concern: the matrix multiplication by $\Sigma^{-1}$ has to be carried out in every pairwise distance calculation, which significantly increases the runtime (imagine doing this with 10 million observations).
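
To make the cost concrete, here is a deliberately naive sketch on a small synthetic sample. It is not how any particular library implements the search, and the function name `pairwise_mahalanobis_naive` is hypothetical. The point is only that $\Sigma^{-1}$ is applied once per pair, and the number of pairs grows quadratically with the number of observations.

```python
import numpy as np

def pairwise_mahalanobis_naive(X, sigma_inv):
    """Mahalanobis distance for every pair of rows; Sigma^{-1} is applied once per pair."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            diff = X[i] - X[j]
            # O(d^2) matrix-vector product, repeated for each of the n(n-1)/2 pairs.
            D[i, j] = D[j, i] = np.sqrt(diff @ sigma_inv @ diff)
    return D

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # small synthetic sample (illustrative)
sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))

D = pairwise_mahalanobis_naive(X, sigma_inv)
n_pairs = X.shape[0] * (X.shape[0] - 1) // 2
print(n_pairs)  # 19900 pairs for only 200 observations
```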

## Solution

To address this problem, we can first find a square-root matrix $A$ such that $A^TA = \Sigma^{-1}$, and transform each observation once:

$$x' = Ax, \qquad y' = Ay$$
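
One way to obtain such a matrix, offered here as an assumption rather than the method the original necessarily used, is a Cholesky factorization: if $\Sigma^{-1} = LL^T$ with $L$ lower triangular, then $A = L^T$ satisfies $A^TA = \Sigma^{-1}$. This requires $\Sigma^{-1}$ to be symmetric positive definite. The data and the mixing matrix below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed, well-conditioned mixing matrix to create correlated synthetic features.
mix = np.array([[1.0, 0.5, 0.0, 0.0],
                [0.0, 1.0, 0.5, 0.0],
                [0.0, 0.0, 1.0, 0.5],
                [0.0, 0.0, 0.0, 1.0]])
X = rng.normal(size=(500, 4)) @ mix

sigma = np.cov(X, rowvar=False)
sigma_inv = np.linalg.inv(sigma)

# np.linalg.cholesky returns a lower-triangular L with L @ L.T == sigma_inv.
L = np.linalg.cholesky(sigma_inv)
A = L.T  # hence A.T @ A == L @ L.T == sigma_inv

max_error = np.abs(A.T @ A - sigma_inv).max()
print(max_error)  # close to zero (floating-point rounding only)
```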

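We can then run the ordinary KNN algorithm (with Euclidean distance) on the transformed observations $x'$ and $y'$, which is significantly faster because the multiplication by $\Sigma^{-1}$ is no longer repeated for every pair:

$$d_M(x, y) = \sqrt{(x - y)^T A^T A (x - y)} = \sqrt{(x' - y')^T (x' - y')} = \lVert x' - y' \rVert_2$$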
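
A minimal sketch of this step follows. It uses a brute-force NumPy neighbour search purely for illustration (the function name `knn_distances` is hypothetical); in practice any Euclidean KNN implementation could take its place. Scoring each point by the distance to its k-th nearest neighbour is one common convention and is an assumption here, as are the synthetic data and the planted outlier.

```python
import numpy as np

def knn_distances(Z, k):
    """Euclidean distance from each row of Z to its k-th nearest neighbour (brute force)."""
    sq_norms = (Z ** 2).sum(axis=1)
    # Squared distances via ||a||^2 + ||b||^2 - 2 a.b; clip tiny negatives caused by rounding.
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * Z @ Z.T, 0.0)
    np.fill_diagonal(sq_dists, np.inf)  # a point is not its own neighbour
    kth = np.partition(sq_dists, k - 1, axis=1)[:, k - 1]
    return np.sqrt(kth)

rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.9],
                [0.0, 0.4]])
X = rng.normal(size=(300, 2)) @ mix   # synthetic, positively correlated features
X[0] = [2.0, -2.0]                    # planted point that breaks the correlation pattern

sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
A = np.linalg.cholesky(sigma_inv).T   # A.T @ A == sigma_inv

X_t = X @ A.T                         # transform every observation once: x' = A x
scores = knn_distances(X_t, k=5)      # plain Euclidean KNN in the transformed space

print(int(np.argmax(scores)))         # the planted point (index 0) should rank highest
```

The transformation costs one matrix product over the whole dataset, after which no distance computation involves $\Sigma^{-1}$ again.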

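You may ask: what if the computed square-root matrix $A$ is not real, in which case the transformed $x'$ and $y'$ will have imaginary parts? (A real square root always exists when $\Sigma^{-1}$ is symmetric positive definite, but numerical routines can return complex entries when the estimated matrix is nearly singular.) Luckily, it seems that most of the time we can simply use the real part of $A$ for the transformation and the subsequent Mahalanobis distance calculation, provided that the estimation error ($A^TA - \Sigma^{-1}$) is small.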
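
The check on the estimation error can be sketched as follows. This is an illustration under stated assumptions: the matrix standing in for $\Sigma^{-1}$ is built by hand with one tiny negative eigenvalue to mimic numerical error, and the square root is computed through an eigendecomposition, which may differ from the routine used originally.

```python
import numpy as np

# Hand-built symmetric matrix with one slightly negative eigenvalue, mimicking the
# rounding error of a nearly singular covariance estimate (illustrative only).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix
sigma_inv = Q @ np.diag([2.0, 1.0, -1e-12]) @ Q.T

# Symmetric square root via eigendecomposition; a negative eigenvalue
# produces an imaginary component in the result.
w, V = np.linalg.eigh(sigma_inv)
A_complex = V @ np.diag(np.sqrt(w.astype(complex))) @ V.T

# Keep only the real part, then measure how well it reproduces sigma_inv.
A = A_complex.real
error = np.abs(A.T @ A - sigma_inv).max()

print(np.abs(A_complex.imag).max() > 0)  # True: the exact square root is complex
print(error)                             # tiny, so the real part is a usable substitute here
```

If the error were not small relative to the entries of $\Sigma^{-1}$, the distances computed from the real part would no longer approximate the Mahalanobis distance well.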