Yet Another PCA Explanation
I saw a post (since deleted) the other day on /r/MLQuestions asking about PCA, and it made me re-realize that, despite being commonly used by data scientists and scientists, PCA is often not well understood. Most explanations I come across either don’t include any linear algebra, which makes PCA seem like magic, or go all in on the math and lose the intuition (admittedly, I find the Wikipedia article on PCA pretty thorough, although the math is maybe a little advanced). In this blog post, I’m going to try to provide yet another explanation of PCA using data visualization (which PCA is often used for) as the motivation. I’ll be integrating linear algebra, Python code, and scatter plots to hopefully provide a complete picture of the technique.
Suppose we’re given a dataset X with only two columns, i.e., we’re not interested in doing dimensionality reduction at all. We’ll use two columns of the classic Iris dataset for demonstration purposes.
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
X = X - X.mean(axis=0)
y = iris.target

print(X.mean(axis=0))
# [-3.3158661e-16 -6.6317322e-16]
Notice that I “centered” the data (i.e., subtracted the mean vector) so it now has a mean vector of [0, 0]. This will come in handy later. Now, when we plot the data, we can see that it seems to “spread” primarily along the diagonal going up and to the right.
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.axis("equal")
plt.show()
Let’s say we find this aesthetically unpleasing and would prefer it if the data was “spreading” primarily along the horizontal axis. We’re going to try to figure out a way to transform X into a new dataset, X_N, that exhibits the desired horizontal spread, but with the restriction that the transformation must preserve the information contained in the original dataset. Specifically, we’re going to linearly transform X using an orthogonal matrix, which means the distance between any two vectors (i.e., rows) in the dataset will be the same after the transformation (meaning all of the information in the original dataset is conserved). To be more concrete about our notion of “spread”, we’ll say we want to maximize the variance in the first coordinate (column) of X_N.
To get a sense for how much work we need to do, let’s take a look at the variances of the two coordinates in the original dataset.
import numpy as np

var_x = np.sum(X[:, 0] ** 2) / len(X)
var_y = np.sum(X[:, 1] ** 2) / len(X)
total_var = var_x + var_y
print(f"Variance in x: {var_x:.4f}")
print(f"Proportion for x: {var_x / total_var:.4f}")
print(f"Variance in y: {var_y:.4f}")
print(f"Proportion for y: {var_y / total_var:.4f}")
# Variance in x: 0.6811
# Proportion for x: 0.1804
# Variance in y: 3.0955
# Proportion for y: 0.8196
So about 18% of the total variance (i.e., the sum of the individual coordinate variances) is in the first coordinate. Obviously, we could simply swap the two coordinates, so 82% is the number we need to beat.
# Swap the coordinates using a permutation matrix.
P = np.array([
[0, 1],
[1, 0]])
X_P = (P @ X.T).T

var_x = np.sum(X_P[:, 0] ** 2) / len(X_P)
var_y = np.sum(X_P[:, 1] ** 2) / len(X_P)
total_var = var_x + var_y
print(f"Variance in x: {var_x:.4f}")
print(f"Proportion for x: {var_x / total_var:.4f}")
print(f"Variance in y: {var_y:.4f}")
print(f"Proportion for y: {var_y / total_var:.4f}")
# Variance in x: 3.0955
# Proportion for x: 0.8196
# Variance in y: 0.6811
# Proportion for y: 0.1804

plt.scatter(X_P[:, 0], X_P[:, 1], c=y)
plt.axis("equal")
plt.show()
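As a quick sanity check (a minimal sketch, and not strictly needed for what follows), we can verify both claims from earlier: P is orthogonal, and the pairwise distances between samples are unchanged by the swap (which is what “preserving the information” means here).

# P is orthogonal: its transpose is its inverse.
print(np.allclose(P.T @ P, np.eye(2)))

# The distance between every pair of samples is the same before and after the swap.
dists_orig = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
dists_swap = np.sqrt(((X_P[:, None, :] - X_P[None, :, :]) ** 2).sum(axis=-1))
print(np.allclose(dists_orig, dists_swap))
# Both prints should be True.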
At this point, I’m going to do something that might seem kind of weird at first and change our focus to the transpose of X, X_T, instead of X.
X_T = X.T
The reason I’m doing this is because we’re interested in transforming the data samples (i.e., the rows of X), and it makes some of the math more intuitive if we can think of each data sample as a column instead of a row.
Now for our first linear algebra concept. A useful way to think about a single sample of data (like X_T[:, 0] = [-0.7433, -2.3587]) is as a coordinate vector corresponding to the standard basis. The standard basis for two dimensions consists of two vectors:
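$$e_x = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad e_y = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$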
To obtain our sample vector, we multiply its first coordinate by the first basis vector and its second coordinate by the second basis vector, and then we add these scaled vectors together, i.e.:
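$$v_1 = (-0.7433)\, e_x + (-2.3587)\, e_y = (-0.7433) \begin{bmatrix} 1 \\ 0 \end{bmatrix} + (-2.3587) \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} -0.7433 \\ -2.3587 \end{bmatrix}$$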
A convenient way to write the above is as a matrix vector multiplication with the basis vectors combined as columns in a single matrix:
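$$v_1 = \begin{bmatrix} e_x & e_y \end{bmatrix} c_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} -0.7433 \\ -2.3587 \end{bmatrix} = \begin{bmatrix} -0.7433 \\ -2.3587 \end{bmatrix}$$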
Or, in Python:
e_x = np.array([[1, 0]]).T
e_y = np.array([[0, 1]]).T
standard_basis = np.hstack([e_x, e_y])
print(standard_basis)
# [[1 0]
# [0 1]]

c_1 = X_T[:, 0:1]  # This syntax retains the dimensions of the original array.

# This is the first equation.
v_1 = c_1[0] * e_x + c_1[1] * e_y
print(v_1)
# [[-0.74333333]
# [-2.35866667]]

# Which gives us the same thing as the second equation.
print(standard_basis @ c_1)
# [[-0.74333333]
# [-2.35866667]]
Let’s plot the basis vectors along with the sample vector to see how they compare to each other.
plt.scatter(X[:, 0], X[:, 1], c=y)
# Plot the sample vector.
plt.quiver(
[0], [0], c_1[0], c_1[1], angles="xy", scale_units="xy", scale=1, color=["b"]
)
plt.axis("equal")
# Plot the basis vectors.
plt.quiver(
[0], [0], e_x[0], e_x[1], angles="xy", scale_units="xy", scale=1, color=["g"]
)
plt.quiver(
[0], [0], e_y[0], e_y[1], angles="xy", scale_units="xy", scale=1, color=["r"]
)
plt.show()
The green vector is our first standard basis vector ([1, 0]), the red vector is our second standard basis vector ([0, 1]), and the blue vector is our sample vector ([-0.7433, -2.3587]). Next, let’s scale the basis vectors using the coordinates from our sample vector and add them.
plt.scatter(X[:, 0], X[:, 1], c=y)

# Plot the sample vector.
plt.quiver(
[0], [0], c_1[0], c_1[1], angles="xy", scale_units="xy", scale=1, color=["b"]
)
plt.axis("equal")

# Plot the basis vectors scaled by their coordinates.
plt.quiver(
[0],
[0],
c_1[0] * e_x[0],
c_1[0] * e_x[1],
angles="xy",
scale_units="xy",
scale=1,
color=["g"],
)
plt.quiver(
c_1[0] * e_x[0],
c_1[0] * e_x[1],
c_1[1] * e_y[0],
c_1[1] * e_y[1],
angles="xy",
scale_units="xy",
scale=1,
color=["r"],
)
plt.show()
This is simply visually depicting what we wrote down mathematically here:
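$$v_1 = (-0.7433)\, e_x + (-2.3587)\, e_y$$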
The first (green) basis vector gets shrunk and flipped horizontally (because of the negative sign), and the second (red) basis vector gets stretched and flipped vertically. When we add them together (i.e., lay them head to tail), we get the sample (blue) vector.
Thinking about our sample vectors this way is useful because it suggests we can use a different, potentially more interesting basis (and correspondingly different coordinates) to define them. Another equally valid, albeit useless, basis is the standard basis with each vector divided by two. In that case, our new coordinate vector is simply the old one multiplied by two, i.e., [-1.4867, -4.7173]. Mathematically:
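$$v_1 = (-1.4867)\, \frac{e_x}{2} + (-4.7173)\, \frac{e_y}{2} = \begin{bmatrix} -0.7433 \\ -2.3587 \end{bmatrix}$$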
and in Python:
e_xh = np.array([[1, 0]]).T / 2
e_yh = np.array([[0, 1]]).T / 2
lame_basis = np.hstack([e_xh, e_yh])

c_2 = 2 * c_1
print(c_2)
# [[-1.48666667]
# [-4.71733333]]

v_2 = c_2[0] * e_xh + c_2[1] * e_yh
print(v_2)
# [[-0.74333333]
# [-2.35866667]]

# Same as:
print(lame_basis @ c_2)
# [[-0.74333333]
# [-2.35866667]]
Again, let’s plot the basis vectors and compare them to the sample vector.
plt.scatter(X[:, 0], X[:, 1], c=y)

# Plot the sample vector.
plt.quiver(
[0], [0], c_1[0], c_1[1], angles="xy", scale_units="xy", scale=1, color=["b"]
)
plt.axis("equal")

# Plot the basis vectors.
plt.quiver(
[0], [0], e_xh[0], e_xh[1], angles="xy", scale_units="xy", scale=1, color=["g"]
)
plt.quiver(
[0], [0], e_yh[0], e_yh[1], angles="xy", scale_units="xy", scale=1, color=["r"]
)
plt.show()
Notice how these new basis vectors are half the length of the standard basis vectors. By scaling the coordinates appropriately, we can still build our sample vector using this shrunken basis.
plt.scatter(X[:, 0], X[:, 1], c=y)

# Plot the sample vector.
plt.quiver(
[0], [0], c_1[0], c_1[1], angles="xy", scale_units="xy", scale=1, color=["b"]
)
plt.axis("equal")

# Plot the basis vectors scaled by their coordinates.
plt.quiver(
[0],
[0],
c_2[0] * e_xh[0],
c_2[0] * e_xh[1],
angles="xy",
scale_units="xy",
scale=1,
color=["g"],
)
plt.quiver(
c_2[0] * e_xh[0],
c_2[0] * e_xh[1],
c_2[1] * e_yh[0],
c_2[1] * e_yh[1],
angles="xy",
scale_units="xy",
scale=1,
color=["r"],
)
plt.show()
Again, both the standard basis and the shrunken standard basis are equally valid. One way to think about different bases is as “units”. You can give someone your height in feet, meters, or centimeters, but the physical property of how tall you are stays the same even though the number in front of the unit changes. Stated another way, you can use whatever basis vectors you want (with some caveats) as long as you provide the correct coordinates.
Now for our second linear algebra concept. The dot product of a vector a with another vector b that has a length of one (i.e., a unit norm) tells us how far a extends along the “direction” of b (strictly speaking, this scalar is the signed length of the vector projection of a onto b, but I’ll refer to the whole operation as “vector projection”). Let’s revisit our original coordinate vector with this new knowledge.
The standard basis vectors have a unit norm. Therefore, the dot product of our coordinate vector with the standard basis vectors is vector projection:
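$$e_x^T v_1 = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} -0.7433 \\ -2.3587 \end{bmatrix} = -0.7433, \qquad e_y^T v_1 = \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} -0.7433 \\ -2.3587 \end{bmatrix} = -2.3587$$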
Note that the basis vectors on the left have been transposed. In Python:
print(e_x.T @ v_1)
print(e_y.T @ v_1)
# [[-0.74333333]]
# [[-2.35866667]]
Unsurprisingly, the sample vector v_1 extends along each standard basis vector by exactly the value of the corresponding coordinate, since those coordinates are how we originally defined the vector. But what if we look at other directions? For example, how far do you think v_1 extends along the direction of a normalized (i.e., divided by its length) v_1? Let’s check.
v_1_norm = np.linalg.norm(v_1)
v_1_normed = v_1 / v_1_norm
print(v_1_normed.T @ v_1)
print(v_1_norm)
# [[2.47302505]]
# 2.473025048172559
Interesting… v_1 extends precisely the length of v_1 along the direction of v_1_normed! Now, imagine we decided to use a basis consisting of v_1_normed and v_1_normed_rotated, which is v_1_normed rotated 90°. What do you think v_1’s coordinates would be when using this basis?
R = np.array(
[[np.cos(np.pi / 2), -np.sin(np.pi / 2)], [np.sin(np.pi / 2), np.cos(np.pi / 2)]]
)
v_1_normed_rotated = R @ v_1_normed
print(v_1_normed)
print(v_1_normed_rotated)
# [[-0.30057655]
# [-0.95375769]]
# [[ 0.95375769]
# [-0.30057655]]

c_2_x = v_1_normed.T @ v_1
c_2_y = v_1_normed_rotated.T @ v_1

print(c_2_x)
print(c_2_y)
print(c_2_x * v_1_normed + c_2_y * v_1_normed_rotated)
# [[2.47302505]]
# [[1.11022302e-16]]
# [[-0.74333333]
# [-2.35866667]]
Because v_1 has the same direction as v_1_normed and is orthogonal to v_1_normed_rotated (i.e., it doesn’t extend in that direction at all), the coordinate vector simply consists of the length of v_1 (c_2_x) and zero (c_2_y), i.e., [2.4730, 0]. This means we can write our earlier equality still another way:
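$$v_1 = 2.4730 \begin{bmatrix} -0.3006 \\ -0.9538 \end{bmatrix} + 0 \begin{bmatrix} 0.9538 \\ -0.3006 \end{bmatrix} = \begin{bmatrix} -0.7433 \\ -2.3587 \end{bmatrix}$$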
Let’s plot these basis vectors and compare them to the sample vector.
plt.scatter(X[:, 0], X[:, 1], c=y)

# Plot the sample vector.
plt.quiver(
[0], [0], c_1[0], c_1[1], angles="xy", scale_units="xy", scale=1, color=["b"]
)
plt.axis("equal")

# Plot the basis vectors.
plt.quiver(
[0],
[0],
v_1_normed[0],
v_1_normed[1],
angles="xy",
scale_units="xy",
scale=1,
color=["g"],
)
plt.quiver(
[0],
[0],
v_1_normed_rotated[0],
v_1_normed_rotated[1],
angles="xy",
scale_units="xy",
scale=1,
color=["r"],
)
plt.show()
As expected, the green basis vector points in the same direction as our sample vector (by design) and the red basis vector is orthogonal to the green one. Now to scale them by the coordinates we calculated using vector projection and add them.
plt.scatter(X[:, 0], X[:, 1], c=y)

# Plot the sample vector.
plt.quiver(
[0], [0], c_1[0], c_1[1], angles="xy", scale_units="xy", scale=1, color=["b"]
)
plt.axis("equal")

# Plot the basis vectors scaled by their coordinates.
plt.quiver(
[0],
[0],
c_2_x * v_1_normed[0],
c_2_x * v_1_normed[1],
angles="xy",
scale_units="xy",
scale=1,
color=["g"],
)
plt.quiver(
c_2_x * v_1_normed[0],
c_2_x * v_1_normed[1],
c_2_y * v_1_normed_rotated[0],
c_2_y * v_1_normed_rotated[1],
angles="xy",
scale_units="xy",
scale=1,
color=["r"],
)
plt.show()
The red basis vector has entirely disappeared because its coordinate was zero. The key takeaway here is that:
for any basis consisting of orthonormal vectors [b_1, b_2, …, b_k], we can compute the coordinate vector for a vector x with respect to that basis by calculating the dot product of x with each of the basis vectors.
We now have all of the pieces we need to find our transformation. Stated mathematically, our goal is to find a new orthonormal basis B (where each column of B corresponds to a different basis vector) for X_T such that the variance of the first coordinates of X_N_T is as large as possible.
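$$X_N^T = B^T X^T$$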
Let’s walk through this equation. Each column of X_T is a sample vector. Each row of B_T is a basis vector (because it’s the transpose of B). Therefore, taking the matrix-vector product of B_T and the first column of X_T (X_T[:, 0:1]) gives us the first column of X_N_T, which itself contains the coordinates of that sample vector with respect to the basis B! It’s worth noting at this point that the transpose of an orthogonal matrix (a square matrix whose columns are orthonormal) is its inverse, and that the transpose is itself an orthogonal matrix, which means:
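$$B^T B = B B^T = I$$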
I is the identity matrix, which can also be thought of as the standard basis matrix in this context. This last equation is simply stating in matrix notation what we wrote out earlier for a single column of X_T:
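$$X^T = I\, X^T = B B^T X^T = B X_N^T$$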
Because we only care about maximizing the variance of the first coordinate, we can ignore the second basis vector for now, so let b_1_T equal the first row of B_T. Then, the quantity we are trying to maximize can be written verbosely as:
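$$\frac{1}{m} \sum_{i=1}^{m} \left( b_1^T x_i - \bar{x}_N \right)^2$$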
where m is the number of samples, x_i is the i-th column of X_T (i.e., the i-th sample), and x̄_N is the mean of the first coordinate of X_N, which is equal to:
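$$\bar{x}_N = \frac{1}{m} \sum_{i=1}^{m} b_1^T x_i = b_1^T \bar{x}$$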
Recall that we centered X at the beginning, so the mean sample vector x̄ is a vector of zeros (and therefore so is x̄_N), which means the variance simplifies to:
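$$\frac{1}{m} \sum_{i=1}^{m} \left( b_1^T x_i \right)^2$$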
These next steps follow directly from Wikipedia. Because m is a constant, we can ignore it. Because b_1 has a unit norm, the numerator in the above quantity can be written as:
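$$\sum_{i=1}^{m} \left( b_1^T x_i \right)^2 = b_1^T X^T X\, b_1 = \frac{b_1^T X^T X\, b_1}{b_1^T b_1}$$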
This value is known as the Rayleigh quotient, and it’s maximized when b_1 is the eigenvector corresponding to the largest eigenvalue of:
X_T @ X
So now we know how to compute our first basis vector… phew!
(eig_vals, eig_vecs) = np.linalg.eig(X_T @ X)
max_eig_val_idx = np.argsort(eig_vals)[-1]
max_eig_val = eig_vals[max_eig_val_idx]
b_1 = eig_vecs[:, max_eig_val_idx : max_eig_val_idx + 1]
Because we want B to be orthonormal, the lazy way to find b_2 is to just rotate b_1 90°.
b_2_lazy = R @ b_1
Of course, this method won’t generalize to higher dimensions, so we need to do something more clever. Specifically, we’re going to use our third linear algebra concept, projection. Projection essentially eliminates dimensions from a set of vectors. To see what that means, let’s calculate the coordinates corresponding to b_1 for our whole dataset and plot the scaled basis vectors corresponding to each sample.
X_T_hat = X_T.copy()
# Get the coordinates for b_1.
x_N_T_hat = b_1.T @ X_T_hat
# Scale b_1 by the new coordinates.
scaled_b_1 = b_1 @ x_N_T_hat
X_T_hat_orig = X_T_hat - scaled_b_1
# Plot the scaled basis vectors.
plt.quiver(
X_T_hat_orig[0],
X_T_hat_orig[1],
scaled_b_1[0],
scaled_b_1[1],
angles="xy",
scale_units="xy",
scale=1,
)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.axis("equal")
plt.show()
Notice that the direction of these vectors seems to visually align with the axis of the data with the greatest amount of “spread”. To perform a projection, we’re going to remove this direction from the dataset by subtracting the scaled basis vectors from the samples (this is technically projection via vector rejection!). Let’s see what we’re left with after performing that step.
# Subtract the scaled basis vectors from the dataset (i.e., do the projection via vector rejection).
X_T_hat_proj = X_T_hat - scaled_b_1

# This is equivalent to:
verbose = X_T_hat - b_1 @ b_1.T @ X_T_hat
# which is equivalent to:
I = np.eye(len(b_1))
more_verbose = I @ X_T_hat - b_1 @ b_1.T @ X_T_hat
# which is equivalent to:
projection = (I - b_1 @ b_1.T) @ X_T_hat
# where (I - b_1 @ b_1.T) is a projection matrix, which is why it's a projection by vector rejection!

# Plot the projection vectors.
plt.quiver(
X_T_hat[0],
X_T_hat[1],
-scaled_b_1[0],
-scaled_b_1[1],
angles="xy",
scale_units="xy",
scale=1,
alpha=0.5,
)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.5)
plt.axis("equal")
plt.scatter(X_T_hat_proj[0, :], X_T_hat_proj[1, :], c=y)
plt.show()
So all of the points now lie on a single line, which means we only have one choice for a direction for our second basis vector. Another way of thinking about it is if you calculate the vector projection of these new points on b_1 (the direction we removed), all of the coordinates will be zero:
print(b_1.T @ X_T_hat_proj)
# A bunch of very small numbers, i.e., "zeros".
I’m not going to go into the details of it, but the projection procedure we just performed guarantees this new direction will be orthogonal to the first basis vector’s direction. Let’s plot these new basis vectors and the sample vector like before.
(eig_vals_hat, eig_vecs_hat) = np.linalg.eig(X_T_hat_proj @ X_T_hat_proj.T)
max_eig_val_hat_idx = np.argsort(eig_vals_hat)[::-1][0]
max_eig_val_hat = eig_vals_hat[max_eig_val_hat_idx]
b_2 = eig_vecs_hat[:, max_eig_val_hat_idx : max_eig_val_hat_idx + 1]
c_3_x = b_1.T @ v_1
c_3_y = b_2.T @ v_1
print(c_3_x)
print(c_3_y)
print(c_3_x * b_1 + c_3_y * b_2)
plt.scatter(X[:, 0], X[:, 1], c=y)
# Plot the sample vector.
plt.quiver(
[0], [0], c_1[0], c_1[1], angles="xy", scale_units="xy", scale=1, color=["b"]
)
plt.axis("equal")
# Plot the basis vectors.
plt.quiver(
[0],
[0],
b_1[0],
b_1[1],
angles="xy",
scale_units="xy",
scale=1,
color=["g"],
)
plt.quiver(
[0],
[0],
b_2[0],
b_2[1],
angles="xy",
scale_units="xy",
scale=1,
color=["r"],
)
plt.show()
Scale and add…
plt.scatter(X[:, 0], X[:, 1], c=y)
# Plot the sample vector.
plt.quiver(
[0], [0], c_1[0], c_1[1], angles="xy", scale_units="xy", scale=1, color=["b"]
)
plt.axis("equal")
# Plot the basis vectors scaled by their coordinates.
plt.quiver(
[0],
[0],
c_3_x * b_1[0],
c_3_x * b_1[1],
angles="xy",
scale_units="xy",
scale=1,
color=["g"],
)
plt.quiver(
c_3_x * b_1[0],
c_3_x * b_1[1],
c_3_y * b_2[0],
c_3_y * b_2[1],
angles="xy",
scale_units="xy",
scale=1,
color=["r"],
)
plt.show()
So everything looks right. Putting the projection steps (which are described in detail on Wikipedia) in a for loop will allow us to compute the PCA basis for an arbitrary number of dimensions.
X_T_hat = X_T.copy()
bases = [b_1]
for i in range(X_T_hat.shape[0] - 1):
    # Get the coordinates for the last basis vector we added.
    b = bases[-1]
    x_N_T_hat = b.T @ X_T_hat
    # Scale b by the new coordinates.
    scaled_b = b @ x_N_T_hat
    # Project.
    X_T_hat -= scaled_b
    # Get new basis vector.
    (eig_vals_hat, eig_vecs_hat) = np.linalg.eig(X_T_hat @ X_T_hat.T)
    max_eig_val_hat_idx = np.argsort(eig_vals_hat)[-1]
    max_eig_val_hat = eig_vals_hat[max_eig_val_hat_idx]
    new_b = eig_vecs_hat[:, max_eig_val_hat_idx : max_eig_val_hat_idx + 1]
    bases.append(new_b)

B = np.hstack(bases)
It turns out, the B discovered using this procedure is equivalent to just finding the eigenvectors of:
X_T @ X
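We can sketch a quick check of that claim: B should be orthonormal, and its columns should match the eigenvectors of X_T @ X up to sign and ordering.

# B should be orthonormal, i.e., B.T @ B is (approximately) the identity.
print(np.allclose(B.T @ B, np.eye(B.shape[1])))

# The columns of B should match the eigenvectors of X_T @ X up to sign and ordering.
(eig_vals_all, eig_vecs_all) = np.linalg.eig(X_T @ X)
eig_vecs_sorted = eig_vecs_all[:, np.argsort(eig_vals_all)[::-1]]
print(np.allclose(np.abs(B), np.abs(eig_vecs_sorted)))
# Both prints should be True.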
We can calculate the coordinates for our sample vectors with respect to the PCA basis using vector projection like before.
X_N_T = B.T @ X_T
Now for the moment of truth. Is the variance in the first coordinate greater than 82% of the total variance?
var_x = np.sum(X_N_T[0, :] ** 2) / len(X)
var_y = np.sum(X_N_T[1, :] ** 2) / len(X)
total_var = var_x + var_y
print(f"Variance in x: {var_x:.4f}")
print(f"Proportion for x: {var_x / total_var:.4f}")
print(f"Variance in y: {var_y:.4f}")
print(f"Proportion for y: {var_y / total_var:.4f}")
# Variance in x: 3.6375
# Proportion for x: 0.9632
# Variance in y: 0.1391
# Proportion for y: 0.0368
96% of the variance is now in the first coordinate… hooray! Finally, let’s plot the samples using their new coordinates.
plt.scatter(X_N_T[0, :], X_N_T[1, :], c=y)
plt.axis("equal")
plt.show()
Ahhhh, so horizontal.
While I used a dataset with two dimensions for demonstration purposes, all of the concepts I discussed apply without modification to datasets with more dimensions. The procedure is still the same: find a basis vector such that the first coordinate of the transformed dataset has maximal variance, remove that direction from the dataset, and repeat. By retaining the first k coordinates of X_N, you reduce the dimensionality of your dataset in a way that maximizes the total variance contained in X_N with respect to k orthogonal directions.
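To make that concrete, here’s a minimal sketch of the same idea on all four columns of the Iris dataset, keeping k = 2 coordinates. It takes the eigenvector shortcut instead of the projection loop, and uses sklearn.decomposition.PCA purely as a cross-check (its directions may differ from ours in sign).

from sklearn.decomposition import PCA

# All four Iris features, centered.
X4 = iris.data - iris.data.mean(axis=0)

# The PCA basis: eigenvectors of X4.T @ X4, sorted by decreasing eigenvalue.
(eig_vals4, eig_vecs4) = np.linalg.eig(X4.T @ X4)
B4 = eig_vecs4[:, np.argsort(eig_vals4)[::-1]]

# Coordinates with respect to the PCA basis, keeping only the first k of them.
k = 2
X4_N = (B4.T @ X4.T).T[:, :k]

# Proportion of the total variance retained by the first k coordinates.
print(np.sort(eig_vals4)[::-1][:k].sum() / eig_vals4.sum())

# Cross-check against scikit-learn (agreement up to the sign of each direction).
X4_sk = PCA(n_components=k).fit_transform(X4)
print(np.allclose(np.abs(X4_N), np.abs(X4_sk)))
# The last print should be True.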