An important machine learning method for dimensionality reduction is called Principal Component Analysis.
It is a method that uses simple matrix operations from linear algebra and statistics to calculate a projection of the original data into the same number or fewer dimensions.
In this tutorial, you will discover the Principal Component Analysis machine learning method for dimensionality reduction and how to implement it from scratch in Python.
After completing this tutorial, you will know:
- The procedure for calculating the Principal Component Analysis and how to choose principal components.
- How to calculate the Principal Component Analysis from scratch in NumPy.
- How to calculate the Principal Component Analysis for reuse on more data in scikit-learn.
Let’s get started.
Tutorial Overview
This tutorial is divided into 3 parts; they are:
- Principal Component Analysis
- Manually Calculate Principal Component Analysis
- Reusable Principal Component Analysis
Principal Component Analysis
Principal Component Analysis, or PCA for short, is a method for reducing the dimensionality of data.
It can be thought of as a projection method where data with m columns (features) is projected into a subspace with m or fewer columns, whilst retaining the essence of the original data.
The PCA method can be described and implemented using the tools of linear algebra.
PCA is an operation applied to a dataset, represented by an n x m matrix A that results in a projection of A which we will call B. Let’s walk through the steps of this operation.
The first step is to calculate the mean values of each column.

M = mean(A)

or, element by element, each value in M is the mean of the corresponding column of A.

Next, we need to center the values in each column by subtracting the mean column value.

C = A - M
The next step is to calculate the covariance matrix of the centered matrix C.

V = cov(C)
Correlation is a normalized measure of the amount and direction (positive or negative) in which two columns change together. Covariance is an unnormalized version of correlation computed across multiple columns. A covariance matrix is a calculation of the covariance of a given matrix, with covariance scores for every column with every other column, including itself.
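For concreteness, a minimal NumPy sketch of the centering and covariance steps might look like the following (the matrix values are made up for illustration):

```python
from numpy import array, mean, cov

# define a small matrix with two columns (features)
A = array([[1, 2], [3, 4], [5, 6]])
# calculate the mean of each column
M = mean(A, axis=0)
# center the columns by subtracting the column means
C = A - M
# calculate the covariance matrix of the centered data
# (transpose so that each column of A is treated as one variable)
V = cov(C.T)
print(V)
```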
Finally, we calculate the eigendecomposition of the covariance matrix V. This results in a list of eigenvalues and a list of eigenvectors.
The eigenvectors represent the directions or components for the reduced subspace of B, whereas the eigenvalues represent the magnitudes for those directions.
The eigenvectors can be sorted by the eigenvalues in descending order to provide a ranking of the components or axes of the new subspace for A.
If all eigenvalues have a similar value, then we know that the existing representation may already be reasonably compressed or dense and that the projection may offer little. If there are eigenvalues close to zero, they represent components or axes of B that may be discarded.
A total of m or fewer components must be selected to comprise the chosen subspace. Ideally, we would select k eigenvectors, called principal components, that have the k largest eigenvalues.
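A minimal sketch of this ranking and selection step in NumPy might look as follows (again with made-up values, keeping k=1 components):

```python
from numpy import array, mean, cov, argsort
from numpy.linalg import eig

# small made-up data matrix
A = array([[1, 2], [3, 4], [5, 6]])
# center the columns and compute the covariance matrix
C = A - mean(A, axis=0)
V = cov(C.T)
# eigendecomposition of the covariance matrix
values, vectors = eig(V)
# indices that order the eigenvalues from largest to smallest
order = argsort(values)[::-1]
# keep the k eigenvectors (principal components) with the largest eigenvalues
k = 1
B = vectors[:, order[:k]]
print(values[order])
print(B)
```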
Other matrix decomposition methods, such as the Singular-Value Decomposition, or SVD, can be used to perform the same calculation. Because of this, the values are generally referred to as singular values and the vectors of the subspace are referred to as principal components.
Once chosen, data can be projected into the subspace via matrix multiplication.

P = B^T . A

Where A is the original data that we wish to project, B^T is the transpose of the chosen principal components and P is the projection of A.
This is called the covariance method for calculating the PCA, although there are alternative ways to calculate it.
Manually Calculate Principal Component Analysis
There is no pca() function in NumPy, but we can easily calculate the Principal Component Analysis step-by-step using NumPy functions.
The example below defines a small 3×2 matrix, centers the data in the matrix, calculates the covariance matrix of the centered data, and then the eigendecomposition of the covariance matrix. The eigenvectors and eigenvalues are taken as the principal components and singular values and used to project the original data.
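One way such an example might be written is sketched below (the 3×2 matrix values are illustrative):

```python
from numpy import array, mean, cov
from numpy.linalg import eig

# define a small 3x2 matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# calculate the mean of each column
M = mean(A.T, axis=1)
# center columns by subtracting the column means
C = A - M
# calculate the covariance matrix of the centered matrix
V = cov(C.T)
# eigendecomposition of the covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# project the centered data onto the principal components
P = vectors.T.dot(C.T)
print(P.T)
```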
Running the example first prints the original matrix, then the eigenvectors and eigenvalues of the centered covariance matrix, followed finally by the projection of the original matrix.
Interestingly, we can see that only the first eigenvector is required, suggesting that we could project our 3×2 matrix onto a 3×1 matrix with little loss.
Reusable Principal Component Analysis
We can calculate a Principal Component Analysis on a dataset using the PCA() class in the scikit-learn library. The benefit of this approach is that once the projection is calculated, it can be applied to new data again and again quite easily.
When creating the class, the number of components can be specified as a parameter.
The class is first fit on a dataset by calling the fit() function, and then the original dataset or other data can be projected into a subspace with the chosen number of dimensions by calling the transform() function.
Once fit, the eigenvalues and principal components can be accessed on the PCA class via the explained_variance_ and components_ attributes.
The example below demonstrates using this class by first creating an instance, fitting it on a 3×2 matrix, accessing the values and vectors of the projection, and transforming the original data.
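A sketch of such an example is shown below (again using an illustrative 3×2 matrix):

```python
from numpy import array
from sklearn.decomposition import PCA

# define a small 3x2 matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
# create the PCA instance, keeping both components
pca = PCA(n_components=2)
# fit the projection on the data
pca.fit(A)
# access the principal components (eigenvectors) and explained variance (eigenvalues)
print(pca.components_)
print(pca.explained_variance_)
# transform (project) the original data
B = pca.transform(A)
print(B)
```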
Running the example first prints the 3×2 data matrix, then the principal components and values, followed by the projection of the original matrix.
We can see that, allowing for some minor floating-point rounding, we achieve the same principal components, singular values, and projection as in the previous example.
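Because the fitted PCA object stores the learned projection, it can also be applied to rows that were not seen during fitting. A minimal sketch, where the new rows are made up purely for illustration:

```python
from numpy import array
from sklearn.decomposition import PCA

# fit the projection on an initial dataset
A = array([[1, 2], [3, 4], [5, 6]])
pca = PCA(n_components=1)
pca.fit(A)
# project the original data
print(pca.transform(A))
# reuse the same fitted projection on data not seen during fitting
new_data = array([[7, 8], [9, 10]])
print(pca.transform(new_data))
```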
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
- Re-run the examples with your own small contrived matrix values.
- Load a dataset, calculate the PCA on it, and compare the results from the two methods.
- Search for and locate 10 examples where PCA has been used in machine learning papers.
If you explore any of these extensions, I’d love to know.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
- Section 7.3 Principal Component Analysis (PCA by the SVD), Introduction to Linear Algebra, Fifth Edition, 2016.
- Section 2.12 Example: Principal Components Analysis, Deep Learning, 2016.
Tutorials
- Principal Component Analysis with numpy, 2011.
- PCA and image compression with numpy, 2011.
- Implementing a Principal Component Analysis (PCA), 2014.
Summary
In this tutorial, you discovered the Principal Component Analysis machine learning method for dimensionality reduction.
Specifically, you learned:
- The procedure for calculating the Principal Component Analysis and how to choose principal components.
- How to calculate the Principal Component Analysis from scratch in NumPy.
- How to calculate the Principal Component Analysis for reuse on more data in scikit-learn.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.