Principal Component Analysis





Introduction

The goal of this post is to provide a complete and simplified explanation of Principal Component Analysis, specifically how it works step by step so that anyone can understand and use it without needing a strong mathematical background.

What is Principal Component Analysis?

Principal Component Analysis (PCA) is a dimensionality-reduction method that transforms a large set of variables into a smaller one that still contains most of the information in the original data set.

Reducing the number of variables in a data set naturally costs some accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity: smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze them much more quickly without having to deal with extraneous variables. In short, the goal of PCA is to reduce the number of variables in a data set while retaining as much information as possible.

The sections that follow give a logical explanation of what PCA is doing in each step and simplify the mathematical concepts behind it, such as standardization, covariance, eigenvectors, and eigenvalues, without dwelling on how to compute them by hand.

Standardization

The goal of this step is to standardize the range of continuous initial variables so that they all contribute equally to the analysis. More specifically, it is critical to perform standardization prior to PCA because the latter is very sensitive to the variances of the initial variables. That is, if there are large differences between the ranges of initial variables, those with larger ranges will dominate over those with small ranges (for example, a variable ranging from 0 to 100 will dominate over a variable ranging from 0 to 1), resulting in biased results. As a result, converting the data to comparable scales can help to avoid this problem.

This can be accomplished mathematically by subtracting the mean and dividing by the standard deviation for each value of each variable. After standardization, all variables will be transformed to the same scale.
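As a rough sketch, this step might look like the following in Python with NumPy (the data values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical data set: 5 observations of 3 variables on very different scales.
X = np.array([
    [2.5, 24.0, 0.5],
    [0.5, 10.0, 0.7],
    [2.2, 29.0, 0.3],
    [1.9, 22.0, 0.6],
    [3.1, 30.0, 0.4],
])

# Standardize each variable: subtract its mean and divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Every column now has mean ~0 and standard deviation 1, i.e. a comparable scale.
print(X_std.mean(axis=0).round(6))
print(X_std.std(axis=0).round(6))
```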

Covariance Matrix

The goal of this step is to understand how the variables in the input data set vary from the mean with respect to each other, that is, to see whether there is any relationship between them. Variables are sometimes so highly correlated that they contain redundant information, so to identify these correlations we compute the covariance matrix.

The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) containing, as entries, the covariances associated with all possible pairs of the initial variables. For example, the covariance matrix for a three-dimensional data set with three variables x, y, and z is a 3 × 3 matrix of this form:

Cov(x,x)  Cov(x,y)  Cov(x,z)
Cov(y,x)  Cov(y,y)  Cov(y,z)
Cov(z,x)  Cov(z,y)  Cov(z,z)

Because a variable's covariance with itself is its variance (Cov(a,a) = Var(a)), the main diagonal (top left to bottom right) holds the variances of each initial variable. And because covariance is commutative (Cov(a,b) = Cov(b,a)), the entries of the covariance matrix are symmetric with respect to the main diagonal, which means the upper and lower triangular portions are equal.

What do the covariances that we have as entries of the matrix tell us about the correlations between the variables?

It is the sign of the covariance that matters: if it is positive, the two variables increase or decrease together (they are correlated); if it is negative, one increases while the other decreases (they are inversely correlated).

Let us proceed to the next step now that we know the covariance matrix is nothing more than a table that summarizes the correlations between all possible pairs of variables.
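Continuing the hypothetical NumPy sketch from the standardization step, the covariance matrix can be obtained in one call:

```python
# Covariance matrix of the standardized data (variables are the columns,
# so rowvar=False tells NumPy to treat each column as one variable).
cov_matrix = np.cov(X_std, rowvar=False)

print(cov_matrix.shape)  # (3, 3): one row/column per variable
print(cov_matrix)        # symmetric; the diagonal holds each variable's variance,
                         # and the sign of each off-diagonal entry shows whether a
                         # pair of variables is correlated or inversely correlated
```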

The eigenvectors and eigenvalues of the covariance matrix

Eigenvectors and eigenvalues are linear algebra concepts that must be computed from the covariance matrix in order to determine the data's principal components. Before we get into the details of these concepts, let's define what we mean by "principal components."

Principal components are new variables constructed as linear combinations or mixtures of the initial variables. These combinations are made in such a way that the new variables (i.e., the principal components) are uncorrelated and most of the information contained in the initial variables is squeezed or compressed into the first components. So, 10-dimensional data gives you ten principal components, but PCA tries to put as much information as possible in the first component, then the maximum of the remaining information in the second, and so on.

It is critical to understand that the principal components are less interpretable and have no real meaning because they are constructed as linear combinations of the initial variables.

Principal components, in geometric terms, are the directions of the data that explain the greatest amount of variance, i.e. the lines that capture the most information in the data. The relationship between variance and information here is that the greater the variance carried by a line, the greater the dispersion of the data points along it, and the greater the dispersion along a line, the more information it contains. To put it simply, think of principal components as new axes that provide the best angle from which to view and evaluate the data, so that the differences between observations are more visible.

This is where the eigenvectors and eigenvalues come in: the eigenvectors of the covariance matrix are precisely the directions of these axes of maximum variance, i.e. the principal components, and the eigenvalues attached to them give the amount of variance carried by each component. Ranking the eigenvectors in descending order of their eigenvalues therefore yields the principal components in order of significance.
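A minimal sketch of this step, still using the hypothetical cov_matrix from the previous example:

```python
# Eigendecomposition of the covariance matrix. np.linalg.eigh is suited to
# symmetric matrices and returns eigenvalues in ascending order, so we
# reorder everything into descending order of eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # column i is the i-th principal direction

# Share of the total variance carried by each principal component.
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio)
```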

Feature vector

As we saw in the previous step, computing the eigenvectors and ordering them in descending order by their eigenvalues enables us to find the principal components in order of significance. In this step, we decide whether to keep all of these components or to discard those with less significance (low eigenvalues), and then combine the remaining ones to form a matrix of vectors known as the Feature vector.

So, the feature vector is simply a matrix whose columns are the eigenvectors of the components we decide to keep. This is the first step toward dimensionality reduction, because if we keep only k of the original p eigenvectors (components), the final data set will have only k dimensions.
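In the running sketch, forming the feature vector is just a matter of slicing off the leading eigenvector columns (keeping two components here is an arbitrary choice for illustration):

```python
# Keep the eigenvectors of the components we decide to retain, stacked as columns.
k = 2                                   # number of components kept (illustrative choice)
feature_vector = eigenvectors[:, :k]    # shape (3, 2): from 3 dimensions down to 2
```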

Recast the data along the principal components axes

Apart from standardization, no changes are made to the data in the preceding steps; you simply select the principal components and form the feature vector, but the input data set always remains in terms of the original axes (i.e., in terms of the initial variables).

The goal of this final step is to use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes to the axes represented by the principal components (hence the name Principal Component Analysis). This is accomplished by multiplying the transpose of the feature vector by the transpose of the standardized original data set.
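In the running sketch, this projection is a single matrix multiplication:

```python
# Reorient the standardized data onto the principal component axes.
# feature_vector.T @ X_std.T mirrors the formulation in the text; transposing
# the result puts observations back on the rows.
final_data = (feature_vector.T @ X_std.T).T
print(final_data.shape)   # (5, 2): 5 observations described by 2 principal components
```

In practice, a library routine such as scikit-learn's sklearn.decomposition.PCA bundles the centering, decomposition, and projection steps; note that it does not standardize the variables by default, so a scaling step is still needed when the initial variables are on very different scales.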


""I'm sure I don't have all of the answers or information about principal component analysis here." I'm hoping you'll share your thoughts with PCA in the comments area. In the comments, I'd love to hear your thoughts on this.” You can follow to this blog to receive notifications of new posts.







