Data Reduction using Principal Component Analysis
This article aims to clarify the basic concepts behind Principal Component Analysis (PCA) and how to use it in advanced mathematical problems drawn from real-life scenarios. PCA is often taught in business analytics courses and is chiefly aimed at professionals working on business intelligence projects involving standardization, clustering algorithms, and data visualization.
Let’s break down the role of PCA, its applications across the various data science domains, and how it relates to other data analysis techniques such as Factor Analysis, Correspondence Analysis, and Generalization.
In a normalized data management life cycle, analysts often have to transform information into relatable or ordered formats, breaking the recorded set of variables down into “principal factors.”
The process of reducing, transforming, or aggregating numerical or textual information into a smaller set of valid, correlated data is called data reduction. Data reduction involves one or more editing operations, and one such technique is Principal Component Analysis, which works with multiple variables within the data set and aims to aggregate these variables into fewer components. It can be performed on raw as well as semi-structured data, where analysts build a covariance matrix and derive a standardized analysis from the total variance of the information. Because PCA analyzes the total variance (or total correlation) in the data set, each variable is treated as if it were measured without error.
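As a concrete illustration, here is a minimal NumPy sketch, on made-up data, of building the covariance matrix of a standardized data set; for standardized variables the covariance matrix coincides with the correlation matrix, and the total variance equals the number of variables:

```python
import numpy as np

# Hypothetical data set: 100 observations of 4 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Standardize each variable to zero mean and unit (sample) variance
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov = np.cov(Z, rowvar=False)    # covariance matrix of the standardized data
total_variance = np.trace(cov)   # sum of the per-variable variances
print(total_variance)            # ~4.0, one unit of variance per variable
```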
Key Terminology Associated with PCA
To fully understand PCA, you should be familiar with the following terms.
Kaiser-Meyer-Olkin Measure of Sampling Adequacy
Also referred to simply as the KMO test, this is one of the most important preliminary checks in PCA.
The formula of the KMO test is written as follows:

$$\mathrm{KMO}_j = \frac{\sum_{i \neq j} r_{ij}^2}{\sum_{i \neq j} r_{ij}^2 + \sum_{i \neq j} u_{ij}^2}$$

where:
$R = [r_{ij}]$ is the correlation matrix,
$U = [u_{ij}]$ is the partial covariance matrix,
$\sum$ = summation notation (“add up”), taken over all pairs $i \neq j$.
The KMO result indicates how well suited the data are to data reduction or data structuring via PCA. It is a basic statistic that ranges between 0 and 1. Analysts prefer a KMO value close to 1, as it generally reflects that the data will be useful; values below 0.5 indicate that data reduction is unlikely to give good results.
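NumPy and SciPy do not ship a KMO function (the third-party factor_analyzer package offers a calculate_kmo helper), but the statistic is simple enough to sketch directly from the formula above; this sketch assumes the partial correlations are obtained from the inverse of the correlation matrix:

```python
import numpy as np

def kmo(X):
    """Overall KMO measure of sampling adequacy.
    X: rows are observations, columns are variables."""
    R = np.corrcoef(X, rowvar=False)       # correlation matrix
    R_inv = np.linalg.inv(R)
    # Partial correlations from the inverse correlation matrix
    d = np.sqrt(np.diag(R_inv))
    U = -R_inv / np.outer(d, d)
    off_diag = ~np.eye(R.shape[0], dtype=bool)
    r2 = np.sum(R[off_diag] ** 2)          # squared correlations
    u2 = np.sum(U[off_diag] ** 2)          # squared partial correlations
    return r2 / (r2 + u2)                  # closer to 1 = better suited to PCA
```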
Bartlett’s Test of Sphericity
Simply referred to as Bartlett’s Test, this is a null hypothesis test of whether the correlation matrix is an identity matrix, i.e., whether the variables are completely uncorrelated. Together with KMO, it is used to judge whether data reduction is appropriate. If the significance value (p-value) is below 0.05, the variables are correlated enough for data reduction with PCA or Factor Analysis to be worthwhile; higher values suggest the data are a poor candidate for reduction.
KMO and Bartlett’s Test are often reported together in PCA as a minimum standard that the data must pass before proceeding with either PCA or Factor Analysis.
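Note that scipy.stats.bartlett tests a different hypothesis (equal variances across samples), so the test of sphericity is sketched here from its standard chi-square statistic; the function name is our own:

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(X):
    """Bartlett's test of sphericity.
    H0: the correlation matrix is an identity matrix."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # Chi-square statistic with p(p - 1)/2 degrees of freedom
    chi_sq = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    dof = p * (p - 1) / 2
    p_value = stats.chi2.sf(chi_sq, dof)   # p < 0.05 favors data reduction
    return chi_sq, p_value
```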
Extraction
Extraction is usually studied in tandem with communalities and initial values. A variable’s extraction communality measures the proportion of its variance that is explained by the retained principal components. Higher extraction values indicate that a variable is well represented by the components; lower values indicate that it is underrepresented.
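A short scikit-learn sketch on hypothetical, standardized data: the loadings are the component directions scaled by the square roots of their eigenvalues, and each variable’s sum of squared loadings is its extraction communality:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data set: 100 observations of 5 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

Z = StandardScaler().fit_transform(X)    # PCA on the correlation matrix
pca = PCA(n_components=2).fit(Z)

# Loadings: eigenvectors scaled by the square roots of the eigenvalues
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
# Extraction communalities: variance of each variable explained
# by the two retained components
communalities = (loadings ** 2).sum(axis=1)
print(communalities)                     # values near 1 = well represented
```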
Eigenvalues
What do Eigenvalues tell us?
Well, these are the workhorses of PCA. Analysts use eigenvalues to determine how much variance the data carry in a particular direction. According to Marcus and Minc, eigenvalues are the special set of scalars associated with a matrix equation, also known as characteristic or latent roots. In PCA, the eigenvalues of the correlation matrix are the variances of the principal components.
The variance accounted for by each principal component can be reported either as a total or as a percentage of the overall variance. PCA output typically also includes the cumulative percentage of variance, which adds each component’s share to those of the preceding components. This running total shows how much of the data set’s variance is captured as components are retained, and it reaches 100% once all components are kept, since PCA accounts for the total variance with no residual error.
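The percentage and cumulative percentage of variance follow directly from the eigenvalues of the correlation matrix, as in this small NumPy sketch on made-up data:

```python
import numpy as np

# Hypothetical data set: 200 observations of 4 variables
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))

R = np.corrcoef(X, rowvar=False)             # correlation matrix
eigenvalues = np.linalg.eigvalsh(R)[::-1]    # sorted largest first

pct = 100 * eigenvalues / eigenvalues.sum()  # % of variance per component
cumulative = np.cumsum(pct)                  # running total, ends at 100%
print(pct.round(1), cumulative.round(1))
```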
How to Calculate PCA?
Once the KMO and Bartlett’s checks pass, PCA itself can be computed either by hand with NumPy or with a library. With NumPy, analysts derive the principal components from the eigendecomposition of the covariance matrix of the centered data. Alternatively, you can use PCA() from the scikit-learn library, where the eigenvalues are exposed through the explained_variance_ attribute and the component directions (eigenvectors) through components_.
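Here is a self-contained sketch of both routes on hypothetical data, a manual eigendecomposition in NumPy alongside scikit-learn’s PCA(), with a check that the two sets of eigenvalues agree:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set: 150 observations of 3 variables
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 3))

# --- Manual PCA with NumPy ---
Xc = X - X.mean(axis=0)                    # center the data
cov = np.cov(Xc, rowvar=False)             # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigendecomposition (ascending)
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = Xc @ eigvecs                      # data projected onto the components

# --- PCA with scikit-learn ---
pca = PCA().fit(X)
print(np.allclose(pca.explained_variance_, eigvals))   # True: eigenvalues match
```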