{"id":9523,"date":"2026-04-18T12:00:00","date_gmt":"2026-04-18T12:00:00","guid":{"rendered":"https:\/\/www.askpython.com\/?p=9523"},"modified":"2026-04-18T03:39:46","modified_gmt":"2026-04-18T03:39:46","slug":"principal-component-analysis","status":"publish","type":"post","link":"https:\/\/www.askpython.com\/python\/examples\/principal-component-analysis","title":{"rendered":"Principal Component Analysis from Scratch in Python"},"content":{"rendered":"<p>Principal component analysis, or PCA for short, is a well-known dimensionality reduction technique.<\/p>\n<p>It has been around since 1901 and is still used as a predominant dimensionality reduction method in machine learning and statistics. PCA is an unsupervised statistical method.<\/p>\n<p>In this article, we will build some intuition about PCA and implement it from scratch using Python and NumPy.<\/p>\n<h2 class=\"wp-block-heading\">Why use PCA in the first place?<\/h2>\n<p>To motivate the use of PCA, let\u2019s look at an example.<\/p>\n<p><strong>Suppose we have a dataset<\/strong> with two variables and 10 data points. If we were asked to visualize the data points, we could do it very easily. 
The result is very interpretable as well.<\/p>\n<figure class=\"wp-block-table\">\n<table>\n<tbody>\n<tr>\n<td>X1<\/td>\n<td>2<\/td>\n<td>8<\/td>\n<td>1<\/td>\n<td>4<\/td>\n<td>22<\/td>\n<td>15<\/td>\n<td>25<\/td>\n<td>29<\/td>\n<td>4<\/td>\n<td>2<\/td>\n<\/tr>\n<tr>\n<td>X2<\/td>\n<td>3<\/td>\n<td>6<\/td>\n<td>2<\/td>\n<td>6<\/td>\n<td>18<\/td>\n<td>16<\/td>\n<td>20<\/td>\n<td>23<\/td>\n<td>6<\/td>\n<td>4<\/td>\n<\/tr>\n<\/tbody>\n<\/table><figcaption class=\"wp-element-caption\">Example Data points<\/figcaption><\/figure>\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"550\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/Plotting-data-on-Two-Dimensions.jpeg\" alt=\"Plotting Data On Two Dimensions\" class=\"wp-image-9525\" style=\"width:400px;height:275px\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/Plotting-data-on-Two-Dimensions.jpeg 800w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/Plotting-data-on-Two-Dimensions-300x206.jpeg 300w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/Plotting-data-on-Two-Dimensions-768x528.jpeg 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><figcaption class=\"wp-element-caption\"><strong>Plotting Data On Two Dimensions<\/strong><\/figcaption><\/figure>\n<p>However, as we increase the number of variables, it becomes almost impossible for us to imagine a dimension higher than three.\u00a0<\/p>\n<p>This problem we face when analyzing higher-dimensional datasets is what is commonly referred to as \u201c<strong><em>The curse of dimensionality<\/em><\/strong>\u201d. This term was first coined by\u00a0Richard E. Bellman.<\/p>\n<p>Principal component analysis reduces high-dimensional data to fewer dimensions while capturing as much of the dataset\u2019s variability as possible. Data visualization is the most common application of PCA. 
PCA is also used to speed up the training of an algorithm by reducing the number of dimensions of the data.<\/p>\n<h2 class=\"wp-block-heading\">Implementation of PCA with Python<\/h2>\n<p>To get the most out of the content below, you should know a little bit about linear algebra and <a aria-label=\" (opens in a new tab)\" class=\"rank-math-link\" href=\"https:\/\/www.askpython.com\/python\/python-matrix-tutorial\" target=\"_blank\" rel=\"noreferrer noopener\">matrices<\/a>. If not, we highly encourage you to watch the <a aria-label=\" (opens in a new tab)\" class=\"rank-math-link\" href=\"https:\/\/www.youtube.com\/watch?v=fNk_zzaMoSs\" target=\"_blank\" rel=\"noreferrer noopener\">Linear algebra series of 3Blue1Brown<\/a> on YouTube by Grant Sanderson to refresh the concepts; it will prove very beneficial in your machine learning journey ahead.<\/p>\n<p>We can think of principal component analysis as fitting an n-dimensional ellipsoid to the data so that each axis of the ellipsoid represents a principal component. 
The longer a principal component axis, the more variability in the data it represents.<\/p>\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"576\" height=\"396\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/Fitting-an-ellipse-to-data.jpg\" alt=\"Fitting An Ellipse To Data\" class=\"wp-image-9529\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/Fitting-an-ellipse-to-data.jpg 576w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/Fitting-an-ellipse-to-data-300x206.jpg 300w\" sizes=\"auto, (max-width: 576px) 100vw, 576px\" \/><figcaption class=\"wp-element-caption\">Example: <strong>Fitting An Ellipse To Data<\/strong><\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Steps to implement PCA in Python<\/h2>\n<pre><code>#Importing required libraries\nimport numpy as np<\/code><\/pre>\n<h3 class=\"wp-block-heading\">1. Subtract the mean of each variable<\/h3>\n<p>Subtract the mean of each variable from the dataset so that the dataset is centered on the origin. Doing this proves to be very helpful when calculating the covariance matrix.<\/p>\n<pre><code>#Generate a dummy dataset.\nX = np.random.randint(10,50,100).reshape(20,5) \n# Mean-center the data\nX_meaned = X - np.mean(X , axis = 0)<\/code><\/pre>\n<p>The data generated by the code above has dimensions (20, 5), i.e. 20 examples with 5 variables each. We calculated the mean of each variable and subtracted it from every value in the respective column.<\/p>\n<h3 class=\"wp-block-heading\">2. Calculate the Covariance Matrix<\/h3>\n<p>Calculate the Covariance Matrix of the mean-centered data. 
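<\/p>
<p>As a quick sanity check, the snippet below (our own illustration with made-up random data, not part of the original tutorial) confirms that with <code>rowvar = False<\/code> the covariance matrix has one row and column per variable, and that its diagonal holds each variable's variance:<\/p>

```python
import numpy as np

# Made-up example data; any (20, 5) matrix would do.
rng = np.random.default_rng(0)
X = rng.integers(10, 50, size=(20, 5)).astype(float)
X_meaned = X - X.mean(axis=0)

# rowvar=False treats each column as a variable.
cov_mat = np.cov(X_meaned, rowvar=False)

print(cov_mat.shape)  # (5, 5): one row/column per variable
# The diagonal equals the per-column sample variances (ddof=1, matching np.cov).
print(np.allclose(np.diag(cov_mat), X.var(axis=0, ddof=1)))  # True
```

<p>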
You can learn more about the covariance matrix in this informative Wikipedia article <a aria-label=\"here (opens in a new tab)\" rel=\"noreferrer noopener\" href=\"https:\/\/en.wikipedia.org\/wiki\/Covariance_matrix\" target=\"_blank\" class=\"rank-math-link\">here<\/a>. <\/p>\n<p>The covariance matrix is a square matrix giving the covariance between each pair of variables. The covariance of a variable with itself is simply its variance, which is the square of the standard deviation.<\/p>\n<p>That\u2019s why the diagonal elements of a covariance matrix are just the variances of the variables.<\/p>\n<pre><code># calculating the covariance matrix of the mean-centered data.\ncov_mat = np.cov(X_meaned , rowvar = False)<\/code><\/pre>\n<p>We can easily calculate the covariance matrix using the <code>numpy.cov( )<\/code> method. The default value for <code>rowvar<\/code> is <code>True<\/code>; remember to set it to <code>False<\/code> so that columns are treated as variables and the covariance matrix has the required dimensions.<\/p>\n<h3 class=\"wp-block-heading\">3. Compute the Eigenvalues and Eigenvectors<\/h3>\n<p>Now, compute the eigenvalues and eigenvectors of the calculated covariance matrix. The eigenvectors of the covariance matrix are orthogonal to each other, and each vector represents a new axis called a principal axis.<\/p>\n<p>A higher eigenvalue corresponds to greater variability in the corresponding direction. Hence, the principal axis with the highest eigenvalue captures the most variability in the data.<\/p>\n<p>Orthogonal means the vectors are mutually perpendicular. Eigenvalues and eigenvectors can seem intimidating until we grasp the ideas behind them. 
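<\/p>
<p>To make this less abstract, here is a tiny self-contained example (the 2x2 matrix is our own, chosen purely for illustration). It shows that the eigenvectors returned by <code>np.linalg.eigh<\/code> are mutually orthogonal, and that the eigenvalues come back in ascending order, which is why we sort them in descending order shortly:<\/p>

```python
import numpy as np

# A small symmetric matrix (illustrative, not from the tutorial).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigen_values, eigen_vectors = np.linalg.eigh(A)

print(eigen_values)  # [1. 3.] -- eigh returns eigenvalues in ascending order
# The columns of eigen_vectors are orthonormal (mutually perpendicular, unit length).
print(np.allclose(eigen_vectors.T @ eigen_vectors, np.eye(2)))  # True
```

<p>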
<\/p>\n<pre><code>#Calculating Eigenvalues and Eigenvectors of the covariance matrix\neigen_values , eigen_vectors = np.linalg.eigh(cov_mat)<\/code><\/pre>\n<p>The NumPy <code>linalg.eigh( )<\/code> method returns the eigenvalues and eigenvectors of a complex Hermitian or a real symmetric matrix. It returns the eigenvalues in ascending order, which is why the next step sorts them.<\/p>\n<h3 class=\"wp-block-heading\">4. Sort Eigenvalues in descending order<\/h3>\n<p><strong>Sort the eigenvalues in descending order along with their corresponding eigenvectors.<\/strong><\/p>\n<p>Remember, each column in the <strong>eigenvector matrix<\/strong> corresponds to a principal component, so arranging the columns in descending order of their eigenvalues will automatically arrange the principal components in descending order of their variability.<\/p>\n<p>Hence the first column in our rearranged eigenvector matrix will be the first principal component, capturing the highest variability.<\/p>\n<pre><code>#sort the eigenvalues in descending order\nsorted_index = np.argsort(eigen_values)[::-1]\n\nsorted_eigenvalue = eigen_values[sorted_index]\n#similarly sort the eigenvectors \nsorted_eigenvectors = eigen_vectors[:,sorted_index]<\/code><\/pre>\n<p><code>np.argsort<\/code> returns an array of indices that would sort the array in ascending order; slicing with <code>[::-1]<\/code> reverses it to give descending order.<\/p>\n<h3 class=\"wp-block-heading\">5. Select a subset from the rearranged eigenvector matrix<\/h3>\n<p>Select a subset from the rearranged eigenvector matrix as per our need, i.e. n_components = 2. This means we selected the first two principal components. <\/p>\n<pre><code># select the first n eigenvectors, n is the desired dimension\n# of our final reduced data.\n\nn_components = 2 #you can select any number of components.\neigenvector_subset = sorted_eigenvectors[:,0:n_components]<\/code><\/pre>\n<p>n_components = 2 means our final data should be reduced to just 2 variables. If we change it to 3, our data is reduced to 3 variables.<\/p>\n<h3 class=\"wp-block-heading\">6. 
Transform the data<\/h3>\n<p>Finally, perform a linear transformation of the data by taking the dot product of the transpose of the eigenvector subset and the transpose of the mean-centered data. Transposing the result of this dot product gives us the data reduced from the higher-dimensional space to fewer dimensions.<\/p>\n<pre><code>#Transform the data \nX_reduced = np.dot(eigenvector_subset.transpose(),X_meaned.transpose()).transpose()<\/code><\/pre>\n<p>The final dimensions of X_reduced will be (20, 2), whereas the original data had dimensions (20, 5).<\/p>\n<p>Now we can visualize the reduced data with the tools we already have. Hurray! Mission accomplished.<\/p>\n<h2 class=\"wp-block-heading\">Complete Code for Principal Component Analysis in Python<\/h2>\n<p>Now, let\u2019s combine everything above into a single function and try our principal component analysis from scratch on an example.<\/p>\n<pre><code>import numpy as np\n\ndef PCA(X , num_components):\n    \n    #Step-1\n    X_meaned = X - np.mean(X , axis = 0)\n    \n    #Step-2\n    cov_mat = np.cov(X_meaned , rowvar = False)\n    \n    #Step-3\n    eigen_values , eigen_vectors = np.linalg.eigh(cov_mat)\n    \n    #Step-4\n    sorted_index = np.argsort(eigen_values)[::-1]\n    sorted_eigenvalue = eigen_values[sorted_index]\n    sorted_eigenvectors = eigen_vectors[:,sorted_index]\n    \n    #Step-5\n    eigenvector_subset = sorted_eigenvectors[:,0:num_components]\n    \n    #Step-6\n    X_reduced = np.dot(eigenvector_subset.transpose() , X_meaned.transpose() ).transpose()\n    \n    return X_reduced<\/code><\/pre>\n<p>We defined a function implementing the PCA algorithm that accepts a data matrix and the number of components as input arguments.<\/p>\n<p>We&#8217;ll use the <a class=\"rank-math-link\" href=\"https:\/\/archive.ics.uci.edu\/ml\/machine-learning-databases\/iris\/iris.data\" target=\"_blank\" rel=\"noopener\">IRIS dataset<\/a> as our sample dataset and apply our PCA 
function to it.<\/p>\n<pre><code>import pandas as pd\n\n#Get the IRIS dataset\nurl = \"https:\/\/archive.ics.uci.edu\/ml\/machine-learning-databases\/iris\/iris.data\"\ndata = pd.read_csv(url, names=['sepal length','sepal width','petal length','petal width','target'])\n\n#prepare the data\nx = data.iloc[:,0:4]\n\n#prepare the target\ntarget = data.iloc[:,4]\n\n#Applying it to PCA function\nmat_reduced = PCA(x , 2)\n\n#Creating a Pandas DataFrame of reduced Dataset\nprincipal_df = pd.DataFrame(mat_reduced , columns = ['PC1','PC2'])\n\n#Concat it with target variable to create a complete Dataset\nprincipal_df = pd.concat([principal_df , pd.DataFrame(target)] , axis = 1)<\/code><\/pre>\n<p><strong>Important Tip:<\/strong> Standardizing the data is often crucial before applying an ML algorithm. In the code above, we only mean-centered the data inside the PCA function; if your features are on very different scales, also divide each column by its standard deviation before applying PCA.<\/p>\n<p>Let&#8217;s plot our results using the <a href=\"https:\/\/www.askpython.com\/python-modules\/python-seaborn-tutorial\" class=\"rank-math-link\">seaborn<\/a> and <a href=\"https:\/\/www.askpython.com\/python-modules\/matplotlib\/python-matplotlib\" class=\"rank-math-link\">matplotlib<\/a> libraries.<\/p>\n<pre><code>import seaborn as sb\nimport matplotlib.pyplot as plt\n\nplt.figure(figsize = (6,6))\nsb.scatterplot(data = principal_df , x = 'PC1',y = 'PC2' , hue = 'target' , s = 60 , palette= 'icefire')<\/code><\/pre>\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"768\" src=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/Reduced-dimension-plot-1024x768.jpeg\" alt=\"Reduced Dimension Plot\" class=\"wp-image-9567\" srcset=\"https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/Reduced-dimension-plot-1024x768.jpeg 1024w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/Reduced-dimension-plot-300x225.jpeg 300w, 
https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/Reduced-dimension-plot-768x576.jpeg 768w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/Reduced-dimension-plot-1536x1152.jpeg 1536w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/Reduced-dimension-plot-2048x1536.jpeg 2048w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/Reduced-dimension-plot-160x120.jpeg 160w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/Reduced-dimension-plot-320x240.jpeg 320w, https:\/\/www.askpython.com\/wp-content\/uploads\/2020\/10\/Reduced-dimension-plot-1600x1200.jpeg 1600w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><strong>Reduced Dimension Plot<\/strong><\/figcaption><\/figure>\n<p>That&#8217;s it! It worked perfectly. <\/p>\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n<p>In this article, we learned about PCA, how PCA works, and implemented PCA using <a class=\"rank-math-link\" href=\"https:\/\/www.askpython.com\/python-modules\/numpy\/python-numpy-module\">NumPy<\/a>. Happy learning!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Principal component analysis, or PCA in short, is famously known as a dimensionality reduction technique. It has been around since 1901 and is still used as a predominant dimensionality reduction method in machine learning and statistics. PCA is an unsupervised statistical method. 
In this article, we will have some intuition about PCA and will implement [&hellip;]<\/p>\n","protected":false},"author":6,"featured_media":65273,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"class_list":["post-9523","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-examples"],"blocksy_meta":[],"_links":{"self":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/9523","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/comments?post=9523"}],"version-history":[{"count":0,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/posts\/9523\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media\/65273"}],"wp:attachment":[{"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/media?parent=9523"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/categories?post=9523"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.askpython.com\/wp-json\/wp\/v2\/tags?post=9523"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}