Machine Learning

Unsupervised Learning

Dimensionality Reduction

Principal Component Analysis (PCA)


 Shows a lot of letters on a brick wall where the letters which are existing in the words Principal Component Analysis are bold and connected with lines. This should illustrate the Pricnipal Component Analysis in such a way that only the important letters are kind of extracted from a bunch of different letters. By omitting all non bold printed letters, we still can form the word Principal Component Analysis, that is by omitting the other letters, we reduce the size of the 'letter salad' without loosing too much information.

What is Principal Component Analysis


When we perform Principal Component Analysis (PCA) we want to find the principal components of a dataset. Surprising isn't it? Well, what are the principal components of a dataset and why do we want to find them, and what do they tell us? The principal components of a dataset are the "directions" in a dataset which hold the most variation (I assume that you have a basic understanding of the term variance. If not, look it up [here](https://www.mathsisfun.com/data/standard-deviation.html)). In simplified terms, the first principal component of a dataset is the direction along the dataset with the highest variation.

Consider the following dataset onto which I have drawn different "directions" shown by the differently colored arrows. What do you think, which arrow points into the direction with the largest variance of the dataset?
 Illustrates three differently colored arrows within a datset. The arrows should illustrate kind of principal components of the dataset. The orange arrow points into the direction with the largest variance
Well, by the naked eye we see that the orange arrow probably points into the direction with the largest variance.
Ok, but why do we need this direction(s)?
We want to have this direction (the direction with the largest variance) because in the future we want to use the principal components of the dataset to reduce dimensionality of our dataset the either make it "plottable" by reducing it to three or less dimensions, or simply to reduce the size of the dataset without loosing too much of the information. Reducing the dimensionality of our dataset is like creating new columns by combining columns such that the number of the new==combined columns is less than the original number of columns.
Imagine a dataset with only two columns A and B, then this dataset is said to be two dimensional. If we now combine these two columns to one column for instance by simply adding column one and two, the dataset is reduced to one dimension. To decide which columns should be combined and how we should combine them is kind of the goal of the PCA. Mind that this illustration is not 100% correct since the goal of the PCA is to transform the data and not simply cutting something off or combining something but for the first step this illustrations should to it. That is, we want to decrease the size of our dataset to make life easier for the algorithms or to simply visualize the data by making it 2 or 3 dimensional. But wait, I said decrease the size of the dataset, that is kind of "loosing something", correct? Correct! By reducing the dimensionality of a dataset we loose dimensions, that is, we loose information. Imagine a 3D movie where we remove the third dimension such that the remaining movie is two dimensional. We still can watch the film but we have lost some information. The question we have to find an answer to is: Which are the dimensions which held the most information of the dataset and which are the dimensions which held only little information - and therewith can be cut off without loosing too much information -. Finding these dimensions (the principal components) and transforming the dataset to a lower dimensional dataset using these principal components is the task of the PCA. As said, in the end we use the found and chosen principal component to transform our dataset, that is, projecting our dataset (the projection is done with matrix multiplication) using these principal components. By doing this, we get a dataset with reduced dimensionality (that is reduced size) without loosing too much information -hopefully-.
Ok, to proceed and for the understanding we have to go a small step back. We want to find the principal components because these are the "directions" of the dataset with the highest variance. You ask yourself: Why highest variance? Well, it turns out that the directions with the highest variance (principal components) are the most informative directions. Let's make this clear using a little graphical illustration:
 Illustrates three differently colored triangles and balls where the triangles are all assigned with the same weight (10kg) and the balls have different weights 10kg, 15kg and 20kg respectively
Since the values assigned to the triangles (let's assume the unit are kg) are all the same, the variance is 0 whereas the variance of the balls is $16.66 \ \ kg^{2}$
Now let's further assume someone chooses one ball and one triangle, tells you the assigned weight and wants you to make a prediction about the color. Whatever triangle the person chooses, the weight is always 10kg and therewith you have no chance to correctly predict the color of the triangle based on the kg number. Though, the weights of the balls differ (they have a higher variance as the triangles) and whatever weight the person will tell you, you can predict the color based on the number. To make this even more clear, assume in the next step the person wants you to do the same thing but now he or she does not tell you the exact number but only a number which is close to one of the above. For instance, 11kg. Based on this number you can predict the color of the ball as blue since 11 is closer to 10 than to 15. Hence, the farther away the assigned weights are, that is, the higher the variance is, the easier it is for you to predict the color. Please take the above as a principal idea why we can use the variance as a measure of informativeness and do not claim a 100% mathematical correctness.
Ok, now we have understood why we want to have the directions with the highest variance (principal components). But on our way the main question is still unanswered. *How do we get these principal components?* We get these principal components by finding the directions with the highest variance. Wise guy... we already know this. This sentence contains two important words: directions and variance - Finding the *direction* with the highest *variance* -. Well, we can do exactly the same as I have done above and simply draw an arbitrary line into the dataset. To know how good or bad this line is, we have to measure the variance of the data along this line. Now we know that the formula for the variance (of the population) is: $$var(x)= \frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n}$$ But here x is one dimensional and our dataset has two dimensions x and y, hence: Which of them should we use as $x$ in the variance calculation? Should we calculate the variance of x or y? The answer is: None of them is correct. Why? Look at the following illustration:
 Illustrates the directions x and y
In this picture, I have drawn two arrows, one points into the direction of the x axis and one into the direction of the y axis. What happens if I calculate the variance along these arrows? Well, I calculate the variance along the feature x and along the feature y. The dataset looks like:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use("fivethirtyeight")
import numpy as np
data = np.array([[2.5,0.5,2.2,1.9,3.1,2.3,2,1,1.5,1.1],[2.4,0.7,2.9,2.2,3.0,2.7,1.6,1.1,1.6,0.9]])
print(data)
fig = plt.figure()
ax0 = fig.add_subplot(111)
ax0.scatter(data[0],data[1])
plt.show()
[[2.5 0.5 2.2 1.9 3.1 2.3 2.  1.  1.5 1.1]
 [2.4 0.7 2.9 2.2 3.  2.7 1.6 1.1 1.6 0.9]]
<Figure size 640x480 with 1 Axes>
where the fist list represents the x feature and the second list represents the y feature. Consider the code below. What happens if we calculate the variance of the dataset along the *x-arrow* and along the *y-arrow*? We calculate the variance along the feature x and y! We implicitly do this by ignoring the other dimension (feature) x or y respectively. That is, by ignoring x or y, we kind of project the data onto the x or y axis and therewith reduce the dimensionality, that is cut one dimension off.
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use("fivethirtyeight")
import numpy as np
data = np.array([[2.5,0.5,2.2,1.9,3.1,2.3,2,1,1.5,1.1],[2.4,0.7,2.9,2.2,3.0,2.7,1.6,1.1,1.6,0.9]])
print(data)
fig = plt.figure()
ax0 = fig.add_subplot(111)
ax0.scatter(data[0],data[1])
ax0.scatter(data[0],np.ones_like(data[1])*min(data[1])-0.2,color="red")
ax0.scatter(np.ones_like(data[0])*min(data[0])-0.2,data[1],color="blue")
ax0.arrow(min(data[0])-0.2,min(data[1])-0.2,0,max(data[1])-0.5,width=0.01,color="blue",alpha=0.4,length_includes_head="True")
ax0.arrow(min(data[0])-0.2,min(data[1])-0.2,max(data[0])-0.3,0,width=0.01,color="red",alpha=0.4,length_includes_head="True")
ax0.vlines(data[0],min(data[1])-0.2,data[1],colors="red",linestyles="--",linewidth=0.7)
ax0.hlines(data[1],min(data[0])-0.2,data[0],colors="blue",linestyles="--",linewidth=0.7)
plt.show()
[[2.5 0.5 2.2 1.9 3.1 2.3 2.  1.  1.5 1.1]
 [2.4 0.7 2.9 2.2 3.  2.7 1.6 1.1 1.6 0.9]]
Now you can see the effect of choosing the x and y axis as our principal components. We project the data onto the x and y axis. We can now calculate the variance of the projected data and see how good or bad we are. If you are confused how we can accomplish this transformation, keep calm, we come back to this later.
So now we have chosen the x and the y axis as our principal components. But as stated above, in that case this is most likely not correct because we have seen that the skewed (green) line from bottom left to top right is the line spanned by the vector which points into the direction of the highest variation == 1. principal component (at this point I have to mention that a dataset has as many principal components as it has dimensions but the first principal component is the "strongest"). So let's first of all assume, that this skewed green line is actually the first principal component of the dataset, that is, the vector which points into the direction of highest variation. How does this look like if we choose arbitrary vectors and do exactly the same thing as we have done taking the x and y direction as our assumed principal components (Mind that we want to project the data onto the line because we want to calculate the variation). Mind that since we now do actual calculations, we have to normalize the data to zero mean since otherwise the calculations fail.
We will now plot the dataset and choose arbitrary vectors whose values we can alter using sliders. We project the dataset onto the line spanned by the vector (defined by the slider values). We transform the dataset using the chosen vector and calculate the resulting variance. The point (the slider adjustment) which results in the largest variance gives us our first principal component. Additionally, we plot the "variance surface" with respect to the values we choose for the vector. So to summarize, by altering the direction of the line we want to find this line which leads to the highest variance when the dataset is projected onto this line. This line is the 1. principal component. If we choose lines parallel to the x and y axes, we simply cut off the other axis.
#%matplotlib notebook
import numpy as np
import pandas as pd
from ipywidgets import interact,interactive,fixed,interact_manual
import matplotlib.pyplot as plt
from matplotlib import style
from mpl_toolkits.mplot3d import Axes3D
style.use('fivethirtyeight')
def f(x,y):
    
    data = np.array([[2.5,0.5,2.2,1.9,3.1,2.3,2,1,1.5,1.1],[2.4,0.7,2.9,2.2,3.0,2.7,1.6,1.1,1.6,0.9]])
    data[0] = data[0]-np.mean(data[0])
    data[1] = data[1]-np.mean(data[1])
 
    # Create Axes
    fig = plt.figure(figsize=(10,10))
    ax0 = fig.add_subplot(121)
    ax0.set_aspect('equal')
    ax0.set_ylim(-2,2)
    ax0.set_xlim(-2,2)
    ax0.set_title('Search for Principal component',fontsize=14)
    ax0.set_xlabel('PC x value',fontsize=10)
    ax0.set_ylabel('PC y value',fontsize=10)
    
    #vec = np.array([0.6778734,0.73517866])
    vec = np.array([x,y])
    ax0.scatter(data[0],data[1])
    ax0.plot(np.linspace(min(data[0]),max(data[0])),(vec[1]/vec[0])*np.linspace(min(data[0]),max(data[0])),linewidth=1.5,color="black",linestyle="--")
    b_on_vec_list = [[],[]]
    for i in range(len(data[0])):
        a = vec
        b = np.array([data[0][i],data[1][i]])
        b_on_a = (np.dot(a,b)/np.dot(a,a))*a
        b_on_vec_list[0].append(b_on_a[0])
        b_on_vec_list[1].append(b_on_a[1])
        ax0.scatter(b_on_a[0],b_on_a[1],color='red')
        ax0.plot([b_on_a[0],b[0]],[b_on_a[1],b[1]],"r--",linewidth=1)
    ax1 = fig.add_subplot(122,projection='3d')
    ax1.set_aspect('equal')
    ax1.set_ylim(0,1)
    ax1.set_xlim(0,1)
    ax1.set_title('Varicane with respect to the 1. PC',fontsize=14)
    ax1.set_xlabel('PC x value',fontsize=10)
    ax1.set_ylabel('PC y value',fontsize=10)
    ax1.set_zlabel('variance',fontsize=10)
    
    # Transform data
    e_vec = (1/np.sqrt(np.dot(vec,vec.T)))*vec
    data_trans = np.dot(data.T,e_vec)
    
    # Plot the data
    ax0.scatter(data_trans,np.zeros_like(data_trans),c='None',edgecolor='black')
    # Plot the twisted line
    ax0.plot(np.linspace(min(data_trans),max(data_trans),10),np.zeros_like(data_trans),linestyle='--',color='grey',linewidth=1.5)
    # Plot the circles
    for i in range(len(data_trans)):
        ax0.add_artist(plt.Circle((0,0),data_trans[i],linewidth=0.5,linestyle='dashed',color='grey',fill=False))
    # Calculate the variance    
    ax0.text(0,-1.4,'variance= {0}'.format(str(np.round(np.var(data_trans),3))),fontsize=20)
    # Plot the variance with respect to the principal component vector
    # Initialize the meshgrid
    cross_x,cross_y =np.meshgrid(np.linspace(0.001,1,num=20),np.linspace(0.001,1,num=20))
    # Create the iterators in the format [(0.01,0.01),(0.01,0.0620),(0.01,0.114),...(0.0620,0.01),(0.0620,0.0620),(0.0620,0.1141),...(0.999,0.01),(0.999,0.0620),...(0.999,0.999)]
    x_y_pairs = []
    for i in range(len(cross_y)):
        x_y_pairs.append(list(zip(cross_x[i],cross_y[i])))
    flatten_x_y_pairs = [np.array(list(x_y)) for sublist in x_y_pairs for x_y in sublist]
    variances = []
    for i in flatten_x_y_pairs:
        e_vec = (1/np.sqrt(np.dot(i,i.T)))*i
        data_trans = np.dot(data.T,e_vec.T)
        variances.append(np.var(data_trans))
    
        
    index_of_max_variance = np.where(variances == max(variances))[0][0]
    # PLot the variance surface
    ax1.scatter(cross_x,cross_y,np.array(variances).reshape(20,20),alpha=0.8)
    # Mark the point with the highest variance
    vec_point = np.array([x,y])
    e_vec_point = (1/np.sqrt(np.dot(vec_point,vec_point.T)))*vec_point
    data_trans_point = np.dot(data.T,e_vec_point.T)
    ax1.scatter(x,y,np.var(data_trans_point)+0.01,color="orange",s=100)
    
    
    
    
    plt.show()
    
    
interact(f,x=(0.001,1,0.001),y=(0.001,1,0.001))
This gets us the following output:
<function __main__.f>
By playing around with the sliders we can see that slider values of [0.67,0.73] lead to the highest variance. Mind also the gray dashed circles as well as the gray scatter dots on the x axis. If we project the dataset onto the line spanned by the chosen vector and kind of twist around this line such that it aligns with our original x axis, we make the spanned line our new x axis. While altering the vector values we can now observe the spread of these values on the x axis, the more spread out the values are, the higher the variance. This projecting of the dataset and twisting of the spanned line is accomplished by transforming the original dataset using the chosen vector.
 Illustrates the directions x and y But wait, you said projected and transformed... How can we project data? Well, therefore you must be familiar with linear algebra. If you are, fine. If not, [this](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab) youtube playlist about the essence of linear algebra might be very useful. So what we are doing is simply calculating the *dot product* (sometimes it is also called scalar product) of our dataset and the chosen vector. So now for simplicity, assume we use an arbitrary dataset and want to project this dataset onto the y axis. To project the dataset we need two things. First, the dataset of shape $nxn$ and a vector of shape $nxm$. We know that after calculating the dot product, the resulting dataset has the dimensionality $nxm$ and hence if $m < n$, we have reduced the dimensionality of our original dataset. Now lets turn this into practice.
Assume we have a $10x2$ dataset and we want to project this dataset onto the line spanned by the vector pointing into the direction of the y axis (This is the same as we have done above). This vector is the unit vector [0,1] which has dimensionality $1x2$. What we want to have is the dataset projected onto the y axis and therewith a dataset with the dimensionality $10x1$. So to accomplish that, we have to calculate the dot product of:
$data*vec^{T}$ where $vec^{T}$ is the transposed unit vector and therewith has no longer the shape $1x2$ but $2x1$ and therewith the resulting dataset has the shape $10x1$. If we now do the described calculations in practice, this looks like:
import numpy as np
data = np.array([[2.5,0.5,2.2,1.9,3.1,2.3,2,1,1.5,1.1],[2.4,0.7,2.9,2.2,3.0,2.7,1.6,1.1,1.6,0.9]])
ex = np.array([[1,0]])
ey = np.array([[0,1]])
print(data.shape)
print(ex.shape)
print(np.dot(data.T,ey.T)) #  As you can see this is exactly the y component (x dimension) of our dataset
(2, 10)
(1, 2)
[[2.4]
 [0.7]
 [2.9]
 [2.2]
 [3. ]
 [2.7]
 [1.6]
 [1.1]
 [1.6]
 [0.9]]
As you can see, the result is the same as the y values of the original dataset. This is logical since by transforming the data onto the y axis (look at the plot above) we just omit the x values of the dataset. Now assume we do not use the vector [0,1] which is the unit vector pointing into the direction of the y axis but any other arbitrary vector like for instance [0.653,1.2] what happens? Well the calculation are exactly the same: we calculate the dot product of the 10x2 dataset with the 2x1 vector and get a dataset with dimensionality 10x1 projected onto this vector. Though, the line onto which we have projected the data is no longer a vertical line but a slant line with slope 1.2/0.653.

In the above part we talked a lot about the idea behind the PCA as well as how we can find the (first) principal component of a dataset using kind of a graphical trial and error approach where we measured the variance of the projected dataset. Luckily, it turns out that there is a lot more convenient way of finding the principal components of a dataset. You might have noticed that I used the plural of principal component, principal components. A dataset has as many principal components as it has dimensions. That is, a 2D dataset has 2 principal components while a dataset with 3 dimensions has 3 principal components. To find these principal components (as well as which of them is the one which points into the direction of highest variance, which points into the direction of second highest variance and so on) and to finally transform the original dataset choosing the largest $m$ of these $n$ principal components, that is reducing the dimensionality from $n$ to $m$ dimensions, we have to perform in principal five steps:

  1. Collect the data
  2. Normalize the data
  3. Calculate the covariance matrix
  4. Find the eigenvalues and eigenvectors of the covariance matrix
  5. Use the principal components to transform the data - Reduce the dimensionality of the data
    Step 3 to 5 are new to us but trust me, though this way may seem a little out of the blue its worth it. The mystic here is to find the eigenvectors and eigenvalues of the covariance matrix of a dataset. I don't want to delve deep into the maths behind calculating the covariance matrix as well as finding the eigenvectors and eigenvalues of the covariance matrix and why the eigenvalues and eigenvectors of the covaricance matrix turn out to be the principal components of a dataset but just want to give a swift overview. For a mathematical proof on why the eigenvalues and eigenvectors of the covariance matrix turn out to be the principal components of a dataset, I refer the interested reader to the chapter about PCA of Marsland, S. (2015) pp.134. For persons who want to understand the maths behind eigenvectors and eigenvalues, I recommend to look these terms up into text books about linear algebra. But now, step by step (this example is based on Smith, L.I. (2002)):
    $\underline{Covariance \ \ matrix}$
    The covariance matrix is a matrix full of covariances. The covariance matrix is a square matrix of shape $nxn$ and consists of the covariances of each of the $n$ dimensions in a dataset with each other (If we have $n$ dimensions and each dimension is interacting with each other, we have $nxn$ interactions) $$\begin{bmatrix} cov(x_1,x_1) & cov(x_1,x_2) & ... & cov(x_1,x_n\\ cov(x_2,x_1) & cov(x_2,x_2) & ... & cov(x_2,x_n)\\ ...\\ cov(x_n,x_1) & cov(x_n,x_2) & ... & cov(x_n,x_n)) \end{bmatrix}$$
    Where the formula for the covariance is: $$cov(x,y) \ \ = \ \ \frac{\sum_{i=1}^{n}{(x_i-\bar{x})(y_i-\bar{y})}}{n-1}$$
    Here $x_i$ and $y_i$ are for instance $x_1$ and $x_2$ but could also be $x_1$ and $x_1$. If this is the case, that is if $x_i == y_i$ then the covariance is == the variance. This is the case for the elements on the diagonal of the matrix.

    $\underline{Eigenvectors}$
    The formula to find an eigenvectors is:
    $\boldsymbol{\sum{}}*\boldsymbol{\nu} \ \ = \lambda \boldsymbol{\nu}$ where $\sum$ is the covariance matrix, $\nu$ is the so called eigenvector and $\lambda$ is the so called eigenvalue which is a scalar.
    Here we we have two types of multiplication. Vector multiplication on the left hand side and multiplication of a scalar with a vector on the right hand side. To make this a little more equal, we kind of multiply in the identity matrix on the right hand side which does not change the result since the identity matrix has ones on its diagonal and zeros everywhere else. With this step we can rewrite the above equation:
    $\boldsymbol{\sum{}}*\boldsymbol{\nu} \ \ = \lambda\boldsymbol{I}*\boldsymbol{\nu}$
    $(\boldsymbol{\sum{}}*\boldsymbol{\nu})-(\lambda\boldsymbol{I}*\boldsymbol{\nu}) \ \ =\boldsymbol{0} $
    $(\boldsymbol{\sum{}}-\lambda\boldsymbol{I})*\boldsymbol{\nu} \ \ =\boldsymbol{0} $
    Mind that the multiplication of $\lambda\boldsymbol{I}$ is still a multiplication of a scalar with the identity matrix which results in a matrix with $\lambda$ on its diagonal and zeros everywhere else. Each eigenvector has a corresponding eigenvalue and together they are called an eigenpair. The eigenvector with the largest eigenvalue is the Principal Component which points into the direction of the highest variance where the magnitude of the eigenvalue indicates the magnitude of the variance of the dataset in this direction. What does the formula tell us? We have to find a vector $\nu$ which gives, "dot producted" with the covariance matrix $\sum$ the same result as when multiplied with a scalar $\lambda$.
    So in 2D space, the multiplication of the 2D (covariance)matrix $\boldsymbol{\sum}$ with a vector $\boldsymbol{\nu}$ gives the same result as multiplying the vector $\boldsymbol{\nu}$ by a scalar $\boldsymbol{\lambda}$.
    So how can we solve this equation?
    One obvious solution is to set $\boldsymbol{\nu}=\boldsymbol{0}$ but that's boring and trivial isn't it? So what other, non-trivial solutions can we find? First of all, the above equation (we stay in 2 dimensions) gives us a linear system of equations. From linear algebra we know, that the above linear system of equations has only a nontrivial solution ($\boldsymbol{\nu}\neq0$) if $det(\boldsymbol{\sum{}}-\lambda\boldsymbol{I})=0$
    To make this a little bit more clear we use an example from Papula (2015).
    Consider the the following equation which describes a homogeneous linear system of equations:
    $\boldsymbol{A}*\boldsymbol{x}=\boldsymbol{0}$ where you can assume that $A$ is $(\boldsymbol{\sum{}}-\lambda\boldsymbol{I})$ and $x$ is $\nu$
    We can write this in matrix notation as:

    $\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ \end{bmatrix} \ \ * \ \ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \ \ = \ \ \begin{bmatrix} 0 \\ 0 \end{bmatrix}$

And as linear equations:
$a_{11}x_1+a_{12}x_2 \ \ = \ \ 0$
$a_{21}x_1+a_{22}x_2 \ \ = \ \ 0$

We solve this linear system of equations with:
$\begin{align} (I) \ \ a_{11}x_1+a_{12}x_2 \ \ = \ \ 0 \ \ |*a_{22} \\ (II) \ \ a_{21}x_1+a_{22}x_2 \ \ = \ \ 0 \ \ |*a_{12} \\ \hline\\ (I-II) \ \ a_{11}a_{22}x_1-a_{21}a_{12}x_1 \ \ = \ \ 0 \\ (a_{11}a_{22}-a_{21}a_{12})x_1 \ \ = \ \ 0 \end{align}$

Doing the same for $x_2$ by multiplying (I) with $-a_{22}$ and (II) with $a_{12}$ gives:
$(a_{11}a_{22}-a_{21}a_{12})x_2 \ \ = \ \ 0$

Hence for $x_1$ and $x_2$ we have:
$\underbrace{(a_{11}a_{22}-a_{21}a_{12})}_{\text{Det(A)}} \ \ x_1 \ \ = \ \ 0$
$\underbrace{(a_{11}a_{22}-a_{21}a_{12})}_{\text{Det(A)}} \ \ x_2 \ \ = \ \ 0$

As you can see, the first part of the equation is exactly the determinant for a two dimensional matrix and hence only if $Det(A) \ \ = \ \ 0$ that is $Det(\boldsymbol{\sum{}}-\lambda\boldsymbol{I}) \ \ = \ \ 0$ there is a nontrivial solution.

In a geometric sense, the determinant of a matrix represents the change in the area spanned by the unit vectors when transformed with this matrix. Since we want to have $(\boldsymbol{\sum{}}-\lambda\boldsymbol{I})*\boldsymbol{\nu} \ \ =\boldsymbol{0}$, the area spanned by the matrix $(\boldsymbol{\sum{}}-\lambda\boldsymbol{I})$ and the vector $\boldsymbol{\nu}$ must be 0. Hence if a matrix has a determinant of 0, the area (==vector product) of the unit vectors after applying the transformation is $0*old area$. Further, we know that if the spanned area is 0, the vectors must align in one line. We can make this more clear by visualizing the above.

</div>

import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')
import numpy as np
import matplotlib.patches as patches
def f(lamb):
    # Unit vectors
    e_x = np.array([1,0])
    e_y = np.array([0,1])
    # Area spanned by the unit vectors
    print(np.cross(e_x,e_y)) # Area spanned by the unit vectors == 1
    # Any 2D matrix A of the shape (A-lambda*I)
    A = np.array([[2-lamb,3],[3,0.5-lamb]])
    # Transform the unit vectors by the matrix A --> Unsurprisingly this is exactly the matrix A but otherwise the notation
    # of "the determinant describes the change of the area spanned by the unit vectors after trasnformation" makes no sense 
    # Plot the vectors
    fig = plt.figure(figsize=(10,10))
    ax0 = fig.add_subplot(111)
    ax0.set_xlim(-5,8)
    ax0.set_ylim(-5,8)
    ax0.set_aspect('equal')
    # Vector of matrix A
    ax0.arrow(0,0,A[0][0],A[0][1],color="red",linewidth=1,head_width=0.05) #First vector 
    ax0.arrow(0,0,A[1][0],A[1][1],color="blue",linewidth=1,head_width=0.05) # Second vector
    # Area spanned by the vectors**
    ax0.arrow(A[0][0],A[0][1],A[1][0],A[1][1],color="blue",linestyle='dashed',alpha=0.3,linewidth=1,head_width=0.05)
    ax0.arrow(A[1][0],A[1][1],A[0][0],A[0][1],color="red",linestyle='dashed',alpha=0.3,linewidth=1,head_width=0.05)
    ax0.add_patch(patches.Polygon(xy=[[0,0],[A[0][0],A[0][1]],[A[0][0]+A[1][0],A[0][1]+A[1][1]],[A[1][0],A[1][1]]],fill=True,alpha=0.1,color='yellow'))
    
    # Add text which shows the calculation of the determinant and the area
    ax0.text(3,-0,s=r'$determinant = a_{11}*a_{22}-a_{21}*a_{12}$'+'= {0}'.format(np.round(A[0][0]*A[1][1]-A[1][0]*A[0][1],3)))
    #ax0.text(3,-1,s='area = {0}'.format(np.round(np.cross(A.T[0],A.T[1]),3)))
    ax0.text(3,-0.5,s=r'$determinant$'+'= {0}*{1}-{2}*{3} = {4}'.format(A[0][0],A[1][1],A[1][0],A[0][1],np.round(A[0][0]*A[1][1]-A[1][0]*A[0][1],3)))
    ax0.text(3,-4,s='**Mind that in this case the value of the determinant \n and the area(cross product --> Yellow shaded) are the same \n since the area spanned by the unit vectors is 1',fontsize=8)
    
    
    # Plot the eigenvectors
    ax0.arrow(0,0,0.61505,-0.788491,color="black",linestyle='dashed',alpha=0.3,linewidth=1,head_width=0.05)
    ax0.arrow(0,0,0.78771,0.6159,color="black",linestyle='dashed',alpha=0.3,linewidth=1,head_width=0.05)
    
    # Caclulate (A-lambda I)*nu for different values of lambda using the found eigenvectors. The result must be
    # 0 when nu is perpendicular to (A-lambda I)
    # Mind that for the calculation of v1 and v2 we have to solve the linear system of equations (A-lambda I)*v=0
    # for v1 and v2   
    
    v1 = -3*(((-1+0.5*lamb)/(-9-2*lamb+lamb**2)))/(2-lamb)
    v2 = (-1+0.5*lamb)/(-9-2*lamb+lamb**2)
    v = np.array((1/np.sqrt(v1**2+v2**2))*np.array([v1,v2]))
    ax0.text(3,-1,s=r'$(A-$'+'{0}'.format(lamb)+r'$I)*\nu$'+'= {0}'.format(np.round(np.dot(A,v),3)))
    ax0.arrow(0,0,-v[0]*0.5,-v[1]*0.5,color="green",alpha=0.8,linewidth=1,head_width=0.05) # We draw the eigenvector for lambda
                                                                                        # Mind v[0]*0.5 and v[1]*0.5 --> The *0.5
                                                                                        # is solely done for visualization purposes
    
   
    
    
    
    
    plt.show()
    
interact(f,lamb=(-5,5,0.001))
 
After having executed the Python code above we received the following:
<function __main__.f>
So as you can see, playing around with the sliders leads to two values for $\lambda$ where the determinant is very close to zero. To come back to our original question of solving the eigenvector equation:

$\boldsymbol{A}*\boldsymbol{\nu} \ \ = \ \ \lambda\boldsymbol{\nu}$

we should now be able to input these values into the equation and solve for $\boldsymbol\nu$ to find the corresponding $eigenvectors$ to the found $eigenvalues$. Doing this leads to two unit-vectors (eigenvectors) which we call $e_{\lambda_1}$ and $e_{\lambda_2}$. Mind here, that these unit vectors are standardized with $\frac{1}{|a|}\boldsymbol{a}$ to have unit length.

So assuming, $\boldsymbol{A}$ is the covariance matrix of an arbitrarily dataset, we now have found the eigenvectors and eigenvalues of this dataset. Uff.. We can check this by inserting the found values into the equation:
$(\boldsymbol{A}-\lambda\boldsymbol{I})*\boldsymbol\nu \ \ = \ \ 0$

This must be true! As you can see in the illustration above, the eigenvectors are perpendicular to each other and the vector $(\boldsymbol{A}-\lambda\boldsymbol{I})$ any time the determinant of this vector is zero. This is claimed by $(\boldsymbol{A}-\lambda\boldsymbol{I})*\boldsymbol\nu \ \ = \ \ 0$ since the dot product of two perpendicular vectors is zero. Also, the green vector which illustrates $\nu$ for different values of $\lambda$ aligns with the eigenvectors. As you can see, the moment the red and the green vectors align with each other and the green vector aligns with one of the two black dashed arrows (eigenvectors), the equation is fulfilled and the result of $(\boldsymbol{A}-\lambda\boldsymbol{I})*\boldsymbol\nu$ equals the zero vector.

If it is not totally clear to you how the eigenvectors and eigenvalues are calculated, the geometric interpretation of eigenvectors and eigenvalues can be looked up in the [youtube playlist](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab). But if you think the understanding of eigenvectors and eigenvalues as well as how they are calculated is not that important for you, and you just want to remember that you have to find the eigenvectors and corresponding eigenvalues of the covariance matrix to find the principal components of a dataset, that's ok. For the practical implementation, the most critical part is indeed that you remember that you have to find the eigenvalues and eigenvectors of the covariance matrix of a dataset to find the principal components of the dataset. The above calculations and illustrations should help you to understand how eigenvectors and eigenvalues can be found by hand but normally you want to use prepackaged methods which solve this for you as for instance the [np.linalg.eig](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.eig.html) method which computes the eigenvalues as well as the eigenvectors of a square matrix.

So let's quickly recapitulate where we are on our 5 steps:

  1. Collect the data
  2. Normalize the data
  3. Calculate the covariance matrix
  4. Find the eigenvalues and eigenvectors of the covariance matrix
  5. Use the principal components to transform the data - Reduce the dimensionality of the data
Above we ran through steps 1 to 4, so the transformation of data using the eigenvectors is next.
$\underline{Reduce \ \ the \ \ dimensionality \ \ of \ \ the \ \ data \ \ - \ \ Putting \ \ all \ \ together}$
Once we have found the eigenvectors and eigenvalues of a dataset we can finally use these vectors (which are the principal components oft the dataset) to reduce the dimensionality of the data, that is to project the data onto the principal components.
So let's do this and while doing so run through all of the above steps to show how dimensionality reduction using the PCA can be accomplished with Python from scratch before we use the prepackaged sklearn PCA method. To illustrate this, we will use the [UCI Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')
import pandas as pd
"""1. Collect the data"""
df = pd.read_table('data\Wine.txt',sep=',',names=['Alcohol','Malic_acid','Ash','Alcalinity of ash','Magnesium','Total phenols',
                                                 'Flavanoids','Nonflavanoid_phenols','Proanthocyanins','Color_intensity','Hue',
                                                 'OD280/OD315_of_diluted_wines','Proline'])
target = df.index
"""2. Normalize the data""" 
df = StandardScaler().fit_transform(df)
"""3. Calculate the covariance matrix"""        
COV = np.cov(df.T) # We have to transpose the data since the documentation of np.cov() sais 
                   # Each row of `m` represents a variable, and each column a single
                   # observation of all those variables 
"""4. Find the eigenvalues and eigenvectors of the covariance matrix""" 
eigval,eigvec = np.linalg.eig(COV)
print(np.cumsum([i*(100/sum(eigval)) for i in eigval])) # As you can see, the first two principal components contain 55% of
                                                        # the total variation while the first 8 PC contain 90%
"""5. Use the principal components to transform the data - Reduce the dimensionality of the data"""
# The wine dataset is 13 dimensional and we want to reduce the dimensionality to 2 dimensions 
# Therefore we use the two eigenvectors with the two largest eigenvalues and use this vectors 
# to transform the original dataset.
# We want to have 2 Dimensions hence the resulting dataset should be a 178x2 matrix.
# The original dataset is a 178x13 matrix and hence the "principal component matrix" must be of 
# shape 13*2 where the 2 columns contain the covariance eigenvectors with the two largest eigenvalues
PC = eigvec.T[0:2]
data_transformed = np.dot(df,PC.T) # We have to transpose PC because it is of the format 2x178
# Plot the data
fig = plt.figure(figsize=(10,10))
ax0 = fig.add_subplot(111)
ax0.scatter(data_transformed.T[0],data_transformed.T[1])
for l,c in zip((np.unique(target)),['red','green','blue']):
    ax0.scatter(data_transformed.T[0,target==l],data_transformed.T[1,target==l],c=c,label=l)
ax0.legend()
plt.show()
c:\users\tobia\python\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
[ 36.1988481   55.40633836  66.52996889  73.59899908  80.16229276
  85.09811607  89.3367954   90.13201029  92.81275924  94.11099181
  96.33314522  98.06998091 100.        ]
As you can see, the 13 dimensional dataset has been reduced to a 2 dimensional dataset which still entails 55% of the total variation and which we now can plot into a two dimensional coordinate system. Mind that normally we do not have the target feature values of a dataset since the PCA is an unsupervised learning algorithm. Though, we have included the target feature values here to show that the dataset is still very well separable with only two dimensions. So what we have done above is that we have kind of created new features from the other features by transforming the dataset using the principal components of the dataset and therewith reduced the dimensionality of the dataset (the remaining columns of our transformed dataset serve as new features) without loosing too much information.
The above code is a lot shorter and more convenient than searching for the principal components by hand. Next we will (as always) make this even more efficient using the prepackaged [sklearn PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)

PCA using sklearn

from sklearn.decomposition import PCA
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')
import pandas as pd
"""1. Collect the data"""
df1 = pd.read_table('data\Wine.txt',sep=',',names=['Alcohol','Malic_acid','Ash','Alcalinity of ash','Magnesium','Total phenols',
                                                 'Flavanoids','Nonflavanoid_phenols','Proanthocyanins','Color_intensity','Hue',
                                                 'OD280/OD315_of_diluted_wines','Proline'])
target1 = df1.index
"""2. Normalize the data""" 
df1 = StandardScaler().fit_transform(df)
"""3. Use the PCA and reduce the dimensionality"""
PCA_model = PCA(n_components=2,random_state=42) # We reduce the dimensionality to two dimensions and set the
                                                            # random state to 42
data_transformed = PCA_model.fit_transform(df1,target)*(-1) # If we omit the -1 we get the exact same result but rotated by 180 degrees --> -1 on the y axis turns to 1.
                                                            # This is due to the definition of the vectors. We can define a vector a as [-1,-1] and as [1,1] 
                                                            # the lines spanned is the same --> remove the *(-1) and you will see
# Plot the data
fig = plt.figure(figsize=(10,10))
ax0 = fig.add_subplot(111)
ax0.scatter(data_transformed.T[0],data_transformed.T[1])
for l,c in zip((np.unique(target)),['red','green','blue']):
    ax0.scatter(data_transformed.T[0,target==l],data_transformed.T[1,target==l],c=c,label=l)
ax0.legend()
plt.show()
As you can see, we need really just a few lines of code to accomplish PCA. Once you have understand the idea behind the PCA you can use this really convenient prepackaged sklearn method without worries to reduce the dimensionality of a dataset to make it either *plottable* or to reduce the size of your dataset without loosing too much of the encoded information. Read the docs to see how you can use the attributes of this function. For instance, you can print the principal axes in feature space as well as the explained variance of each of the selected components. Just play around. Congratulations, if you can follow all the steps above you have understand one of the more complicated machine learning algorithms. Done!

References