Monday, June 7, 2010

PCA & Biplots

For anyone who is new to PCA and would like to know the math behind it, there are two papers worth reading: a tutorial by Lindsay I. Smith and a very good paper by Jonathan Shlens.


PCA is widely used in microarray gene expression studies, as well as in PCR studies, to understand the behavior of genes in different samples. Say one is studying the expression of 1000 genes in 5 different patients; these values form a 5 x 1000 matrix. If each of these 1000 genes is plotted in a multi-dimensional scatter plot with 5 axes, one for each patient, the result is a cloud of points in multi-dimensional space. PCA extracts the directions along which the cloud is most extended.

PCA can be performed on genes as well as on samples, depending on the type of analysis one wants to perform. If the experiment is a time series or dose series, with time or concentration as the parameter, PCA can be performed on genes to find the principal gene expression profiles. If one wants to find the prevalent expression profiles among samples, regardless of the individual genes' expression patterns, PCA can be performed on conditions. (read here)

Consider a case where one is using PCR to study how fifteen different genes are expressed in each of four strains of a bacterium, at eight time points. This is a multiway study, and in this case matrix-augmented PCA can be performed to understand the expression of the genes across all four strains at once. (read here)
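To make the genes-versus-samples choice concrete, here is a minimal sketch in R. The data are simulated random numbers, and the names expr, pca_patients and pca_genes are made up for illustration. Which orientation one calls "PCA on genes" depends on convention; the sketch only shows that transposing the matrix changes what gets scored.

```r
# Simulated data, purely for illustration: 5 patients x 1000 genes.
set.seed(1)
expr <- matrix(rnorm(5 * 1000), nrow = 5, ncol = 1000,
               dimnames = list(paste0("patient", 1:5),
                               paste0("gene", 1:1000)))

# Patients as observations (rows), genes as variables (columns):
# every patient gets a score on each principal component.
pca_patients <- prcomp(expr)
dim(pca_patients$x)   # 5 rows, one per patient

# Genes as observations, patients as variables: transpose first,
# so every gene gets a score on each principal component.
pca_genes <- prcomp(t(expr))
dim(pca_genes$x)      # 1000 rows, one per gene
```

In both cases the number of components is limited by the smaller dimension of the matrix, which is 5 here.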

In R, prcomp() is the function commonly used to perform PCA. Its input is the gene expression matrix, or the deltaCt values matrix in the case of PCR, with genes as rows and samples as columns (or vice versa) and the expression or deltaCt values as the elements. prcomp() treats rows as observations and columns as variables, so the orientation of the matrix determines whether the PCA is performed on genes or on samples.
Once the PCA is performed, the results are usually interpreted through scatter plots of the component scores. PC1, PC2 and PC3 are the components usually considered, since together they often account for most of the variation in the data.
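Whether the first few components really carry most of the variation can be checked from the prcomp() output. The sketch below assumes a made-up deltaCt-like matrix (the name dct and all its values are simulated, not real data) with samples as rows and genes as columns.

```r
set.seed(42)
# Hypothetical deltaCt-like matrix: 8 samples (rows) x 15 genes (columns).
dct <- matrix(rnorm(8 * 15, mean = 5, sd = 2), nrow = 8,
              dimnames = list(paste0("sample", 1:8),
                              paste0("gene", 1:15)))

# Center each gene and scale it to unit variance before the PCA.
pca <- prcomp(dct, center = TRUE, scale. = TRUE)

# Proportion of the total variance carried by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Cumulative proportion for PC1, PC1+PC2, PC1+PC2+PC3.
round(cumsum(var_explained)[1:3], 2)

# Scatter plot of the samples on the first two components.
plot(pca$x[, 1:2], main = "PC1 vs PC2")
```

summary(pca) reports the same proportions in tabular form. With random data like this the first three components will not dominate; with real expression data they often do, but it is worth checking each time.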

A biplot is a scatter plot that displays information on both samples and variables (or genes) in the same graph. Usually samples are displayed as points and variables as vectors. Depending on the PCA being performed, the points can be genes or conditions/samples, and likewise the vectors.
In R, the biplot() function is used to generate the biplot. By default it plots the scores on PC1 and PC2 together with the variable loadings on those two components.
plot(inputmatrix$x[,2:3]) gives a scatter plot of PC2 vs PC3, where inputmatrix is the object returned by prcomp().
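As a small illustration of these calls, the sketch below runs them on simulated data (the matrix expr and its sample and gene names are invented for the example). It also uses the choices argument of biplot(), which selects the pair of components to draw.

```r
set.seed(3)
# Hypothetical data: 10 samples (rows) x 4 genes (columns).
expr <- matrix(rnorm(10 * 4), nrow = 10,
               dimnames = list(paste0("sample", 1:10),
                               paste0("gene", 1:4)))
pca <- prcomp(expr, scale. = TRUE)

biplot(pca)                  # default: PC1 vs PC2, points plus arrows
biplot(pca, choices = 2:3)   # same kind of display for PC2 vs PC3
plot(pca$x[, 2:3])           # scores only, PC2 vs PC3
```

When run as a script rather than interactively, R writes these plots to a file (Rplots.pdf by default) instead of opening a window.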

Interpreting Biplots:
A biplot has samples as points and variables as vectors.
The correlation between two variables can be inferred from the angle between their vectors. If two vectors are at an acute angle to each other, the two variables are positively correlated. If the angle is obtuse, they are negatively correlated, and if it is a right angle, the two variables are uncorrelated. This reading is approximate, and it is most reliable when the two plotted components account for most of the variation in the data. In this way a biplot can be used to understand how genes differ in their expression between two samples/conditions and to tell whether two genes are correlated.
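The angle rule can be checked numerically. The following is a sketch on simulated data: the variables g1, g2 and g3 and the helper angle_deg() are invented for this example, with g2 constructed to follow g1 and g3 constructed to oppose it. The arrows are taken as the loadings scaled by each component's standard deviation, which is an assumption about how the vectors are drawn; R's default biplot() scales them the same way up to a constant factor.

```r
set.seed(7)
n  <- 200
g1 <- rnorm(n)
g2 <-  g1 + rnorm(n, sd = 0.3)   # built to correlate positively with g1
g3 <- -g1 + rnorm(n, sd = 0.3)   # built to correlate negatively with g1
dat <- cbind(g1, g2, g3)

pca <- prcomp(dat, scale. = TRUE)

# Arrow coordinates in the PC1/PC2 plane:
# loadings scaled by the standard deviation of each component.
arrows2d <- pca$rotation[, 1:2] %*% diag(pca$sdev[1:2])

# Hypothetical helper: angle in degrees between two arrows.
angle_deg <- function(a, b) {
  cosang <- sum(a * b) / sqrt(sum(a^2) * sum(b^2))
  acos(max(-1, min(1, cosang))) * 180 / pi
}

angle_deg(arrows2d["g1", ], arrows2d["g2", ])   # acute
angle_deg(arrows2d["g1", ], arrows2d["g3", ])   # obtuse

# Compare with the actual correlations.
round(cor(dat), 2)
```

Here the first two components hold nearly all the variation, so the angles agree well with the correlations. When a large share of the variation sits in components that are not plotted, the angles can be misleading.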