Monday, June 7, 2010

PCA & Biplots

 For anyone who is new to PCA and would like to know the math behind it, these are the two papers one must probably read. One is a tutorial by Lindsay. I.Smith and another is a very good paper by Jonathan Shlens.


PCA is widely used in microarray gene expression studies as well as PCR to understand the behavior of genes in different samples. Say one is studying expression of 1000 genes in 5 different patients; these values form a matrix of 5 x 1000 matrix. If each of these 1000 genes is plotted in a multi-dimensional scatter plot consisting of 5 axes, 1 for each patient, the result will be a cloud of values in multi-dimensional space. PCA extracts the directions where the cloud is more extended. PCA can be performed on genes as well as samples depending on the type of analysis one wants to perform. If there is a serial type or dose experiment with time or concentration as parameter, then to find out principal gene expression profiles PCA can be performed on genes. If one wants to find out prevalent expression profiles among samples regardless of individual genes’ expression patterns, PCA on Conditions can be performed. (read here) Consider a case where one is using PCR to study how fifteen different genes are expressed in each of four strains of a bacteria and say this was done for eight time points. This is a case of multiway studies and in this case matrix-augmented PCA can be performed to understand expression of genes across four different strains at once.(read here)

In R, prcomp() is the function commonly used to perform PCA. The input to this function is the gene expression matrix or deltact values matrix (in PCR) with genes as rows, samples as columns (or vice versa) and the expression values or deltact vales as the respective elements in the matrix.
Once the PCA is performed usually the results are interpreted by plotting scatter plots. And usually PC1, PC2, PC3 are the three components considered as they most of the times have all the variation in the data expressed.

A biplot is a scatter plot which allows information on both samples and variables(or genes) to be displayed graphically. Usually samples are displayed as points while variables are displayed as vectors. Depending on the PCA being performed, the points can be genes or conditions/samples and so are vectors.
In R, biplot() function is used to generate the biplot. It by-default generates scatter plot of PC1 and PC2 showing their respective loadings.
plot (inputmatrix$x[,2:3]) gives a scatter plot of PC2 vs PC3.

Interpreting Biplots:
A biplot has samples as points and variables as vectors.
Depending on the angle between two vectors a correlation can be inferred between the variables. Thus if two vectors are at acute angle to each other, it can be said that the two variables are positively correlated. If the angle is obtuse, they are said to be negatively correlated and if the angle is right angle there is said to be no correlation between the two variables. This way a biplot can be used to understand how genes differ in their expression in two different samples/conditions and also tell if two genes are correlated or not.

Wednesday, March 10, 2010

Links to Installing Virtuoso on Windows

In one of my earlier posts I wrote about installing Virtuoso triple store on Unix. In this post I thought I will put the links here for my future reference.

32-bit or 64-bit binary files are available here.

Installation steps are very well documented here. However do the Setup DSN only after creating a windows service for the default database.

Virtuoso is a fully functional database. It has in-built table called RDF_QUAD where it stores the triples.

If you have a scenario where in you want to convert the data that you have in your relational database into triples, you could use virtuoso to do it, however its not part of the open source release, but part of their commercial release.

Sunday, February 28, 2010

Cluster HeatMaps

I had heard of HeatMaps, but heard of Cluster HeatMaps very recently only when I had to generate one. It is being used a lot in Bioinformatics for gene expression studies. 

Weinstein describes HeatMaps as following: (taken from here)
"In the case of gene expression data, the color assigned to a point in the heat map grid indicates how much of a particular RNA or protein is expressed in a given sample. The gene expression level is generally indicated by red for high expression and either green or blue for low expression. Coherent patterns (patches) of color are generated by hierarchical clustering on both horizontal and vertical axes to bring like together with like. Cluster relationships are indicated by tree-like structures adjacent to the heat map, and the patches of color may indicate functional relationships among genes and samples. "

I could not, for long  time figure out how exactly they are generated. Reading here I got to understand that Cluster HeatMaps are generated by first performing hierarchial clustering for rows (i.e.genes) and then for columns(i.e. samples), generate their dendrograms and then use those for generating heat maps with trees for both genes and samples. I got to learn to do this in R(using hclust, heatmap functions) looking up here.(nice tutorial on R & Bioconductor)


Friday, February 12, 2010

Hierarchical Clustering in R

Clustering is very widely used technique in data analysis. It is classified into Unsupervised learning methos of data analysis in Machine Learning.
In the areas of gene expression data analysis, clustering is very helpful in getting to know about genes. If a gene belongs to a particular cluster and if we already know the functions of those genes present in the cluster, we can say that the gene also has functions as of those particular cluster of genes.
Most widely used clustering algorithms as K-means, Fuzzy-C-means, Hierarchial Clustering.
Hierarchial clustering can further be agglomerative or divisive. Look up here for a simple tutorial on the same.
I tried out Hierarchial clustering on PCR data in R.
I had a matric like this:
sample1 sample2 sample3 sample4 sample5
gene1
gene2
gene3
gene4
gene5

There are expression values (delta ct values)corresponding to each of the combination above. This data is read into R as a matrix usiing read.csv function.

To build hierarchial clusters, we need a Distance matrix or a similarity matrix.
For Distance matrix Euclidean Distance method can be used.
For Similarity matrix Pearson's distance can be used.
Pearson distance is given as d=1-r, where r is the Pearson's correlation coefficient.

In R dist() is the function to be used for euclidean distance.
distmatrix= dist(deltactmatrix,method="euclidean")

dist() doesn't have Pearson's method but there are other functions newly added, but I just did in the following manner.
simmatrix=as.dist(1-cor(deltactmatrix)).

Then use hr <- hclust(distmatrix/simmatrix, method="complete") to get clusters.
Then say plot(hr) to see the dendrograms.

Further steps include cutting the tree that is built at a point so that we can decide what clusters we want. There is a function called cut() specify the cluster and the height to cut at a point and then u get to see the clusters. Validating a cluster is also important and there are several methods (like bootstrapping hclusters & others) which one can find online in several papers.

Wednesday, February 10, 2010

Given an InChI how to get its 3D co-ordinates

Recently, as part of the oreChem project that I am working for, I had to fetch 3D structure in CML format, given an InChI. Some of the InChI's I had were present in PubChem and some of them were not present in PubChem.
For those InChI's present in PubChem:
Its difficult to fetch the 3D structure given an InChI by doing a String search on the PubChem database.I found it time consuming even if there is an md5 Index on it. Easy way I figured out is to fetch the CID of that particular InChI, use that CID and then fetch the 3D structure.
Using eutils REST services appending the InChI in the the url here I got the CID of a given InChI. By Parsing the xml output and extracting the CID and using the following url here I got the 3D structure in XML format.
Another way is to convert InChI into SMILES using InChItoSMILES converter and then use smi23d to get 3D structure in SDF format, then convert SDF format into XML/CML using OpenEyeBabel or any other tool. (in Open Babel simply do "babel source.sdf result.cml")

For those not present in PubChem:
For these InChI's one could use OpenEyeBabel or OpenEyeChem 3D structure generating programs. I had posted this question on Blue Obelisk Stack exchange, a question and answer website for Cheminformaticians(similar to StackOverflow,SemanticOverflow). Here is the answer on how to do it. The 3D co-ordinated generated by OpenBabel and OpenEyeChem were not the same, they will not be same. Look here for the reasons.

If you need to Install Openbabel look here.

Wednesday, January 27, 2010

Schema less database: CouchDB

CouchDB is a schema less , document oriented database. Documents are in JSON format.

Here is a nice video on the same.

A good summary on CouchDb here.
Here is the CouchDB page.



"PubCouch " : A CouchDB interface to PubChem : posted by Rich on his blog here.

Thursday, December 3, 2009

Correct way of Shutting down Virtuoso server

Here is the correct way to shut down the virtuoso server.
At the SQL prompt type SHUTDOWN. This is the normal way of shutting down the virtuoso server.
For virtuoso in particular killing the virtuoso process would result in all transactions since the last checkpoint not being comitted to the database and remaining in the transaction log file, which would then have to be replayed when the server is next started. Whereas performing a normal shutdown will automatically run a checkpoint to commit all outstanding transactions to the database before shutting down.

For more information read at http://docs.openlinksw.com/virtuoso/isql.html#invokingisql