Sunday, February 28, 2010

Cluster HeatMaps

I had heard of HeatMaps, but came across Cluster HeatMaps only very recently, when I had to generate one. They are used a lot in Bioinformatics for gene expression studies.

Weinstein describes HeatMaps as follows (taken from here):
"In the case of gene expression data, the color assigned to a point in the heat map grid indicates how much of a particular RNA or protein is expressed in a given sample. The gene expression level is generally indicated by red for high expression and either green or blue for low expression. Coherent patterns (patches) of color are generated by hierarchical clustering on both horizontal and vertical axes to bring like together with like. Cluster relationships are indicated by tree-like structures adjacent to the heat map, and the patches of color may indicate functional relationships among genes and samples. "

For a long time I could not figure out how exactly they are generated. Reading here, I came to understand that Cluster HeatMaps are generated by first performing hierarchical clustering on the rows (i.e. genes) and then on the columns (i.e. samples), generating their dendrograms, and then using those to order the heat map and draw the trees for both genes and samples. I learned to do this in R (using the hclust and heatmap functions) by looking up here (a nice tutorial on R & Bioconductor).
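The steps above can be sketched in R roughly as follows. This is only a minimal illustration, not the tutorial's code: the expression values are random made-up numbers, the names expr, hr, hc and hm are my own, and scale = "none" is my choice (heatmap() scales rows by default).

```r
# Made-up expression matrix: 5 genes (rows) x 5 samples (columns).
set.seed(1)
expr <- matrix(rnorm(25), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("sample", 1:5)))

# Cluster the rows (genes) and the columns (samples) separately.
hr <- hclust(dist(expr), method = "complete")     # genes
hc <- hclust(dist(t(expr)), method = "complete")  # samples

# Draw the heat map; rows and columns are ordered by the two dendrograms,
# which are drawn along the left and top edges.
hm <- heatmap(expr,
              Rowv = as.dendrogram(hr),
              Colv = as.dendrogram(hc),
              scale = "none")
```

heatmap() invisibly returns the row and column order it used (hm$rowInd and hm$colInd), which is handy for checking which genes ended up next to each other.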


Friday, February 12, 2010

Hierarchical Clustering in R

Clustering is a very widely used technique in data analysis. In Machine Learning it is classified as an Unsupervised learning method.
In gene expression data analysis, clustering is very helpful for getting to know about genes. If a gene belongs to a particular cluster and we already know the functions of the other genes in that cluster, we can hypothesize that the gene has similar functions.
The most widely used clustering algorithms are K-means, Fuzzy-C-means and Hierarchical Clustering.
Hierarchical clustering can further be agglomerative or divisive. Look up here for a simple tutorial on the same.
I tried out Hierarchical clustering on PCR data in R.
I had a matrix like this:
sample1 sample2 sample3 sample4 sample5
gene1
gene2
gene3
gene4
gene5

There is an expression value (a delta Ct value) for each gene/sample combination above. This data is read into R using the read.csv function (which returns a data frame; as.matrix() turns it into a matrix).
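A small sketch of that reading step, assuming the CSV has the gene names in its first column and the sample names in its header row. The file contents and the delta Ct values here are made up for illustration, and csv_path and deltact are hypothetical names.

```r
# Write a tiny made-up CSV so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  ",sample1,sample2,sample3",
  "gene1,5.1,4.8,7.2",
  "gene2,6.3,6.0,3.9",
  "gene3,2.2,2.5,8.1"
), csv_path)

# row.names = 1 uses the first column (gene names) as row names.
deltact <- read.csv(csv_path, row.names = 1)

# read.csv() returns a data frame; convert it to a numeric matrix.
deltactmatrix <- as.matrix(deltact)
```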

To build hierarchical clusters, we need a distance matrix, either computed directly or derived from a similarity matrix.
For a direct distance matrix, the Euclidean distance can be used.
For a similarity-based one, Pearson's distance can be used.
Pearson distance is given as d=1-r, where r is the Pearson's correlation coefficient, so highly correlated profiles end up close together.

In R, dist() is the function to use for Euclidean distance. It computes distances between the rows of the matrix (here, the genes).
distmatrix <- dist(deltactmatrix, method="euclidean")

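dist() doesn't have a Pearson method. There are newer functions that do, but I just did it in the following manner:
simmatrix <- as.dist(1-cor(deltactmatrix))
Note that cor() correlates the columns, so this gives distances between samples; to get distances between genes (the rows, as dist() does), use as.dist(1-cor(t(deltactmatrix))).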
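To see how the two distances differ, here is a small sketch on made-up values chosen so the result is easy to check by hand. The variable pearson_genes is my own name, and the t() is there because cor() works on columns while dist() works on rows.

```r
# Made-up delta Ct values: 3 genes (rows) x 4 samples (columns).
deltactmatrix <- matrix(c(1, 2, 3, 4,
                          2, 4, 6, 8,
                          4, 3, 2, 1),
                        nrow = 3, byrow = TRUE,
                        dimnames = list(c("gene1", "gene2", "gene3"),
                                        paste0("sample", 1:4)))

# Euclidean distance between rows (genes).
distmatrix <- dist(deltactmatrix, method = "euclidean")

# Pearson distance d = 1 - r between rows (genes).
# cor() correlates columns, so transpose first.
pearson_genes <- as.dist(1 - cor(t(deltactmatrix)))

print(distmatrix)
print(pearson_genes)
```

gene1 and gene2 rise together, so their Pearson distance is 0 even though their Euclidean distance is not. gene3 runs in the opposite direction to gene1, which gives the maximum Pearson distance of 2.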

Then use hr <- hclust(distmatrix, method="complete") to get the clusters (pass simmatrix instead of distmatrix to cluster on Pearson distance).
Then call plot(hr) to see the dendrogram.
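Put together, the clustering step might look like the sketch below. The data are random made-up numbers rather than my PCR values, so the shape of the tree itself means nothing.

```r
# Made-up delta Ct matrix: 6 genes (rows) x 5 samples (columns).
set.seed(42)
deltactmatrix <- matrix(rnorm(30), nrow = 6,
                        dimnames = list(paste0("gene", 1:6),
                                        paste0("sample", 1:5)))

# Distances between genes, then complete-linkage hierarchical clustering.
distmatrix <- dist(deltactmatrix, method = "euclidean")
hr <- hclust(distmatrix, method = "complete")

# Draw the dendrogram; leaves are labelled with the row names.
plot(hr)
```

hr$merge records which items or clusters were joined at each step and hr$height the distance at which each join happened.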

Further steps include cutting the tree at some point to decide what clusters we want. For hclust objects the function is cutree(): specify either the number of clusters (k) or the height (h) to cut at, and you get the cluster each item belongs to. (cut() does a similar job on dendrogram objects.) Validating a cluster is also important, and there are several methods (like bootstrapping hierarchical clusters & others) which one can find online in several papers.
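A minimal sketch of cutting the tree, using made-up values with two obviously separate groups of genes so the outcome is easy to verify. The names clusters_k and clusters_h and the cut height of 5 are just illustrative choices.

```r
# Made-up values: gene1/gene2 are low, gene3/gene4 are high.
deltactmatrix <- rbind(gene1 = c(1.0, 1.2, 0.9),
                       gene2 = c(1.1, 1.0, 1.0),
                       gene3 = c(8.0, 8.3, 7.9),
                       gene4 = c(8.2, 8.1, 8.0))

hr <- hclust(dist(deltactmatrix), method = "complete")

# Cut by number of clusters ...
clusters_k <- cutree(hr, k = 2)

# ... or by height on the dendrogram.
clusters_h <- cutree(hr, h = 5)

print(clusters_k)
```

Both calls return a named integer vector giving the cluster number of each gene. Here gene1 and gene2 land in cluster 1 and gene3 and gene4 in cluster 2, whichever way the tree is cut.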

Wednesday, February 10, 2010

Given an InChI how to get its 3D co-ordinates

Recently, as part of the oreChem project that I am working on, I had to fetch the 3D structure in CML format for a given InChI. Some of the InChIs I had were present in PubChem and some of them were not.
For those InChIs present in PubChem:
It is difficult to fetch the 3D structure for a given InChI by doing a string search on the PubChem database. I found it time consuming even with an md5 index on it. The easy way I figured out is to fetch the CID of that particular InChI, and then use that CID to fetch the 3D structure.
Using the eutils REST services and appending the InChI to the URL here, I got the CID of a given InChI. By parsing the XML output to extract the CID and using the following URL here, I got the 3D structure in XML format.
Another way is to convert the InChI into SMILES using the InChItoSMILES converter, then use smi23d to get the 3D structure in SDF format, then convert the SDF into XML/CML using OpenEyeBabel or any other tool. (In Open Babel simply do "babel source.sdf result.cml".)

For those not present in PubChem:
For these InChIs one could use the OpenEyeBabel or OpenEyeChem 3D structure generating programs. I posted this question on Blue Obelisk Stack Exchange, a question and answer website for Cheminformaticians (similar to StackOverflow and SemanticOverflow). Here is the answer on how to do it. The 3D co-ordinates generated by OpenBabel and OpenEyeChem were not the same, and they are not expected to be. Look here for the reasons.

If you need to install Open Babel, look here.