Friday, February 12, 2010

Hierarchical Clustering in R

Clustering is a very widely used technique in data analysis. In machine learning, it is classified as an unsupervised learning method.
In gene expression data analysis, clustering is very helpful for learning about genes. If a gene belongs to a particular cluster and we already know the functions of the other genes in that cluster, we can hypothesize that the gene has functions similar to theirs.
The most widely used clustering algorithms are K-means, fuzzy C-means, and hierarchical clustering.
Hierarchical clustering can further be agglomerative or divisive. Look up here for a simple tutorial on the same.
I tried out hierarchical clustering on PCR data in R.
I had a matrix like this:
        sample1 sample2 sample3 sample4 sample5
gene1
gene2
gene3
gene4
gene5

Each gene/sample combination above has an expression value (a delta Ct value). The data is read into R with the read.csv() function; since read.csv() returns a data frame, it is then converted to a matrix with as.matrix().
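As a rough sketch of that loading step: the file contents, the values in it, and the column layout below are made up for illustration (the real file is not shown in this post), and deltactmatrix is simply the name used later on.

```r
# Hypothetical file contents: genes in rows, samples in columns.
# The numbers are invented placeholders, not real delta Ct values.
csv_text <- "gene,sample1,sample2,sample3
gene1,2.1,2.4,5.0
gene2,1.9,2.2,4.8
gene3,6.3,6.0,1.2"

# Write to a temporary file so the example is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

# row.names = 1 uses the first column (gene names) as row labels.
deltact <- read.csv(path, row.names = 1)

# read.csv() returns a data frame; convert it to a numeric matrix.
deltactmatrix <- as.matrix(deltact)

print(dim(deltactmatrix))
print(rownames(deltactmatrix))
```

Keeping the gene names as row names (rather than as a data column) matters, because they become the leaf labels of the dendrogram later.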

To build hierarchical clusters, we need a distance (dissimilarity) matrix.
It can be computed directly, using Euclidean distance.
It can also be derived from a similarity measure, using Pearson distance.
Pearson distance is given as d = 1 - r, where r is Pearson's correlation coefficient, so d ranges from 0 (perfect positive correlation) to 2 (perfect negative correlation).
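To make the d = 1 - r formula concrete, here is a small sketch with made-up vectors (x, y_same and y_opp are illustrative names, not part of my data): one pair that rises together and one pair that moves in opposite directions.

```r
x      <- c(1, 2, 3, 4)
y_same <- c(2, 4, 6, 8)   # rises with x, so r is 1
y_opp  <- c(8, 6, 4, 2)   # falls as x rises, so r is -1

# Pearson distance: d = 1 - r
d_same <- 1 - cor(x, y_same, method = "pearson")
d_opp  <- 1 - cor(x, y_opp,  method = "pearson")

print(d_same)   # close to 0: the profiles are treated as identical
print(d_opp)    # close to 2: the profiles are as far apart as possible
```

Note that Pearson distance ignores scale: y_same is twice x, yet the distance is still about 0, whereas Euclidean distance between the two would not be.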

In R, dist() is the function to use for Euclidean distance. It computes the distances between the rows of the matrix (here, the genes).
distmatrix = dist(deltactmatrix, method="euclidean")
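A tiny check of what dist() returns, using a toy matrix (the gene names and values are made up so that the distances are easy to verify by hand):

```r
# Three toy "genes" measured in two "samples".
m <- rbind(geneA = c(0, 0),
           geneB = c(3, 4),
           geneC = c(6, 8))
colnames(m) <- c("sample1", "sample2")

# dist() works row-wise, so this gives gene-to-gene distances.
d <- dist(m, method = "euclidean")

# as.matrix() expands the compact "dist" object into a full square matrix.
dm <- as.matrix(d)
print(dm)
```

The geneA to geneB entry is sqrt(3^2 + 4^2) = 5, which confirms that rows, not columns, are being compared.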

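dist() doesn't offer Pearson's method. Other functions have been added more recently, but I simply did it in the following manner:
simmatrix = as.dist(1-cor(deltactmatrix))
Note that cor() correlates the columns of the matrix, so this gives distances between samples. To get distances between genes (the rows), as dist() does, transpose first: as.dist(1-cor(t(deltactmatrix))).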
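One thing worth checking with this approach is which dimension ends up being compared, since cor() works on columns while dist() works on rows. A sketch on simulated data (the random values and the 5 x 4 shape are assumptions for illustration only):

```r
set.seed(1)

# Simulated 5 genes x 4 samples matrix; values are random placeholders.
m <- matrix(rnorm(5 * 4), nrow = 5,
            dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4)))

# cor() correlates columns, so this is a sample-to-sample distance.
d_samples <- as.dist(1 - cor(m))

# Transposing first gives a gene-to-gene distance instead.
d_genes <- as.dist(1 - cor(t(m)))

print(attr(d_samples, "Size"))    # number of samples being compared
print(attr(d_genes, "Labels"))    # gene names
```

Whichever object is passed to hclust() decides whether the dendrogram leaves are samples or genes.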

Then use hr <- hclust(distmatrix, method="complete") to build the clusters (pass simmatrix instead of distmatrix to use the Pearson distance).
Then call plot(hr) to see the dendrogram.
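Put together on a toy matrix, the clustering step might look like the sketch below. The four genes and their values are invented so that two obvious groups exist; they are not my PCR data.

```r
# Toy data: gene1/gene2 are low, gene3/gene4 are high.
m <- rbind(
  gene1 = c(1.0, 1.2, 0.9),
  gene2 = c(1.1, 1.0, 1.0),
  gene3 = c(5.0, 5.2, 4.9),
  gene4 = c(5.1, 4.8, 5.0)
)

# Complete linkage on Euclidean distances between rows.
hr <- hclust(dist(m), method = "complete")

# hr$merge records which items were joined at each step,
# hr$height the distance at which each join happened.
print(hr$merge)
print(hr$height)

# Draw the dendrogram (goes to Rplots.pdf when run non-interactively).
plot(hr)
```

With n items there are always n - 1 merges, and the last one, which joins the two groups, happens at a much larger height than the first two.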

Further steps include cutting the tree at some point to decide which clusters we want. The function cutree() does this: specify either the number of clusters (k) or the height at which to cut (h), and you get the cluster membership of each item. Validating the clusters is also important, and there are several methods (like bootstrapping hclust results, among others) described online in several papers.
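A minimal sketch of cutting the tree with cutree(), again on invented toy data with two obvious groups; the choices k = 2 and h = 1 are arbitrary values picked to suit this toy example.

```r
# Toy data: gene1/gene2 are low, gene3/gene4 are high.
m <- rbind(
  gene1 = c(1.0, 1.2, 0.9),
  gene2 = c(1.1, 1.0, 1.0),
  gene3 = c(5.0, 5.2, 4.9),
  gene4 = c(5.1, 4.8, 5.0)
)
hr <- hclust(dist(m), method = "complete")

# Cut by number of clusters ...
groups <- cutree(hr, k = 2)

# ... or by height on the dendrogram's vertical axis.
by_height <- cutree(hr, h = 1)

print(groups)

# List the genes in each cluster.
print(split(names(groups), groups))
```

Both calls return a named integer vector giving a cluster number per gene; here they agree because a cut at height 1 falls between the small within-group merges and the large final merge.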
