Thursday, April 06, 2006

April 24 - HMB13.356 10-11 am - Unsupervised Learning with Random Forest Predictors

Our graduate student, Yuliya Karpievitch, who is deep in the Random Forest algorithms, will be telling us about the use of this technique by presenting the latest paper by Tao Shi and Steve Horvath (2006), Unsupervised Learning with Random Forest Predictors, Journal of Computational and Graphical Statistics (the link also gives you access to related tools).

Here's the abstract to get you started:

A random forest (RF) predictor (Breiman 2001) is an ensemble of individual tree predictors. As part of their construction, RF predictors naturally lead to a dissimilarity measure between the observations. One can also define an RF dissimilarity measure between unlabelled data: the idea is to construct an RF predictor that distinguishes the 'observed' data from suitably generated synthetic data (Breiman 2003). The observed data are the original unlabelled data while the synthetic data are drawn from a reference distribution. Recently, RF dissimilarities have been used successfully in several unsupervised learning tasks involving genomic data. Unlike standard dissimilarities, the relationship between the RF dissimilarity and the variables can be difficult to disentangle. Here we describe the properties of the RF dissimilarity and make recommendations on how to use it in practice. An RF dissimilarity can be attractive because it handles mixed variable types well, is invariant to monotonic transformations of the input variables, is robust to outlying observations, and accommodates several strategies for dealing with missing data. The RF dissimilarity easily deals with a large number of variables due to its intrinsic variable selection, e.g. the Addcl1 RF dissimilarity weighs the contribution of each variable on the dissimilarity according to how dependent it is on other variables. We find that the RF dissimilarity is useful for detecting tumor sample clusters on the basis of tumor marker expressions. In this application, biologically meaningful clusters can often be described with simple thresholding rules.
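
For anyone who wants to play with the idea before the meeting, here is a rough sketch of the two pieces the abstract describes: generating synthetic data and turning a forest's leaf assignments into a dissimilarity. It is only an illustration under assumptions, not the authors' code. The function names are made up for this post, the synthetic data are assumed to be drawn by sampling each variable independently from its observed values (my reading of the Addcl1 scheme), and the dissimilarity is assumed to be sqrt(1 - proximity), where proximity is the fraction of trees in which two observations share a leaf. Training the forest itself is left out.

```python
import math
import random


def addcl1_synthetic(observed, rng):
    """Sample each variable independently from its empirical marginal.

    This keeps every marginal distribution but breaks the dependence
    between variables (assumed here to be the Addcl1 reference distribution).
    """
    n_rows = len(observed)
    columns = list(zip(*observed))  # one tuple of values per variable
    return [[rng.choice(col) for col in columns] for _ in range(n_rows)]


def label_observed_vs_synthetic(observed, rng):
    """Build the two-class training set: 1 = observed, 0 = synthetic."""
    synthetic = addcl1_synthetic(observed, rng)
    X = [list(row) for row in observed] + synthetic
    y = [1] * len(observed) + [0] * len(synthetic)
    return X, y


def rf_dissimilarity(leaf_ids):
    """Turn per-tree leaf assignments into a dissimilarity matrix.

    leaf_ids[i][t] is the leaf that observation i lands in for tree t.
    Proximity is the fraction of trees in which two observations share
    a leaf; dissimilarity is taken as sqrt(1 - proximity) (an assumption).
    """
    n = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    dissim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(a == b for a, b in zip(leaf_ids[i], leaf_ids[j]))
            d = math.sqrt(1.0 - shared / n_trees)
            dissim[i][j] = dissim[j][i] = d
    return dissim
```

In practice you would fit any random forest classifier on `X, y`, pass only the observed rows back through it to get the leaf each one lands in per tree (scikit-learn's `RandomForestClassifier.apply` returns such a table, for example), and feed the resulting dissimilarity matrix to a clustering or scaling method of your choice.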

2 Comments:

Blogger Helena Deus said...

Very productive brainstorming in today's meeting, sorry you weren't there Jonas ;)
- we discussed whether the journal club should start collecting some of the very interesting ideas that are discussed and try to publish

5:38 PM  
Blogger Jonas Almeida said...

That is an intriguing idea, especially now that NCBI's director is endorsing a new type of open publication where the peer review is public and not anonymous, and where readers can comment as well. Have a look at Biology Direct for more info. It sounds like it could bridge the gap between blogs like this one and more rigid, community-endorsed journals.

11:52 AM  
