Thursday, April 06, 2006

April 24 - HMB13.356 10-11 am - Unsupervised Learning with Random Forest Predictors

Our graduate student, Yuliya Karpievitch, who is deep in the Random Forest algorithms, will tell us about the use of this technique by presenting the latest paper from Tao Shi and Steve Horvath (2006), "Unsupervised Learning with Random Forest Predictors," Journal of Computational and Graphical Statistics (the link also gives you access to related tools).

Here's the abstract to get you started:

A random forest (RF) predictor (Breiman 2001) is an ensemble of individual tree predictors. As part of their construction, RF predictors naturally lead to a dissimilarity measure between the observations. One can also define an RF dissimilarity measure between unlabelled data: the idea is to construct an RF predictor that distinguishes the 'observed' data from suitably generated synthetic data (Breiman 2003). The observed data are the original unlabelled data while the synthetic data are drawn from a reference distribution. Recently, RF dissimilarities have been used successfully in several unsupervised learning tasks involving genomic data. Unlike standard dissimilarities, the relationship between the RF dissimilarity and the variables can be difficult to disentangle. Here we describe the properties of the RF dissimilarity and make recommendations on how to use it in practice. An RF dissimilarity can be attractive because it handles mixed variable types well, is invariant to monotonic transformations of the input variables, is robust to outlying observations, and accommodates several strategies for dealing with missing data. The RF dissimilarity easily deals with a large number of variables due to its intrinsic variable selection, e.g. the Addcl1 RF dissimilarity weighs the contribution of each variable on the dissimilarity according to how dependent it is on other variables. We find that the RF dissimilarity is useful for detecting tumor sample clusters on the basis of tumor marker expressions. In this application, biologically meaningful clusters can often be described with simple thresholding rules.
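For anyone who wants to play with the idea before the talk, here is a rough sketch of the recipe the abstract describes: label the real data as 'observed', generate synthetic data, train a forest to tell them apart, and turn shared leaves into a dissimilarity. It is not the authors' code (their tools are behind the link above). It assumes numpy and scikit-learn are installed, the function names `addcl1_synthetic` and `rf_dissimilarity` are made up for this post, the synthetic data are assumed to come from sampling each variable independently from its observed values (my reading of Addcl1), and the dissimilarity is assumed to be sqrt(1 - proximity).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def addcl1_synthetic(X, rng):
    """Hypothetical helper: build synthetic rows by sampling each column
    independently from its observed values (assumed Addcl1-style reference
    distribution). This keeps the marginals but breaks dependence
    between variables."""
    n, p = X.shape
    columns = [rng.choice(X[:, j], size=n, replace=True) for j in range(p)]
    return np.column_stack(columns)


def rf_dissimilarity(X, n_trees=500, seed=0):
    """Hypothetical helper: RF dissimilarity between the rows of unlabelled X."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)

    # Class 1 = observed data, class 0 = synthetic data.
    synthetic = addcl1_synthetic(X, rng)
    data = np.vstack([X, synthetic])
    labels = np.concatenate([np.ones(len(X), dtype=int),
                             np.zeros(len(synthetic), dtype=int)])

    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(data, labels)

    # leaves[i, t] is the index of the leaf that observed row i reaches in tree t.
    leaves = forest.apply(X)

    # Proximity = fraction of trees in which two observed rows share a leaf.
    # Note: this builds an n x n x n_trees array, so keep n small.
    proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

    # Assumed conversion from proximity to dissimilarity.
    return np.sqrt(1.0 - proximity)
```

The resulting matrix can be handed to any method that accepts a precomputed dissimilarity, for example hierarchical clustering or multidimensional scaling. The all-pairs comparison above is written for readability rather than memory efficiency.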