Friday, May 05, 2006

May 8th - HMB13.356 10-11 am - Short blocks from the noncoding parts of the human genome have instances within nearly all known genes

Wow - the implications are enormous!

Brad Broom just found this paper, and we all look forward to hearing him present it on Monday (thank you, Brad, for organizing this on such short notice):

PNAS | April 25, 2006 | vol. 103 | no. 17 | 6605-6610

Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological processes

Isidore Rigoutsos*, Tien Huynh, Kevin Miranda, Aristotelis Tsirigos, Alice McHardy, and Daniel Platt

IBM Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598

Communicated by Thomas E. Shenk, Princeton University, Princeton, NJ, March 4, 2006 (received for review November 16, 2005)

Using an unsupervised pattern-discovery method, we processed the human intergenic and intronic regions and catalogued all variable-length patterns with identically conserved copies and multiplicities above what is expected by chance. Among the millions of discovered patterns, we found a subset of 127,998 patterns, termed pyknons, which have additional nonoverlapping instances in the untranslated and protein-coding regions of 30,675 transcripts from 20,059 human genes. The pyknons arrange combinatorially in the untranslated and coding regions of numerous human genes where they form mosaics. Consecutive instances of pyknons in these regions show a strong bias in their relative placement, favoring distances of ≈22 nucleotides. We also found pyknons to be enriched in a statistically significant manner in genes involved in specific processes, e.g., cell communication, transcription, regulation of transcription, signaling, transport, etc. For ≈1/3 of the pyknons, the intergenic/intronic instances of their reverse complement lie within 380,084 nonoverlapping regions, typically 60–80 nucleotides long, which are predicted to form double-stranded, energetically stable, hairpin-shaped RNA secondary structures; additionally, the pyknons subsume ≈40% of the known microRNA sequences, thus suggesting a possible link with posttranscriptional gene silencing and RNA interference. Cross-genome comparisons reveal that many of the pyknons have instances in the 3' UTRs of genes from other vertebrates and invertebrates where they are overrepresented in similar biological processes, as in the human genome. These unexpected findings suggest potential unique functional connections between the coding and noncoding parts of the human genome.

junk DNA | pattern discovery | posttranscriptional gene silencing | pyknons | RNA interference
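
To get a feel for the idea before Monday, here is a toy sketch in Python. It is not the authors' method: the paper discovers variable-length patterns with an unsupervised pattern-discovery tool and a statistical significance cutoff, whereas this sketch assumes a fixed block length `k` and a plain copy-number threshold `min_copies`. It also counts overlapping copies, which the paper does not. The function names and the tiny sequences are made up for illustration.

```python
from collections import Counter


def repeated_blocks(noncoding_seqs, k, min_copies):
    """Return the length-k blocks seen at least min_copies times
    across the given noncoding sequences (exact matches only)."""
    counts = Counter()
    for seq in noncoding_seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return {block for block, n in counts.items() if n >= min_copies}


def instances_in(transcript, blocks, k):
    """List (position, block) for every instance of a block in a transcript."""
    hits = []
    for i in range(len(transcript) - k + 1):
        window = transcript[i:i + k]
        if window in blocks:
            hits.append((i, window))
    return hits


# Hypothetical toy data: two "noncoding" sequences and one "transcript".
noncoding = ["ACGTACGTTT", "GGACGTCC"]
blocks = repeated_blocks(noncoding, k=4, min_copies=3)
hits = instances_in("TTACGTAAACGT", blocks, k=4)
```

The gaps between consecutive hit positions are the kind of spacing the abstract refers to when it mentions a bias toward distances of about 22 nucleotides.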

Thursday, April 06, 2006

April 24 - HMB13.356 10-11 am - Unsupervised Learning with Random Forest Predictors

Our graduate student Yuliya Karpievitch, who is deep into Random Forest algorithms, will tell us about the use of this technique by presenting the latest from Tao Shi and Steve Horvath (2006), Unsupervised Learning with Random Forest Predictors, Journal of Computational and Graphical Statistics (the link also gives you access to related tools).

Here's the abstract to get you started:

A random forest (RF) predictor (Breiman 2001) is an ensemble of individual tree predictors. As part of their construction, RF predictors naturally lead to a dissimilarity measure between the observations. One can also define an RF dissimilarity measure between unlabelled data: the idea is to construct an RF predictor that distinguishes the 'observed' data from suitably generated synthetic data (Breiman 2003). The observed data are the original unlabelled data while the synthetic data are drawn from a reference distribution. Recently, RF dissimilarities have been used successfully in several unsupervised learning tasks involving genomic data. Unlike standard dissimilarities, the relationship between the RF dissimilarity and the variables can be difficult to disentangle. Here we describe the properties of the RF dissimilarity and make recommendations on how to use it in practice. An RF dissimilarity can be attractive because it handles mixed variable types well, is invariant to monotonic transformations of the input variables, is robust to outlying observations, and accommodates several strategies for dealing with missing data. The RF dissimilarity easily deals with large number of variables due to its intrinsic variable selection, e.g. the Addcl1 RF dissimilarity weighs the contribution of each variable on the dissimilarity according to how dependent it is on other variables. We find that the RF dissimilarity is useful for detecting tumor sample clusters on the basis of tumor marker expressions. In this application, biologically meaningful clusters can often be described with simple thresholding rules.
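
The "observed versus synthetic" trick in the abstract can be sketched in a few lines of Python. This covers only the data-preparation step, and it assumes one particular reference distribution: each variable is sampled independently from its own observed values, which keeps the marginals but breaks the dependence between variables. The abstract itself only says "a reference distribution", so treat that choice, the function name and the toy data as assumptions.

```python
import random


def make_rf_training_set(observed, seed=0):
    """Build a labelled data set from unlabelled rows.

    Observed rows get label 1. Synthetic rows get label 0 and are built by
    sampling each column independently (with replacement) from that
    column's observed values.
    """
    rng = random.Random(seed)
    n_rows = len(observed)
    columns = list(zip(*observed))  # column-wise view of the data
    synthetic = [
        tuple(rng.choice(col) for col in columns)
        for _ in range(n_rows)
    ]
    rows = [tuple(r) for r in observed] + synthetic
    labels = [1] * n_rows + [0] * n_rows
    return rows, labels


# Hypothetical toy data: two strongly dependent variables.
observed = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
rows, labels = make_rf_training_set(observed)
```

A random forest classifier would then be trained on `rows` and `labels`, and the dissimilarity between two observed samples would be derived from how often they land in the same terminal node across the trees. That part is left out here because it depends on the forest implementation you use.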

Friday, March 17, 2006

April 10th, 10am-11am [HMB13.356] - Gene Mapping and Marker Clustering Using Shannon's Mutual Information

This second entry in the new Bioinformatics journal club series will be presented by Brad Broom.

Here's the abstract to pique your interest:

January-March 2006 (Vol. 3, No. 1) pp. 47-56
Gene Mapping and Marker Clustering Using Shannon's Mutual Information

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TCBB.2006.9

ABSTRACT

Finding the causal genetic regions underlying complex traits is one of the main aims in human genetics. In the context of complex diseases, which are believed to be controlled by multiple contributing loci of largely unknown effect and position, it is especially important to develop general yet sensitive methods for gene mapping. We discuss the use of Shannon's information theory for population-based gene mapping of discrete and quantitative traits and for marker clustering. Various measures of mutual information were employed in order to develop a comprehensive framework for gene mapping analyses. An algorithm aimed at finding so-called relevance chains of causal markers is proposed. Moreover, entropy measures are used in conjunction with multidimensional scaling to visualize clusters of genetic markers. The relevance chain algorithm successfully detected the two causal regions in a simulated scenario. The approach has also been applied to a published clinical study on autoimmune (Graves') disease. Results were consistent with those of standard statistical methods, but identified an additional locus of interest in the promoter region of the associated gene CTLA4. The developed software is freely available at http://www.lnt.ei.tum.de/download/InfoGeneMap/.
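
For anyone who wants a refresher before the meeting, the basic quantity behind the paper is Shannon's mutual information between a marker and a trait, I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) ). Below is a minimal Python sketch for discrete data using plain frequency estimates. It is not the InfoGeneMap software and it does not implement the relevance chain algorithm; the function name and the toy genotype and trait values are made up.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Estimate I(X;Y) in bits from paired discrete observations."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy data: a marker that fully determines a binary trait,
# and a second marker that is unrelated to it.
trait = [1, 1, 0, 0]
linked_marker = ["AA", "AA", "aa", "aa"]
unlinked_marker = ["AA", "aa", "AA", "aa"]

mi_linked = mutual_information(linked_marker, trait)
mi_unlinked = mutual_information(unlinked_marker, trait)
```

A marker that carries information about the trait scores high, while an unrelated marker scores near zero. With real samples the estimate is biased upward for small sample sizes, so it should be compared against a permutation baseline rather than read as an absolute number.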


The next meeting is April 17th. The paper is not set yet, so send your suggestions.

March 20, 2006, 10-11am [HMB13.356]: follicular lymphoma biomarkers, a classic

"Prediction of survival in follicular lymphoma based on molecular features of tumor-infiltrating immune cells." by Sandeep S. Dave, ..., Louis M. Staudt, N Engl J Med. 2004 Nov 18;351(21):2159-69, is our first entry. The presenter will be Li Zhang, who will start the new biomarker journal club series with a classic.

Here's the abstract to pique your interest:

Background
Patients with follicular lymphoma may survive for periods of less than 1 year to more than 20 years after diagnosis. We used gene-expression profiles of tumor-biopsy specimens obtained at diagnosis to develop a molecular predictor of the length of survival.
Methods
Gene-expression profiling was performed on 191 biopsy specimens obtained from patients with untreated follicular lymphoma. Supervised methods were used to discover expression patterns associated with the length of survival in a training set of 95 specimens. A molecular predictor of survival was constructed from these genes and validated in an independent test set of 96 specimens.
Results
Individual genes that predicted the length of survival were grouped into gene-expression signatures on the basis of their expression in the training set, and two such signatures were used to construct a survival predictor. The two signatures allowed patients with specimens in the test set to be divided into four quartiles with widely disparate median lengths of survival (13.6, 11.1, 10.8, and 3.9 years), independently of clinical prognostic variables. Flow cytometry showed that these signatures reflected gene expression by nonmalignant tumor-infiltrating immune cells.
Conclusions
The length of survival among patients with follicular lymphoma correlates with the molecular features of nonmalignant immune cells present in the tumor at diagnosis.


The next meeting is April 3rd. The paper is not set yet, so send your suggestions.