Title: Seven clusters in genomic triplet distributions

Abstract: Several recent papers have proposed gene-detection algorithms that find protein-coding regions without requiring a learning dataset of already known genes. The fact that unsupervised gene detection is possible is closely connected to the existence of a cluster structure in oligomer frequency distributions. In this paper we study the cluster structure of several genomes in the space of their triplet frequencies, using a pure data exploration strategy. Several complete genomic sequences were analyzed by visualizing tables of triplet frequencies computed in a sliding window. The distribution of 64-dimensional vectors of triplet frequencies displays a well-detectable cluster structure. The structure was found to consist of seven clusters: six correspond to protein-coding information in each of the three possible phases on either of the two complementary strands, and one corresponds to the non-coding regions. The clusters match these categories with high accuracy (higher than 90% on the nucleotide level). Visualizing and understanding the structure makes it possible to analyze effectively the performance of different gene-prediction tools. Since the method does not require extraction of ORFs, it can be applied even to unassembled genomes. The information content of the triplet distributions and the validity of the mean-field models are analyzed.
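As an illustration of the data the abstract refers to, the following is a minimal sketch of how 64-dimensional triplet-frequency vectors might be computed in a sliding window. It is not the authors' software. It assumes that triplets are counted without overlap starting from the first position of each window; the function names, the ordering of the 64 triplets, and the default `window` and `step` values are hypothetical choices made only for this example.

```python
from itertools import product

# All 64 triplets over the alphabet ACGT; this ordering is an arbitrary choice.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def triplet_frequencies(fragment):
    """Return 64 frequencies of non-overlapping triplets read from position 0."""
    counts = [0] * 64
    total = 0
    for i in range(0, len(fragment) - 2, 3):
        idx = INDEX.get(fragment[i:i + 3])
        if idx is not None:  # skip triplets containing symbols such as N
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * 64
    return [c / total for c in counts]


def sliding_window_vectors(sequence, window=300, step=3):
    """Return one 64-dimensional frequency vector per window position.

    The default window width and step are illustrative assumptions.
    """
    sequence = sequence.upper()
    return [
        triplet_frequencies(sequence[start:start + window])
        for start in range(0, len(sequence) - window + 1, step)
    ]
```

Each returned vector is one point in the 64-dimensional space described above. Under the stated assumption, moving the window start by one nucleotide changes the phase in which triplets are read, which gives an intuition for why a coding region could produce a different point for each of the three phases.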
Comments: Correction of URL. 16 pages, 5 figures. The software and datasets are available at this http URL and this http URL. Paper also available at this http URL
Subjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Computer Vision and Pattern Recognition (cs.CV); Biological Physics (physics.bio-ph); Data Analysis, Statistics and Probability (physics.data-an); Genomics (q-bio.GN)
Journal reference: In Silico Biology, 3 (2003), 0039, 471-482
Cite as: arXiv:cond-mat/0305681 [cond-mat.dis-nn]
  (or arXiv:cond-mat/0305681v4 [cond-mat.dis-nn] for this version)

Submission history

From: Gorban
[v1] Thu, 29 May 2003 11:36:34 GMT (577kb)
[v2] Wed, 14 Apr 2004 17:01:56 GMT (527kb)
[v3] Mon, 1 Nov 2004 11:08:03 GMT (527kb)
[v4] Tue, 23 Nov 2004 13:09:00 GMT (542kb)
