A strategy for predicting gene functions from genome and metagenome sequences on the basis of oligopeptide frequency distance

Open access; 25 references
  • Abe Takashi
    Department of Information Engineering, Faculty of Engineering, Niigata University
  • Ikarashi Ryo
    Department of Information Engineering, Faculty of Engineering, Niigata University
  • Mizoguchi Masaya
    Department of Information Engineering, Faculty of Engineering, Niigata University
  • Otake Masashi
    Department of Information Engineering, Faculty of Engineering, Niigata University
  • Ikemura Toshimichi
    Department of Bioscience, Nagahama Institute of Bio-Science and Technology

Abstract

<p>Extensive decoding of genomic and metagenomic sequence data has produced a large and growing number of genes whose functions cannot be predicted by sequence similarity searches, and such genes are of little use to science or industry. Current genome and metagenome sequencing depends largely on high-throughput, low-cost methods. When sequencing the genome of a single species, high-density sequencing can reduce sequencing errors. For metagenome sequences, however, high-density sequencing does not necessarily improve sequence quality, because multiple, unknown genomes, including those of closely related species, are likely to be present in the sample. A function prediction method that is robust against sequence errors is therefore increasingly needed. Here, we present a method for predicting protein gene function that does not depend on sequence similarity searches. Using an unsupervised machine learning method called BLSOM (batch-learning self-organizing map) applied to short oligopeptide frequencies, we previously developed a sequence alignment-free method that clusters bacterial protein genes according to clusters of orthologous groups of proteins (COGs), without using COG information during machine learning. This method allows function-unknown proteins to cluster with function-known proteins based solely on similarity in oligopeptide frequency, but it required high-performance supercomputers (HPCs). Drawing on the wide range of knowledge obtained with HPCs, we have now developed a strategy that correlates function-unknown proteins with COG categories using only oligopeptide frequency distances (OPDs), and that can be run on PC-level computers. The OPD strategy is suitable for predicting the functions of proteins with low sequence similarity and is applied here to predict the functions of a large number of gene candidates discovered by metagenome sequencing.</p>
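
The abstract does not specify the oligopeptide length, the distance measure, or the rule used to assign a COG category, so the following is only a minimal sketch of the general idea of an alignment-free comparison. It assumes dipeptide (k = 2) frequencies, Euclidean distance, and nearest-neighbor assignment; the function names, the toy sequences, and the category labels are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import product

# The 20 standard amino acids; windows containing other symbols are skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def oligopeptide_frequencies(seq, k=2):
    """Return normalized k-peptide frequencies as a list of length 20**k.

    ASSUMPTION: k=2 (dipeptides) is an illustrative choice only.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if all(c in AMINO_ACIDS for c in window):
            counts[window] += 1
    total = sum(counts.values())
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    if total == 0:
        return [0.0] * len(keys)
    return [counts[key] / total for key in keys]


def oligopeptide_distance(freq_a, freq_b):
    """Euclidean distance between two frequency vectors.

    ASSUMPTION: the source does not state which distance defines an OPD.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(freq_a, freq_b)))


def predict_category(query_seq, references, k=2):
    """Assign the category of the nearest reference protein.

    `references` is a list of (sequence, category) pairs.
    ASSUMPTION: nearest-neighbor assignment is a stand-in for the
    authors' actual strategy, which the abstract does not describe.
    Returns (category, distance).
    """
    query_freq = oligopeptide_frequencies(query_seq, k)
    best_category, best_distance = None, math.inf
    for ref_seq, category in references:
        d = oligopeptide_distance(query_freq, oligopeptide_frequencies(ref_seq, k))
        if d < best_distance:
            best_category, best_distance = category, d
    return best_category, best_distance


if __name__ == "__main__":
    # Toy sequences with made-up category labels, for illustration only.
    toy_references = [("AAAKKKAAKK", "J"), ("WWWYYYWWYY", "C")]
    print(predict_category("AAAAKKKK", toy_references))
```

Because no alignment is computed, a single substitution or frameshift-induced change alters only a few k-peptide counts rather than breaking a match outright, which is one plausible reading of why a frequency-based approach would tolerate sequence errors. Each comparison costs only a vector distance, which is consistent with the abstract's statement that the OPD strategy runs on PC-level computers.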
