RNA-Seq gene expression estimation with read mapping uncertainty

  • Bo Li
  • Victor Ruotti
  • Ron M. Stewart
  • James A. Thomson
  • Colin N. Dewey

    1 Department of Computer Sciences, University of Wisconsin, Madison, WI 53706, 2 Morgridge Institute for Research, Madison, WI 53707 and 3 Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI 53706, USA

Abstract

Motivation: RNA-Seq is a promising new technology for accurately measuring gene expression levels. Expression estimation with RNA-Seq requires the mapping of relatively short sequencing reads to a reference genome or transcript set. Because reads are generally shorter than the transcripts from which they are derived, a single read may map to multiple genes and isoforms, complicating expression analyses. Previous computational methods either discard reads that map to multiple locations or allocate them to genes heuristically.

Results: We present a generative statistical model and associated inference methods that handle read mapping uncertainty in a principled manner. Through simulations parameterized by real RNA-Seq data, we show that our method is more accurate than previous methods. Our improved accuracy is the result of handling read mapping uncertainty with a statistical model and the estimation of gene expression levels as the sum of isoform expression levels. Unlike previous methods, our method is capable of modeling non-uniform read distributions. Simulations with our method indicate that a read length of 20–25 bases is optimal for gene-level expression estimation from mouse and maize RNA-Seq data when sequencing throughput is fixed.

Availability: An initial C++ implementation of our method that was used for the results presented in this article is available at http://deweylab.biostat.wisc.edu/rsem.

Contact: cdewey@biostat.wisc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
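The two ideas named in the Results paragraph, allocating ambiguously mapped reads with a statistical model and computing gene expression as the sum of isoform expression, can be illustrated with a toy sketch. This is not the authors' model or their C++ implementation. It is a minimal expectation-maximization loop that assumes every alignment of a read is equally plausible a priori, and it ignores transcript length, sequencing errors, and non-uniform read distributions. The function names `em_isoform_fractions` and `gene_fractions` and the example data are hypothetical.

```python
from collections import defaultdict


def em_isoform_fractions(read_alignments, n_isoforms, n_iter=200):
    """Toy EM estimate of the fraction of reads originating from each isoform.

    read_alignments[r] lists the isoform indices that read r aligns to.
    Assumption (not from the article): all alignments of a read are equally
    likely a priori, and transcript length effects are ignored.
    """
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        expected = [0.0] * n_isoforms
        for isoforms in read_alignments:
            if not isoforms:
                continue  # skip unaligned reads
            total = sum(theta[i] for i in isoforms)
            # E-step: split the read among its candidate isoforms
            # in proportion to the current abundance estimates.
            for i in isoforms:
                expected[i] += theta[i] / total
        # M-step: renormalize the expected counts into fractions.
        n_assigned = sum(expected)
        theta = [c / n_assigned for c in expected]
    return theta


def gene_fractions(theta, isoform_to_gene):
    """Gene-level abundance as the sum of its isoforms' abundances."""
    genes = defaultdict(float)
    for isoform, gene in enumerate(isoform_to_gene):
        genes[gene] += theta[isoform]
    return dict(genes)


# Hypothetical data: isoforms 0 and 1 belong to gene "A", isoform 2 to gene "B".
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4 + [[2]] * 3
theta = em_isoform_fractions(reads, n_isoforms=3)
genes = gene_fractions(theta, ["A", "A", "B"])
print(theta, genes)
```

In this sketch the four reads shared by isoforms 0 and 1 are neither discarded nor split evenly. They are divided according to the abundances implied by the uniquely mapped reads, and the gene-level estimate is unaffected by how they are divided because both isoforms belong to the same gene.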

Published in

  • Bioinformatics

    Bioinformatics 26 (4), 493-500, 2009-12-18

    Oxford University Press (OUP)

Cited by (24)
