Wednesday, February 19, 2014

Article Review: Systematic evaluation of spliced alignment programs for RNA-seq data

I was recently asked by a colleague to provide some feedback on a Nature Methods paper by Engström et al.  I remember seeing several links to this article when it was first published, so I figured others may also be interested in seeing my take on the paper.

  • Tested lots of programs
  • Used several benchmarks

  • No experimental validation and/or cross-platform/protocol comparison (for example, Figure 6 defines accuracy based upon overlap with known exon junctions).
  • I think qPCR validation (or microarray data, spike-ins, etc.) would be useful for comparing gene expression levels - for example, see the validation in Rapaport et al. 2013.
  • Limited empirical test data (E-MTAB-1728 for processed values; ERR033015 / ERR033016 for raw data; total n = 2): 1 human cell line sample, 1 mouse brain sample, and simulated data.
    • In contrast, I ran differential expression benchmarks (Warden et al. 2013) on 2-group comparisons with much more data: a patient cohort with over 100 samples (ERP001058) as well as a 2-group cell line comparison with triplicates (SRP012607).  The cell line results in my paper also briefly compared RNA-Seq to microarray data.
  • Accordingly, there are no gene list comparisons, even though I think gene expression analysis is probably the most popular type of RNA-Seq analysis.
  • Used a strand-specific protocol - I'm not sure how robust the findings are for other protocols.  For example, I think a lot of the data currently being produced is not strand-specific.
  • Only compared paired-end alignments, but (for gene expression analysis) single-end data is probably the most common and is technically sufficient (I can't recall the best citation for this, but the Warden et al. paper shows STAR single-end and paired-end results to be quite similar).  Results may differ for PE versus SE alignments: for example, this was the case with Novoalign but not really for STAR; to be fair, though, this particular difference could have been anticipated from the Novocraft website.

In practice, I would probably choose between TopHat and STAR (two of the most popular options), and I would say this paper confirms my previous benchmarks showing that the two programs are more or less comparable.  When I tested STAR, I noticed some formatting issues: for example, I think the recommended settings weren't sufficient to get its output to work with cufflinks, and I think Partek had to do some re-processing to produce the stats in our paper.  I assume these problems should be fixable (and I see no technical problem with STAR), but this is why I haven't already switched from TopHat to STAR on a regular basis.

One result I found potentially interesting is that it seems STAR may be better than TopHat for variant calling (none of the analysis in the paper that I published can address this question).  However, I would want to see some true validation results, and I think most users are not concerned with this (and even fewer have paired DNA-Seq and RNA-Seq data to distinguish genomic variants from RNA-editing events).

To be fair, I don't think this paper was designed to provide the type of benchmarks I was most interested in seeing.  However, I think there was still room to predict testable hypotheses and define accuracy with validation experiments.  For example, the authors could have checked how the aligners affect splicing events predicted by tools like MATS, MISO, etc. (assuming they produced the samples used in the benchmarks; alternatively, it wouldn't have been too hard to produce some new data specifically so that validation experiments could be performed).

Plus, there was a second paper published in the same issue with a number of the same authors (Steijger et al. 2013), so maybe this paper isn't really meant to be read in isolation.  For example, that other paper seems to report considerable discrepancies in isoform-level distributions (which matches my own experience that gene-level abundance is preferable for differential expression and that splicing event predictions seem more reliable than whole-transcript predictions).  In short, I would certainly recommend reading both papers - in addition to others like Rapaport et al. 2013, Liu et al. 2014, Seyednasrollah et al. 2013, Warden et al. 2013, etc.

Tuesday, February 18, 2014

mRNA Quantification via eXpress

eXpress is a tool that quantifies mRNA using a set of transcripts as a reference (as opposed to popular RNA-Seq tools like TopHat, which align reads to a genome and have to model gaps caused by exon junctions).

Using transcripts rather than genomic chromosomes as reference sequences is actually how I imagined RNA-Seq analysis would be conducted, before I learned about standard practices.  In fact, samtools provides an 'idxstats' function that can be used to calculate normalized RPKM expression values.  So, I was curious whether the extra modeling done by eXpress is really any better than this simple sort of RPKM calculation: a more complicated model can potentially improve accuracy, but it can also leave extra room for things to go wrong, lead to over-fitting, etc.  For example, I have used eXpress on some de novo assembly data, and I actually found that normal de novo programs seemed to provide better results than those specifically designed for RNA-Seq data (to be clear, though, I think the results of this blog post emphasize that the problem was with the assembly and not the mRNA quantification, as I would have expected).
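For reference, here is a minimal sketch of that simple RPKM calculation from 'samtools idxstats' output (the function name and the toy counts below are hypothetical, chosen only for illustration; idxstats reports one line per reference sequence with its length, mapped read count, and unmapped read count):

```python
# Sketch: RPKM from `samtools idxstats`-style rows (made-up example values).
# Each row: (transcript_name, transcript_length_bp, mapped_reads, unmapped_reads)

def rpkm_from_idxstats(rows):
    """Return {transcript_name: RPKM} from idxstats-style tuples."""
    total_mapped = sum(r[2] for r in rows)
    rpkm = {}
    for name, length_bp, mapped, _unmapped in rows:
        if length_bp == 0:
            continue  # skip the trailing "*" line for unmapped reads
        # reads per kilobase of transcript per million mapped reads
        rpkm[name] = mapped / (length_bp / 1000.0) / (total_mapped / 1.0e6)
    return rpkm

# Toy example (made-up numbers): the second transcript is half the length
# with the same read count, so its RPKM comes out twice as high.
rows = [("NM_000001", 2000, 500, 0), ("NM_000002", 1000, 500, 0), ("*", 0, 0, 100)]
values = rpkm_from_idxstats(rows)
```

Note that this is exactly the kind of length-and-depth normalization that has no awareness of reads that map ambiguously between similar transcripts, which is where the extra modeling in eXpress comes in.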

The short answer is "Yes" - I think it is better to use eXpress over idxstats for calculating RPKM/FPKM values.

To illustrate this, first take a look at the correlations between the eXpress FPKM values and the RPKM values calculated using idxstats:

The correlation isn't horrible, but you can see a non-trivial number of genes whose expression levels are consistently lower in eXpress than in idxstats.  However, this by itself doesn't really prove one option is better than the other.  Because I feel comfortable with the gene-level mRNA quantification from cufflinks (and the RSEM-like algorithm implemented in Partek; for example, see Figure 5 in this paper or click here to see a direct correlation between these two results), I decided to see how the results compared when using different tools with a transcript-based reference (eXpress, idxstats) versus a genomic/chromosome-based reference (cufflinks, Partek).

Again, you see these outliers if you compare the idxstats results to cufflinks (or to Partek - click here for those results):

However, you don't see these outliers when comparing eXpress to cufflinks (or to Partek - again, click here for those results):

So, eXpress clearly provides more robust results than the simpler idxstats calculation.  You can also see this in the box plot below, showing the correlation coefficients for all the mRNA quantification strategies that I tested.

Of course, systematic differences between mRNA quantification methods should (at least partially) cancel out when identifying differentially expressed genes between two groups (because the differences affect both groups).  However, there are certain circumstances where the mRNA quantification levels may be used in isolation, such as ranking the most highly expressed genes in a sample (as was the case for the de novo assembly data that I worked with).  In these situations, I would definitely recommend a tool like eXpress over trying to calculate RPKM values from tools like idxstats.

FYI, here are some details on the methodology for this comparison:
  • MiSeq samples from GSE37703 were used for these comparisons.
  • Correlations were calculated using log2(FPKM/RPKM + 0.1) expression values.
  • eXpress and idxstats were run on Bowtie2 alignments against the same set of RefSeq transcripts (downloaded from the UCSC Genome Browser, with duplicated gene IDs removed).  The Partek EM algorithm used a set of RefSeq sequences provided by the vendor, and cufflinks used the genes.gtf file downloaded from iGenomes on the TopHat website.  Only commonly represented gene symbols were used for calculating correlations, and only genes declared "solvable" by eXpress were considered.  As an example, click here to view a Venn diagram of overlapping gene symbols for SRR493372.
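To make the correlation step concrete, here is a minimal sketch of the log2(FPKM/RPKM + 0.1) transformation followed by a plain Pearson correlation (the FPKM values below are made up for illustration; the actual comparisons used the tools and samples listed above):

```python
import math

def log2_offset(values, offset=0.1):
    # log2(FPKM/RPKM + 0.1), matching the transformation described above
    return [math.log2(v + offset) for v in values]

def pearson(x, y):
    # Plain Pearson correlation coefficient between two equal-length vectors
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy example with made-up FPKM/RPKM values from two hypothetical methods;
# the 0.1 offset keeps zero-expression genes defined on the log2 scale.
method_a = [0.0, 1.5, 10.0, 200.0]
method_b = [0.0, 2.0, 8.0, 150.0]
r = pearson(log2_offset(method_a), log2_offset(method_b))
```

The offset choice matters mostly for genes near zero expression, which is also where the eXpress and idxstats values tended to diverge.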
P.S. It looks like you may have to be signed into Google Docs to view the image previews properly.  However, you can always download the files to view them locally.
Creative Commons License
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.