Thursday, April 17, 2014

Differential Expression Without A Reference Genome

I've noticed a lot of Biostar questions about differential expression analysis following de novo assembly of RNA-Seq data, so I wanted to collect my suggestions in a blog post.

It is tempting to assemble each sample separately and then map the assembled transcripts between samples for differential expression, but I wouldn't recommend this because there often won't be a 1:1 mapping between transcripts assembled from different samples.  Instead, these are my suggestions:

Differential Expression Strategies:

  • Use one assembly (either from a control sample or from a pooled collection of reads from all samples).  Then, align the reads from each sample against this assembly with a reference-based aligner (like Bowtie or BWA).  You can quantify mRNA abundance with a tool like eXpress, and then use your favorite differential expression tool (among the popular options, I would recommend DESeq or limma).
  • Use a k-mer-based option (such as NIKS or RUFUS).  Here, the idea is to look for differentially represented k-mers and then perform de novo assembly on only the k-mers that differ between samples/groups.
  • I haven't tried it, but Corset looks like an interesting option.
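
As a rough illustration of the first strategy: once each sample has been aligned to the shared assembly and quantified, the per-sample tables need to be merged into one contig-by-sample count matrix before running a tool like DESeq. Here is a minimal Python sketch of that merging step. The column names (`target_id`, `eff_counts`) and the toy tables are assumptions for illustration, so check them against the output of whatever quantification tool you actually use. The counts are rounded because DESeq expects integers.

```python
import csv
import io


def read_counts(handle, id_col="target_id", count_col="eff_counts"):
    """Return {contig_id: rounded count} from one tab-delimited table.

    The column names are assumptions; adjust them to match your files.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: int(round(float(row[count_col]))) for row in reader}


def build_count_matrix(tables):
    """Merge {sample: {contig: count}} into (header, rows), one row per contig.

    Contigs missing from a sample's table are given a count of 0.
    """
    samples = sorted(tables)
    contigs = sorted(set().union(*(tables[s] for s in samples)))
    header = ["contig"] + samples
    rows = [[c] + [tables[s].get(c, 0) for s in samples] for c in contigs]
    return header, rows


# Toy tables standing in for per-sample quantification output (made-up values).
control = io.StringIO("target_id\teff_counts\ncontig_1\t10.4\ncontig_2\t0.0\n")
treated = io.StringIO("target_id\teff_counts\ncontig_1\t3.6\ncontig_3\t7.2\n")

tables = {"control": read_counts(control), "treated": read_counts(treated)}
header, rows = build_count_matrix(tables)

# Print a tab-delimited matrix that could be loaded into R.
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```

Because every sample is quantified against the same assembly, the contig IDs line up across samples, which is what you lose if you assemble each sample separately.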


De Novo Assembly Strategies:

  • I actually found CLC de novo to be the best de novo assembly tool, even though it wasn't specifically designed for RNA-Seq data.  It also automatically provides contig coverage statistics.
    • In my case, I judged the quality of the results based upon the most highly expressed contigs in various tissues (looking for genes that you would expect to be highly expressed in those tissue types).
  • Among the open-source RNA-Seq de novo assembly options, I would recommend Oases.  In fact, you might find the merged Velvet contigs to be more useful than the transcripts (either way, you will have access to both options)
  • I would recommend against using Trinity, even though it is a popular option.  In my personal experience, it often stitches together contigs from different genes, producing many very large transcripts (some fusion genes should occur, but not at the rate at which I saw large transcripts in Trinity).
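
If you want a quick sanity check on an assembly, one simple option is to summarize the contig lengths and list any unusually long contigs for a closer look (for example, by BLAST). The Python sketch below is only an illustration: the function names, the toy sequences, and the length cutoff are all hypothetical, and a long contig is not necessarily a chimeric one.

```python
import io


def read_fasta_lengths(handle):
    """Return {sequence_id: length} for a FASTA file handle."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def long_contigs(lengths_by_id, cutoff):
    """Return the sorted IDs of contigs longer than a (hypothetical) cutoff."""
    return sorted(name for name, length in lengths_by_id.items() if length > cutoff)


# Toy assembly with made-up sequences; replace with open("your_assembly.fa").
fasta = io.StringIO(
    ">contig_1 example\nACGTACGT\nACGT\n>contig_2\nACGT\n>contig_3\n" + "A" * 30 + "\n"
)
lengths_by_id = read_fasta_lengths(fasta)
assembly_n50 = n50(lengths_by_id.values())
flagged = long_contigs(lengths_by_id, cutoff=20)

print("contigs:", len(lengths_by_id))
print("N50:", assembly_n50)
print("longer than cutoff:", flagged)
```

Running the same summary on the output of each assembler gives a rough side-by-side view of how many very large contigs each one produces.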

Relevant Biostar Posts:

My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.