Sunday, June 8, 2014

Questions About Genetic Testing

I've recently been asked some questions about genetic testing, some of which I think are worth mentioning in a public discussion (although please see the note above if you are interested in getting specific recommendations):

1) How can I use my DNA sequence to inform my decision making about reducing my risk of getting a disease?

In many cases, I think the actions that you can take to reduce disease risk are relatively generic (exercise, eat lots of fruits and vegetables, etc.), and these are things that you should do regardless of your genotype.

That said, there certainly are some circumstances where very specific action can be taken, based upon your genome sequence.  I am not personally aware of all such examples, and I would recommend talking to a medical professional (such as a genetic counselor) for more information.  If this is the only type of information that you wish to learn about your genome, then you may only benefit from determining the sequence for a small portion of your genome (and your family history can guide the likelihood of needing to perform any genetic tests in a clinical setting).

2) How does Promethease compare to the health reports from 23andMe?  Does Promethease mostly focus on rare mutations?

Promethease is based upon annotations that come from SNPedia (similar to Wikipedia, but specifically for mutation annotations).  So, I would expect the content to depend on whatever information is entered by volunteers.  Given the amount of information that I see from my 23andMe data (which is mostly common variants), I would say it contains a large amount of information on common variants.

When I first ran Promethease, I remembered mostly seeing risk annotations (which you can see in my old post, comparing my top 23andMe risk associations), and I didn't remember seeing anything like the carrier status report.  For example, I didn't remember seeing anything reporting me as a carrier for cystic fibrosis, which is an example of a 23andMe result that was in good concordance with my family history.  However, I went back to check my specific mutation (394delTT), which has the probe ID i4000313.  This particular probe ID makes it a bit harder to match the mutation.  However, if I Google the probe ID, I can see that it is included in SNPedia but without a detailed description.  If I look up 394delTT in ClinVar, then I can get more information about this variant and I can see that it corresponds to rs121908769.  If I then look this variant up in SNPedia, I can see that it provides some information from ClinVar (although there is nothing in the brief main text mentioning cystic fibrosis), but I don't see it among any of the "cystic fibrosis" variants in my old Promethease report.

However, to be fair, I wanted to re-run my sample through Promethease to see if the reporting system has changed and/or confirm that it now recognizes this mutation in my data.  It appears that many new features have been added within the last 3+ years, such as a more interactive interface (viewed by clicking "UI version 2" in the downloadable report).  Additionally, I can tell that the "medicines" and "medical conditions" sections have been expanded to include more SNPs.  However, it still doesn't recognize my cystic fibrosis carrier status, so it can't simply be considered a replacement for the old 23andMe report.

Also, in general, the information available in Promethease tends to be terse, and it won't contain the same level of detail for explaining basic concepts as would have been provided by the old 23andMe health report.

3) More specifically, I do not wish to see a doctor in order to obtain my genetic information.  Also, I want something clear and easy to understand, which doesn't require doing any additional research.  What are my options, now that 23andMe no longer offers health reports?

I am not aware of any direct-to-consumer test that provides information that is comparable to the old 23andMe health reports, and I am not aware of any tool to analyze your raw 23andMe data that will reproduce your old 23andMe health report (especially not with all the details to help make the results easier to understand).  I believe that even Illumina's Understand Your Genome program requires meeting with a doctor to draw your blood, conduct a predisposition screen, and discuss your results.

Perhaps more importantly, I think it is worth emphasizing that the health reports previously offered by 23andMe were not really a single report: they were updated periodically as new findings were published in the genomics literature, so your estimated risk would change over time for many traits.  This should be generally true for any tool connecting you to the genomics literature, as also demonstrated for the Promethease example above.

If you were only interested in the subset of results that are unlikely to change, I think you would probably have been most interested in the carrier status reports (and a handful of the other reports).  There are tests like Counsyl that I would expect to be similar to the 23andMe carrier status results (and it is something that I would want to check out, if I were planning on having a child), but I believe that you can only get that test through your doctor.  However, this is just one example: I would recommend talking to a doctor or genetic counselor if you want more specific guidance.

In my opinion, I am most uncomfortable with the request to get a result that "doesn't require doing any additional research".  Critical thinking and being able to synthesize your own opinion from multiple sources of information are important skills that should be part of everyday life, and I think "additional research" is especially important for tools designed for "research and educational purposes" (including 23andMe, Promethease, Interpretome, etc.).  For example, clinical action may be limited because 1) all the genetic influences of disease risk are not known, 2) ways to mitigate genetic risk may not be known, and 3) one current limitation to low-cost options like SNP chips is that you aren't measuring your entire genome sequence (so, some important sequences may not be covered).  This might be a problem for some people, but I think it is still OK for many people.

In other words, there certainly have been some cases where people discovered important findings from their 23andMe reports that were worth verifying in a clinical setting, but I think most 23andMe customers took no medical action based upon their reports.  Most importantly, "no medical action" need not equate to "dissatisfied": I would personally fall into the category of a customer who was "very satisfied" yet has not changed my behavior because of any of the results.  I think trying to understand how your biology is influenced by your genome sequence is a life-long goal that will probably never be fully realized, but I think there is value in being able to understand on-going genome research through the context of your own genome.

Monday, June 2, 2014

Getting Advice About Genetic Testing

I recently received a call at work from an individual asking for advice about genetic testing.  While I welcome questions and comments on my blog posts, I don't think this was an appropriate course of action.  I don't think it is a huge problem (this is the first time this has happened in the 3+ years since I wrote my most commonly viewed blog post), and I think this individual raised some interesting questions.  However, I think it may be important to provide a brief description of myself and the role of my blog:

 I am an analyst that often studies (de-identified) patient samples for biomedical research, and it is probably safe to say I have an above-average knowledge about genetics.  However, I am not a medical professional, and I never provide consulting about genetic testing at any point during my job.  In fact, a large portion of my research involves projects that do not look for mutations in a subject's DNA sequence.  In other words, everything I write on my blog about genetic testing is my own personal perspective, independent of any professional responsibilities.

Of course, that leaves the obvious question:

Where can I find someone to discuss genetic testing options?

I would recommend contacting a genetic counselor if you wish to talk to somebody one-on-one about genetic testing options and what action can be taken as a result of those tests.  The National Society of Genetic Counselors has a tool to find a genetic counselor near you:

Again, I should emphasize that I am not a genetic counselor myself: if you are a genetic counselor or medical geneticist with a better recommendation, I would encourage you to post a comment to this blog post.

Thursday, April 17, 2014

Differential Expression Without A Reference Genome

I've noticed a lot of Biostar questions related to conducting differential expression following de novo assembly of RNA-Seq data, so I wanted to create a blog post with a collection of my suggestions.

It is tempting to want to map the assembled transcripts between samples for differential expression, but I wouldn't recommend this because there will often not be a 1:1 mapping between assembled transcripts in different samples.  Instead, these are my suggestions:

Differential Expression Strategies:

  • Use one assembly (either from a control sample or a pooled collection of reads from all samples).  Then, use a reference based alignment (using an aligner like Bowtie or BWA) against this assembly for each sample.  You can perform mRNA quantification using a tool like eXpress, and then you can use your favorite differential expression tool (I would recommend DESeq or limma, among the popular options)
  • Use a kmer-based option (like NIKS, RUFUS, etc.).  Here, the idea is to look for differentially represented kmers and then perform de novo assembly on only the kmers that differ between samples/groups.
  • I haven't tried it, but Corset looks like an interesting option
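To make the kmer-based idea above a bit more concrete, here is a minimal toy sketch in Python.  To be clear, this is not how NIKS or RUFUS are actually implemented: the function names, the fold-change threshold, and the pseudocount are all assumptions that I made up for illustration, and a real tool would also need to handle reverse complements, sequencing errors, and replicate-level statistics.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def differential_kmers(reads_a, reads_b, k=5, min_fold=4.0, pseudocount=1.0):
    """Return {kmer: fold} for k-mers whose relative frequency differs
    between two samples by at least min_fold in either direction.

    Hypothetical helper: the threshold and pseudocount are illustrative
    assumptions, not values taken from any published tool.
    """
    counts_a = count_kmers(reads_a, k)
    counts_b = count_kmers(reads_b, k)
    # Normalize by the total number of k-mers so samples with
    # different sequencing depth can be compared.
    total_a = sum(counts_a.values()) or 1
    total_b = sum(counts_b.values()) or 1

    selected = {}
    for kmer in set(counts_a) | set(counts_b):
        # The pseudocount avoids division by zero for sample-specific k-mers.
        freq_a = (counts_a[kmer] + pseudocount) / total_a
        freq_b = (counts_b[kmer] + pseudocount) / total_b
        fold = freq_a / freq_b
        if fold >= min_fold or fold <= 1.0 / min_fold:
            selected[kmer] = fold
    return selected
```

The reads containing the selected kmers would then be the input for de novo assembly, so you only assemble the sequence that actually differs between groups.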


  • I actually found CLC de novo to be the best de novo assembly tool, even though it wasn't specifically designed for RNA-Seq data.  It also automatically provides contig coverage statistics.
    • In my case, I defined the quality of the results based upon the most highly expressed contigs in various tissues (looking for genes that you could expect to be highly expressed in those different tissue types)
  • Among the open-source RNA-Seq de novo assembly options, I would recommend Oases.  In fact, you might find the merged Velvet contigs to be more useful than the transcripts (either way, you will have access to both options)
  • I would recommend against using Trinity, even though that is a popular option.  Based upon my personal experience, I would say that it often stitches together contigs from different genes, producing many very large transcripts (some fusion genes should occur, but not at the rate I saw large transcripts in Trinity)

Relevant Biostar Posts:

Wednesday, February 19, 2014

Article Review: Systematic evaluation of spliced alignment programs for RNA-seq data

I was recently asked by a colleague to provide some feedback on a Nature Methods paper by Engström et al.  I remember seeing several links to this article when it was first published, so I figure others may also be interested in seeing my take on the paper.

  • Tested lots of programs
  • Used several benchmarks

  • No experimental validation and/or cross-platform/protocol comparison (for example, Figure 6 defines accuracy based upon overlap with known exon junctions).
    • I think qPCR validation (or microarray data, spike-ins, etc.) would be useful to compare gene expression levels - for example, see validation from Rapaport et al. 2013.
  • Limited empirical test data (E-MTAB-1728 for processed values; ERR033015 / ERR033016 for raw data; total n = 2): 1 human cell line sample, 1 mouse brain sample, and simulated data.
    • In contrast, I ran differential expression benchmarks (Warden et al. 2013) comparing 2-group comparisons with much more data: patient cohort with over 100 samples (ERP001058) as well as a 2-group cell line comparison with triplicates (SRP012607).  Likewise, the cell line results also briefly compared RNA-Seq to microarray data in my paper.
  • Accordingly, there are no gene list comparisons, and I think gene expression analysis is probably the most popular type of RNA-Seq analysis
  • Used strand-specific protocol - not sure how robust findings are for other protocols.  For example, I think a lot of data currently being produced is not strand-specific.
  • Only compared paired-end alignments, but single-end data is probably most common and technically sufficient for gene expression analysis (I can't recall the best possible citation for this, but the Warden et al. paper shows STAR single-end and paired-end to be quite similar).  Results may differ for PE versus SE alignments.  For example, this was the case with Novoalign but not really the case for STAR; however, to be fair, this particular difference could be determined ahead of time from the Novocraft website.

In practice, I would probably choose between TopHat and STAR (two of the most popular options). I would say that this paper confirms my previous benchmarks showing that these two programs are more or less comparable with each other.  When I tested STAR, I noticed some formatting issues: for example, I think the recommended settings weren't sufficient to get it to work with cufflinks, and I think Partek had to do some re-processing to produce the stats in our paper.  I assume these problems should be fixable (and I see no technical problem with STAR), but this is why I haven't already switched to using STAR over TopHat on a regular basis.

The result I found potentially interesting is that it seems like STAR may be better than TopHat for variant calling (none of the analysis in the paper that I published can address this question).  However, I would want to see some true validation results, and I think that most users are not concerned with this (and even fewer have paired DNA-Seq and RNA-Seq data to distinguish genomic variants from RNA-editing events).

To be fair, I don't think this paper was designed to provide the type of benchmarks I was most interested in seeing.  However, I think there was still room to predict testable hypotheses and define accuracy with validation experiments.  For example, the authors could have checked how aligners affect splicing events predicted by tools like MATS, MISO, etc. (as long as they produced the samples used in the benchmarks; alternatively, it wouldn't have been too hard to produce some new data for the purpose of being able to perform validation experiments).

Plus, there was a second paper published in the same issue with a number of the same authors (Steijger et al. 2013).  So, maybe this paper isn't really meant to be read in isolation.  For example, that other paper seems to report considerable discrepancies in isoform-level distributions (which matches my own experience that gene-level abundance is preferable for differential expression and splicing event predictions seem more reliable than whole transcript predictions).  In short, I would certainly recommend reading both papers - in addition to others like Rapaport et al. 2013, Liu et al. 2014, Seyednasrollah et al. 2013, Warden et al. 2013, etc.

Tuesday, February 18, 2014

mRNA Quantification via eXpress

eXpress is a tool that allows mRNA quantification using a set of transcripts as a reference (as opposed to popular RNA-Seq tools like TopHat, which align reads to a genome and have to model gaps caused by exon junctions).

Using transcripts rather than genomic chromosomes as the reference sequences is actually how I imagined RNA-Seq analysis would be conducted, before I learned about standard practices.  In fact, samtools provides an 'idxstats' function that can be used to calculate normalized RPKM expression values.  So, I was curious if the extra modeling done by eXpress is really any better than this simple sort of RPKM calculation: having a more complicated model can potentially improve accuracy, but more complicated models can also leave extra room for things to go wrong, can lead to over-fitting, etc.  For example, I have used eXpress on some de novo assembly data, and I actually found that normal de novo programs seemed to provide better results than those specifically designed for RNA-Seq data (however, to be clear, I think the results of this blog post emphasize that the problem was with the assembly and not the mRNA quantification, as I would have expected).
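As a point of reference for the comparison that follows, here is a rough sketch of the simple idxstats-style RPKM calculation.  It assumes you have already saved the tab-delimited output of samtools idxstats (reference name, sequence length, mapped read count, unmapped read count) as text.  The function name is my own, and using the sum of mapped reads across all transcripts as the denominator is an assumption for illustration, so this may not exactly match the calculation used for the results in this post.

```python
def rpkm_from_idxstats(idxstats_text):
    """Compute RPKM per reference sequence from samtools idxstats output.

    Each line is expected to hold four tab-separated columns:
    name, sequence length, mapped reads, unmapped reads.
    RPKM = mapped * 1e9 / (length * total mapped reads).
    """
    rows = []
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name, length, mapped = line.split("\t")[:3]
        if name == "*":
            # The final "*" row only reports reads without a reference.
            continue
        rows.append((name, int(length), int(mapped)))

    # Assumption: normalize by reads mapped to any transcript.
    total_mapped = sum(mapped for _, _, mapped in rows)

    rpkm = {}
    for name, length, mapped in rows:
        if length == 0 or total_mapped == 0:
            rpkm[name] = 0.0
        else:
            rpkm[name] = mapped * 1e9 / (length * total_mapped)
    return rpkm
```

Notice that there is no attempt to resolve reads that map equally well to multiple transcripts, which is one of the things that a tool like eXpress tries to model.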

The short answer is "Yes" - I think it is better to use eXpress over idxstats for calculating RPKM/FPKM values.

To illustrate this, first take a look at the correlations between the eXpress FPKM values and the RPKM values calculated using idxstats:

The correlation isn't horrible, but you can see a non-trivial number of genes whose expression levels are consistently lower in eXpress than idxstats.  However, this by itself doesn't really prove that one option is better than the other.  Because I feel comfortable with the gene-level mRNA quantification levels from cufflinks (and the RSEM-like algorithm implemented in Partek; for example, see Figure 5 in this paper or click here to see a direct correlation between these two results), I decided to see how the results compared when using different tools for a transcript-based reference (eXpress, idxstats) versus a genomic/chromosome-based reference (cufflinks, Partek).

Again, you see these outliers if you compare the idxstats results to cufflinks (or to Partek - click here for those results):

However, you don't see these outliers when comparing eXpress to cufflinks (or to Partek - again, click here for those results):

So, eXpress clearly provides more robust results than the simpler idxstats comparison.  You can also see this in the box plot below, showing the correlation coefficients for all the mRNA quantification strategies that I tested.

Of course, systematic differences between mRNA quantification methods should (at least partially) be corrected when identifying differentially expressed genes between two groups (because the differences affect both groups).  However, there are certain circumstances when the mRNA quantification levels may need to be used in isolation, such as for ranking the most highly expressed genes in a sample (as was the case for the de novo assembly data that I worked with).  In these situations, I would definitely recommend a tool like eXpress over trying to calculate RPKM values from tools like idxstats.

FYI, here are some details on the methodology for this comparison:
  • MiSeq samples from GSE37703 were used for these comparisons.
  • Correlations were calculated using log2(FPKM/RPKM + 0.1) expression values.
  • eXpress and idxstats were run on Bowtie2 alignments of the same set of RefSeq transcripts (downloaded from the UCSC Genome Browser, with duplicated gene IDs removed).  The Partek EM algorithm used a set of RefSeq sequences used by the vendor and cufflinks used the genes.gtf file downloaded from iGenomes on the TopHat website.  Only commonly represented gene symbols were used for calculating correlations.  Only genes declared "solvable" by eXpress were considered for calculating correlations.  As an example, click here to view a venn diagram of overlapping gene symbols for SRR493372.
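If you want to try a similar comparison, the correlation step described above can be sketched roughly as follows.  The function name and the dictionary inputs (gene symbol mapped to FPKM or RPKM value) are hypothetical, and I am assuming a Pearson correlation here for illustration, since the type of correlation coefficient isn't specified in the list above.

```python
import math


def log2_correlation(expr_a, expr_b, offset=0.1):
    """Pearson correlation of log2(value + offset) over shared gene symbols.

    expr_a and expr_b map gene symbol -> FPKM or RPKM value.
    Only gene symbols present in both inputs are used.
    """
    common = sorted(set(expr_a) & set(expr_b))
    if len(common) < 2:
        raise ValueError("need at least two shared gene symbols")

    xs = [math.log2(expr_a[gene] + offset) for gene in common]
    ys = [math.log2(expr_b[gene] + offset) for gene in common]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant values")
    return cov / math.sqrt(var_x * var_y)
```

The small offset keeps genes with zero expression from producing undefined log values, and restricting to the shared gene symbols mirrors the "commonly represented gene symbols" filter.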
P.S. It looks like you may have to be signed into Google Docs to view the image previews properly.  However, you can always download the files to view them locally.
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.