Charles Warden's Science Blog: 2012

Wednesday, October 10, 2012

Summary of "Updated Phylogenetic Analysis of Polyomavirus-Host Co-Evolution"

I recently published a short article in the Journal of Bioinformatics and Research that investigated host switching in polyomaviruses (which you can also download here).

The analysis was pretty straightforward, although I thought it provided a nice example of how Mauve can help supplement traditional phylogenetic analysis. Also, it looks like most polyomavirus phylogenies in virology journals compare the divergence of individual protein sequences, but I found that analysis of the genomic sequence seems to provide more useful, consistent results.

In the interests of full disclosure, part of the reason I want to plug this paper is that I am on the editorial board for this journal. That said, I do honestly think it is a journal worth considering if you have a paper that isn't a good fit for a more established journal: it is open-access, turn-around time is quick, and it only costs $100 in processing charges for an accepted manuscript.

Monday, October 1, 2012

My DREAM Model for Predicting Breast Cancer Survival

This summer, I have worked on submitting a few models to the DREAM competition for predicting breast cancer survival.

Although I was originally planning on posting about my model after the competition was completely finished, I decided to go ahead and describe my experience because 1) my model honestly didn't radically differ from the example model and 2) I don't think I have enough time to redo the whole model building process on the new data before the 10/15 deadline.

To be clear, the performance isn't all that different for the old and new data, but there are technical details that would have to be worked out to submit the models (and I would want to take time to re-examine the best clinical variables to include in the model). For example, here are the concordance index values for my three models on the training dataset:

Model	Old Data	New Data
CWexprOnly	0.64	0.60
CWfullModel	0.72	NA
CWreducedModel	0.71	0.68
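
For readers unfamiliar with the concordance index, here is a minimal sketch of how such a score can be computed. This is not the competition's scoring code: the function name, the handling of ties, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations

def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs that the risk scores order correctly."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient a has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the earlier time is an observed death.
        # Pairs with tied times are skipped to keep the sketch simple.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # higher predicted risk died first
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data (made up): follow-up times, death indicators, predicted risks.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.7, 0.8, 0.1]
toy_ci = concordance_index(toy_times, toy_events, toy_risks)
print(toy_ci)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the values in the table above should be read.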

The old models are supposed to be converted to work on the new data. If this does happen, then I'll be able to see the performance of these models on the future datasets (additional METABRIC test dataset + new, previously unpublished dataset). That would certainly be cool, but this conversion has not yet happened.

In general, my strategy was to pick the gene expression values that correlated most strongly with survival, and I then averaged the expression of probes either positively or negatively correlated with patient survival. On top of this, I further filtered the probes to only include those that vary between high and low grade patients. My qualitative observation from working with breast cancer data has been that genes that vary with multiple clinically relevant variables seem to be more reproducible in independent cohorts. So, I thought that this might help when examining the true, new validation set. However, I gave this much smaller weight than the survival correlation (I required the probes to have a survival correlation FDR < 1e-8 and a |correlation coefficient| > 0.25, but I only required the probes to also have a differential grade FDR < 0.01).
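
The metagene idea described above can be sketched roughly as follows. This is a simplified illustration, not my actual model code: it treats survival as a plain numeric vector (ignoring censoring), applies only the |correlation coefficient| > 0.25 cutoff, skips both FDR filters, and uses made-up probe names and values.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def build_metagenes(expression, survival, min_abs_r=0.25):
    """expression maps a probe name to its per-patient expression values."""
    pos = [p for p, v in expression.items() if pearson(v, survival) > min_abs_r]
    neg = [p for p, v in expression.items() if pearson(v, survival) < -min_abs_r]

    def average(probes):
        # Per-patient mean across the selected probes.
        return [mean(expression[p][i] for p in probes)
                for i in range(len(survival))]

    return pos, neg, average(pos) if pos else None, average(neg) if neg else None

# Toy data (hypothetical probes, four patients).
toy_expression = {
    "probeA": [1, 2, 3, 4],
    "probeB": [4, 3, 2, 1],
    "probeC": [1, 3, 3, 1],
    "probeD": [2, 4, 6, 8],
}
toy_survival = [10, 20, 30, 40]
pos_probes, neg_probes, pos_metagene, neg_metagene = build_metagenes(
    toy_expression, toy_survival)
print(pos_probes, neg_probes)
```

The two averaged vectors are what would then enter the regression as the positive and negative metagene.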

So, these three models can be described as:

CWexprOnly: cox regression; positive and negative metagenes only

CWfullModel: cox regression; tumor size + treatment * lymph node positive + grade + Pam50Subtype + positive metagene + negative metagene

CWreducedModel: cox regression; tumor size + treatment * lymph node positive + positive metagene

The CWreducedModel was used to see how much a difference it made to only include the strongest variables (and to what extent the full model may be subject to over-fitting). The CWexprOnly model was used to see how well the gene expression could predict survival, even without the assistance of any clinical variables.
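
Assuming the `*` in these formulas follows the usual R formula convention (both main effects plus their interaction), the CWreducedModel corresponds to a Cox proportional hazards model of roughly this form. The symbol names are illustrative only:

```latex
h(t \mid x) = h_0(t)\,\exp\bigl(
    \beta_1\,\mathrm{size}
  + \beta_2\,\mathrm{trt}
  + \beta_3\,\mathrm{LN}
  + \beta_4\,(\mathrm{trt}\times\mathrm{LN})
  + \beta_5\,M^{+}
\bigr)
```

Here $h_0(t)$ is the baseline hazard, $\mathrm{trt}$ is treatment, $\mathrm{LN}$ is lymph node positivity, and $M^{+}$ is the positive metagene. Under the same assumption, CWexprOnly keeps only the two metagene terms, and CWfullModel adds terms for grade, Pam50Subtype, and the negative metagene.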

I included the treatment * lymph node positive variable because it defined a variable similar to the strongly correlated "group" variable, without making assumptions about which were the most important variables (and, as I would later learn, the "group" variable won't be provided for the new dataset).

Additionally, one observation I made prior to the model building process was how strongly the collection site correlated with survival (see below). This variable wasn't defined by the individual patient, and I assumed this should be a technical variation (or at least something that won't be useful in a truly independent validation dataset). The new data diminishes the impact of this confounding variable, but the correlation is still there.

Variable	Old Data	New Data
Collection Site	0.42	0.23
Group	-0.51	-0.45
Treatment	0.29	0.28
Tumor Size	-0.18	NA
Lymph Node Status	-0.24	NA

ER, PR, and HER2 status are also important variables. However, PR and HER2 status were missing in the old data, and I didn't record the original ER correlation. Therefore, they are among the variables that I don't report in the above table. Likewise, the representation of the tumor size and lymph node status variables changed between the two datasets.

This was a valuable experience for me, and I'm sure the DREAM papers that come out next year will be worth checking out. There were some details about the organization that I think can be improved (avoid changing the data throughout the competition, find a way to limit the number of models to avoid cherry-picking of over-fitted, non-robust models, and avoid providing rewards for intermediate predictions on data where users could cheat by using the publicly available test dataset). Nevertheless, I'm sure the process will be streamlined if SAGE assists with the DREAM competition next year, and I think there will be some useful observations about optimal model building from the current competition.

Monday, September 10, 2012

Using Virgin HealthMiles to Track Daily Activity

Virgin HealthMiles is a tool that allows users to keep track of their activity to earn "HealthMiles", which can be used to earn money (financed through your employer). This program caught my attention because I think the financial incentives provide a unique way to encourage people to get healthier. You can either log your activity manually or use cool gadgets to record your activity (which earn you more HealthMiles, because they are not biased).

At my company, anybody who filled out the initial health survey earned $50. I used my $50 to buy a GoZone Pedometer, which I have been using to record my activity for approximately 2 months (see below for an example of my recorded activity).

The pedometer works well for counting steps, but not much else. For example, it is advertised that the pedometer can count "active minutes" (activity where the user takes more than 135 steps per minute), but it is really hard to earn active minutes. Fifteen minutes on the StairMaster earned me 6 active minutes. Cycling for 1 mile earned me 0 active minutes. Slaving away with strenuous yard work for 8 hours earned me 0 active minutes.
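
To make the "active minutes" rule concrete, here is a minimal sketch based only on the advertised definition (more than 135 steps in a minute). How the device actually counts is unknown to me, and the per-minute step counts below are made up.

```python
ACTIVE_THRESHOLD = 135  # steps per minute, from the advertised definition

def count_active_minutes(steps_per_minute, threshold=ACTIVE_THRESHOLD):
    """Count the minutes whose step total exceeds the threshold."""
    return sum(1 for steps in steps_per_minute if steps > threshold)

# Hypothetical per-minute step counts for a short workout.
toy_workout = [100, 136, 135, 150]
toy_active = count_active_minutes(toy_workout)
print(toy_active)
```

Under this rule, any activity that doesn't register as rapid steps (cycling, yard work) scores zero no matter how strenuous it is, which matches what I saw.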

Potentially, I could be using these HealthMiles to eventually earn more expensive gadgets (which I assume are more accurate). However, my employer currently doesn't fund any of these incentives (possibly because they are trying to gauge employee interest). I think the incentives are an important part of the program, and I hope they are added in the future. If your employer does appropriately fund a Virgin HealthMiles program at your company, I think it is a useful program to help motivate employees to live an active, healthy lifestyle.

Thursday, June 14, 2012

My 23andMe Results: Getting a (Free) Second Opinion (Part II)

NOTE: Getting Advice About Genetic Testing

Since my previous post comparing my 23andMe health report to Promethease was so popular, I thought it would be worthwhile to share what I have found from digging a little deeper into my raw 23andMe data.

This analysis required some coding on my part, but I've provided links to see a detailed description of my analysis and how to reproduce this analysis. If you don't want to try and run my scripts on your own data, you can just take a look at the high-level discussion that I have provided below.

Step #1: Annotate SNPs using SeattleSNP
Step #2: Match SNPs in GWAS Catalog
Step #3: Combine SeattleSNP and GWAS Catalog annotations. Add PAM score.
Step #4: Filter combined dataset
Step #5: Summarize features in combined dataset

Description of my 23andMe SNPs

Number of 23andMe SNPs: 950,566 (v3 array)

Unique SNPs in Combined File (SeattleSNP + GWAS Catalog): 926,754 (97.5%)

Number of SNPs with GWAS Catalog Annotations: 3,050
Number of SNPs with Disease-Associated Alleles: 1,626
     -Heterozygous Risk Allele: 990
     -Homozygous Risk Allele: 636

Number of Coding SNPs: 288,894
Number of Non-synonymous SNPs: 16,993
Number of Non-synonymous SNPS with PAM Score < 0: 915
Number of SNPs Causing Premature Stop Codons: 57

Integration of SeattleSNP and GWAS Catalog Annotations

If I filter my non-synonymous SNPs for those with odds ratios greater than 2 and a PAM score less than 0, then I can identify a single SNP (rs1260326) with 3 entries in the GWAS catalog for associations with triglycerides (OR = 8.8, Teslovich et al. 2010), liver enzyme levels for gamma-glutamyl transferase (OR = 3.2, Chambers et al. 2011), and platelet counts (OR = 2.3, Gieger et al. 2011). This allele is present in approximately 40% of the population, and it changes the coding sequence of glucokinase (hexokinase 4) regulator (GCKR). Reviewing Chambers et al. 2011 was especially interesting because GCKR was selected as one of the five genetic loci to also be tested for correlations with metabolomic data (figure 3 of that publication). In fact, GCKR seems to show the strongest correlation with increased LDL and VLDL in that figure.

NCBI Gene also indicates that GCKR is associated with diabetes, which is also described in the text for Chambers et al. 2011. Chambers et al. 2011 classify GCKR as a gene associated with inflammation, as measured by concentrations of C-reactive protein (CRP) in Elliott et al. 2009. As a general note, all of these publications require a subscription, but NCBI Gene is a good free source of information about gene functions.

The nice thing about these associations is that many of them are measured with routine blood tests. Although I have always received normal blood test results, I can easily keep an eye out for changes in the future. More specifically, Chambers et al. 2011 show an association between my GCKR SNP and gamma-glutamyl transferase levels (GGT), which is "sensitive to most kinds of liver insult, particularly alcohol" (citing Pratt et al. 2000). So, perhaps this can encourage me to continue to drink only in moderation.

Comparison with Previous Analysis

In my previous blog post, I highlighted 3 disease associations: venous thromboembolism, rheumatoid arthritis, and type I diabetes. Of course, none of these associations are identified if I filter both by GWAS Catalog odds ratios and PAM scores, but I do find 3 SNPs associated with rheumatoid arthritis if I only filter for GWAS Catalog associations with an odds ratio greater than 2.

The reason I didn't originally see these SNPs in my first filter is that none of them cause non-synonymous mutations. Like most of the SNPs, they were not located in coding regions. Unfortunately, it is harder to characterize the likely function of these types of mutations, but this is certainly an exciting area of on-going research.

If I look at the GWAS catalog annotations for my SNPs, I can confirm that the GWAS catalog does contain SNPs associated with venous thromboembolism and type I diabetes (in fact, there are a lot of SNPs associated with type I diabetes), and I can confirm that I am a carrier for some of these risk alleles. However, none of these SNPs showed associations with odds ratios greater than 2.

Although I am emphasizing the overlap between different methods of analyzing my 23andMe data, I think it would be too conservative to say that only candidates that are independently identified are worth examining. For example, it is very hard to determine the best way to predict the interaction of different variants. In fact, I found it especially exciting to read about the GCKR SNP that didn't jump out at me from any of the other analyses, and I think gaining exposure to genomics research is a very important benefit of having direct-to-consumer genetic testing.

Summarize 23andMe SNP Categories

The primary goal of this script is to provide statistics about your 23andMe SNPs (number of annotated SNPs, number of homozygous / heterozygous disease associations, number of coding SNPs, etc.)
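
As a rough illustration of the homozygous / heterozygous counts, the classification can be sketched like this. This is not the perl script itself: the function names are hypothetical, the genotypes are toy values, and the sketch assumes the genotype and risk allele are reported on the same strand.

```python
from collections import Counter

def risk_status(genotype, risk_allele):
    """Classify a two-letter genotype by how many copies of the risk allele it carries."""
    copies = genotype.count(risk_allele)
    if copies == 2:
        return "Homozygous"
    if copies == 1:
        return "Heterozygous"
    return "none"

def summarize(snps):
    """snps is an iterable of (genotype, risk_allele) pairs."""
    return Counter(risk_status(genotype, allele) for genotype, allele in snps)

# Toy genotype / risk allele pairs (made up).
toy_snps = [("AG", "A"), ("AA", "A"), ("GG", "A"), ("CT", "T")]
toy_summary = summarize(toy_snps)
print(toy_summary)
```

Strand mismatches between the array and the catalog would need extra handling that is left out here.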

Step #1: Prepare Inputfile

Prepare combined SNP file (click here for details)
This will also work for filtered files (check here for details)

Step #2: Produce Summary Statistics

Download the perl script 23andMe_stats.pl
There is one parameter that you need to enter:

inputfile = file containing 23andMe SNPs with both SeattleSNP and GWAS Catalog annotations (click here for details)

PC Users

Open a terminal window (type "cmd" in Run, for example)
Move to the folder where your 23andMe data is saved.

Basic commands:

cd = change folder

If the data is not on your C:\ drive, you can type "cd /d D:"

.. = move up one folder

Type in "perl 23andMe_GWAS_stats.pl" and enter the required inputfile parameter. See example below (click to enlarge).

Mac Users

Open Terminal (in Applications/Utilities, for example)
Basic commands:

cd = change folder
.. = move up one folder

Type in "perl 23andMe_GWAS_stats.pl" and enter the required inputfile parameter. See example below (click to enlarge).

I have tested my perl scripts on a PC and Mac, but I cannot guarantee that they will work on every possible platform. Also, these scripts may need modifications as file formats change, but I have currently confirmed that my scripts work with v2 and v3 arrays using genomes from Genomes Unzipped. If you have any questions or comments, please post them below and I will do my best to help troubleshoot.

Filter Combined Annotations for 23andMe SNPs

Step #1: Prepare Inputfile

List of 23andMe SNPs with both SeattleSNP and GWAS Catalog annotations (click here for details)

Step #2: Filter List of SNPs

Download the perl script 23andMe_filter.pl
There is one parameter that you need to enter:

input = file containing 23andMe SNP file with SeattleSNP and GWAS Catalog SNPs (see here for more details)

There are 5 optional parameters that you can enter:

output = output file containing filtered SNP lists. By default, _filter.txt is appended to the end of the input file
OR = odds ratio cutoff (filter for scores greater than cutoff) [default = 2]
PAM = PAM score cutoff (filter for scores less than cutoff) [default = 0]
risk_status = status for GWAS Catalog risk allele: either "Homozygous", "Heterozygous" (which actually filters for both homozygous and heterozygous risk alleles), or "none" [default = "Heterozygous"]
allele_freq = set of parameters to describe allele frequency cutoff. If provided, the parameter must use the following format: [genetic background]_[comparison type]_[threshold]. For example, European_gt_0.25. [default = "none"]

Genetic background can be "European", "African", or "Asian"
Comparison type can be "gt" for greater than or "lt" for less than
Threshold corresponds to the population frequency. Must be between 0 and 1.
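
To illustrate how these cutoffs fit together, here is a hedged sketch of the filtering logic in Python. It is not a translation of 23andMe_filter.pl: the function names and the dictionary keys ("OR", "PAM", "European_freq") are hypothetical, the risk_status filter is left out, and the SNP values are made up.

```python
def parse_allele_freq(spec):
    """Split e.g. 'European_gt_0.25' into (background, comparison, threshold)."""
    background, comparison, threshold = spec.split("_")
    if background not in ("European", "African", "Asian"):
        raise ValueError("unknown genetic background: " + background)
    if comparison not in ("gt", "lt"):
        raise ValueError("comparison type must be 'gt' or 'lt'")
    threshold = float(threshold)
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    return background, comparison, threshold

def passes_filter(snp, odds_ratio=2.0, pam=0.0, allele_freq="none"):
    """Keep SNPs with OR above the cutoff and PAM score below the cutoff."""
    if snp["OR"] <= odds_ratio or snp["PAM"] >= pam:
        return False
    if allele_freq != "none":
        background, comparison, threshold = parse_allele_freq(allele_freq)
        freq = snp[background + "_freq"]  # hypothetical column naming
        return freq > threshold if comparison == "gt" else freq < threshold
    return True

# Toy SNP records (all values made up).
toy_common = {"OR": 3.2, "PAM": -1, "European_freq": 0.40}
toy_weak = {"OR": 1.5, "PAM": -1, "European_freq": 0.40}
print(passes_filter(toy_common, allele_freq="European_gt_0.25"))
print(passes_filter(toy_weak))
```

The default cutoffs in the sketch mirror the defaults listed above (OR = 2, PAM = 0, no allele frequency filter).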

PC Users

Open a terminal window (type "cmd" in Run, for example)
Move to the folder where your 23andMe data is saved.

Basic commands:

cd = change folder

If the data is not on your C:\ drive, you can type "cd /d D:"

.. = move up one folder

Type in "perl 23andMe_filter.pl" and enter the required input parameter. See example below (click to enlarge) .

You can also enter in optional parameters (OR, PAM, risk_status, and/or allele_freq). See example below (click to enlarge).

Mac Users

Open Terminal (in Applications/Utilities, for example)
Basic commands:

cd = change folder
.. = move up one folder

Type in "perl 23andMe_filter.pl" and enter the required input parameter. See example below (click to enlarge).

You can also enter in optional parameters (OR, PAM, risk_status, and/or allele_freq). See example below (click to enlarge).

Combine SeattleSNP and GWAS Catalog Annotations for 23andMe SNPs

There are two main functions for this script:

1) Combine the results from 23andMe_to_SeattleSNP.pl and 23andMe_GWAS_catalog.pl

2) Add a score to predict the severity of non-synonymous SNPs. In this case, I am adding a PAM score (created from this matrix). These scores are correlated with the frequency of various amino acid substitutions over time. In fact, there are different PAM matrices that can be used. There are some slightly more rigorous tools to accomplish this (such as PolyPhen or SIFT), and SeattleSNP can provide PolyPhen predictions for certain SNPs. However, I wanted to use the PAM score as something that can be quickly added to all the non-synonymous mutations.
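
The PAM lookup itself is simple. Here is a minimal sketch, assuming a tab-delimited matrix with amino acid letters as row and column labels. The three-letter matrix below is a toy example for illustration only; use a full matrix file for real data, and note that the layout of my own matrix file may differ.

```python
import io

# Toy tab-delimited matrix (illustrative values for three amino acids only).
TOY_MATRIX = "\n".join([
    "\tA\tR\tW",
    "A\t2\t-2\t-6",
    "R\t-2\t6\t2",
    "W\t-6\t2\t17",
]) + "\n"

def load_matrix(handle):
    """Read a substitution matrix into a {(from_aa, to_aa): score} dictionary."""
    header = handle.readline().rstrip("\n").split("\t")[1:]
    scores = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank lines
        row_aa = fields[0]
        for col_aa, value in zip(header, fields[1:]):
            scores[(row_aa, col_aa)] = int(value)
    return scores

def pam_score(scores, ref_aa, alt_aa):
    """Score for replacing ref_aa with alt_aa; negative means a rarer substitution."""
    return scores[(ref_aa, alt_aa)]

toy_scores = load_matrix(io.StringIO(TOY_MATRIX))
print(pam_score(toy_scores, "A", "W"))
```

A negative score flags a substitution that is rarely observed over evolutionary time, which is why the filtering step uses PAM < 0 as its default cutoff.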

Step #1: Prepare Inputfiles

SeattleSNP annotations (click here for details)
GWAS Catalog annotations (click here for details)
My PAM matrix can be downloaded here.

Step #2: Combine Files

Download the perl script 23andMe_combine.pl
There are three parameters that you need to enter:

seattleSNP = 23andMe SNPs with SeattleSNP annotations (click here for details)
GWAS = 23andMe SNPs with GWAS Catalog annotations. Please note that this is not the original GWAS annotation file but the file that was created at this step. (click here for details)
PAM = substitution matrix indicating the severity of the non-synonymous mutation (such as the file provided here)

The outputfile will have _combined.txt appended to the end of the seattleSNP file name.
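
Conceptually, the combine step is a join of the two annotation tables on the SNP identifier. A hedged Python sketch of that idea is below; it is not the perl script, the column names ("rsID", "function", "trait") are hypothetical, and it keeps only one GWAS row per SNP even though a SNP can have several entries in the catalog.

```python
import csv
import io

def read_by_rsid(handle, id_column):
    """Index the rows of a tab-delimited table by SNP identifier."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column]: row for row in reader}

def combine(seattle_handle, gwas_handle, id_column="rsID"):
    """Attach GWAS columns (prefixed with 'GWAS_') to each SeattleSNP row."""
    seattle = read_by_rsid(seattle_handle, id_column)
    gwas = read_by_rsid(gwas_handle, id_column)
    combined = []
    for rsid, row in seattle.items():
        merged = dict(row)
        for key, value in gwas.get(rsid, {}).items():
            if key != id_column:
                merged["GWAS_" + key] = value
        combined.append(merged)
    return combined

# Toy tables (made-up SNP identifiers and annotations).
toy_seattle = io.StringIO("rsID\tfunction\nrs1\tmissense\nrs2\tintron\n")
toy_gwas = io.StringIO("rsID\ttrait\tOR\nrs1\ttoy trait\t2.5\n")
toy_combined = combine(toy_seattle, toy_gwas)
print(toy_combined)
```

SNPs without a GWAS entry are kept with only their SeattleSNP columns, so the combined table has one row per annotated SNP.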

PC Users

Open a terminal window (type "cmd" in Run, for example)
Move to the folder where your 23andMe data is saved.

Basic commands:

cd = change folder

If the data is not on your C:\ drive, you can type "cd /d D:"

.. = move up one folder

Type in "perl 23andMe_combine.pl" and enter the required SeattleSNP, GWAS, and PAM parameters. See example below (click to enlarge).

Mac Users

Open Terminal (in Applications/Utilities, for example)
Basic commands:

cd = change folder
.. = move up one folder

Type in "perl 23andMe_combine.pl" and enter the required SeattleSNP, GWAS, and PAM parameters. See example below (click to enlarge).

Find 23andMe SNPs with GWAS Catalog Annotations

Although there are other tools to help sort through the annotations in the GWAS Catalog, I've found that none of them completely satisfy my needs. More importantly, SeattleSNP clinical associations don't directly provide the name of the disease they are associated with and are not identical to the annotations in the GWAS Catalog. So, this information is meant to complement the report that can be obtained from SeattleSNP.

Step #1: Download GWAS Catalog Data

There should be a link on the main GWAS catalog website to download the full catalog. As of today, you can click this link to view / download the annotations.

For most internet browsers, you can download the data as a tab-delimited file by right-clicking on the link and then left-clicking "save target as...".
Please do not copy and paste the table from your browser. This may not preserve the proper formatting.

Please save the GWAS annotations in the same folder as your 23andMe data

The file is currently saved as gwascatalog.txt. If the name of this file changes in the future, please rename the file gwascatalog.txt

Step #2: Find Overlapping SNPs

Download the perl script 23andMe_GWAS_catalog.pl
There is one parameter that you need to enter:

genome = raw data file from 23andMe

The resulting output file will have _GWAS.txt appended to the name of the genome file
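
The overlap search can be sketched in Python as follows. This is an illustration rather than the perl script: it assumes the raw 23andMe file has '#' comment lines followed by tab-delimited rsid, chromosome, position, and genotype columns, and that the catalog has a column named "SNPs" (check the header of your own download). The identifiers and traits are toy values.

```python
import csv
import io

def read_23andme(handle):
    """Return {rsid: genotype}, skipping the '#' comment header."""
    genotypes = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        rsid, _chromosome, _position, genotype = line.rstrip("\r\n").split("\t")
        genotypes[rsid] = genotype
    return genotypes

def match_catalog(genotypes, catalog_handle, snp_column="SNPs"):
    """Keep catalog rows whose SNP is on the array, adding the genotype."""
    reader = csv.DictReader(catalog_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(row, genotype=genotypes[row[snp_column]])
            for row in reader if row[snp_column] in genotypes]

# Toy inputs (made-up identifiers and traits).
toy_raw = io.StringIO("# rsid\tchromosome\tposition\tgenotype\n"
                      "rs0000001\t1\t1000\tAG\n"
                      "rs0000002\t2\t2000\tCC\n")
toy_catalog = io.StringIO("Disease/Trait\tSNPs\n"
                          "Toy trait\trs0000002\n"
                          "Other toy trait\trs0000003\n")
toy_matches = match_catalog(read_23andme(toy_raw), toy_catalog)
print(toy_matches)
```

Because every matching catalog row is kept, a SNP with several catalog entries appears once per entry in the output.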

PC Users

Open a terminal window (type "cmd" in Run, for example)
Move to the folder where your 23andMe data is saved.

Basic commands:

cd = change folder

If the data is not on your C:\ drive, you can type "cd /d D:"

.. = move up one folder

Type in "perl 23andMe_GWAS_catalog.pl" and enter the required genome parameter. See example below (click to enlarge) .

Mac Users

Open Terminal (in Applications/Utilities, for example)
Basic commands:

cd = change folder
.. = move up one folder

Type in "perl 23andMe_GWAS_catalog.pl" and enter the required genome parameter. See example below (click to enlarge) .

You can open and manipulate the resulting file in Excel (or OpenOffice Calc)

Reformat 23andMe Data for SeattleSNP

Step #1: Download Raw Data from 23andMe

After signing into 23andMe, first go to "Account" (in the top right hand corner of the screen) and then "Browse Raw Data"
Click the link near the top of the page to "download raw data"
Choose "All DNA" for your data set, and then click "Download Data"

Step #2: Reformat Raw Data

Download the perl script 23andMe_to_SeattleSNP.pl
There is one parameter that you need to enter:

genome = raw data file from 23andMe

PC Users

Open a terminal window (type "cmd" in Run, for example)
Move to the folder where your 23andMe data is saved.

Basic commands:

cd = change folder

For example, if the data is on your D:\ drive, you can type "cd /d D:"

.. = move up one folder

Type in "perl 23andMe_to_SeattleSNP.pl" and enter the required genome parameter. See example below (click to enlarge) .

Mac Users

Open Terminal (in Applications/Utilities, for example)
Basic commands:

cd = change folder
.. = move up one folder

Type in "perl 23andMe_to_SeattleSNP.pl" and enter the required genome parameter. See example below (click to enlarge).

Step #3: Upload Data to SeattleSNP

The 23andMe SNP data currently uses NCBI 36 / hg18. You can confirm if this is still the case by using a text editor like Notepad++ to view the raw data.

There are a few different portals to access SeattleSNP annotations, but you will need to use this link if the 23andMe data is currently using NCBI 36 (as of today, NCBI 37 / hg19 is the latest genome build): http://snp.gs.washington.edu/SeattleSeqAnnotation/

Enter your e-mail address
Select the file created by the perl script. It should be almost identical to the genome file, but it will have _SeattleSNP.txt at the end of the file name
This file conforms to the "custom" format, so please select "custom" under "input file format" and enter the following information

Chromosome: 2
Location: 3
Reference Allele: 0
First Allele: 4
Second Allele: 5

Click the green submit button
It may take several hours to annotate your 23andMe SNPs. You will receive an e-mail message when the annotated file is ready to download.
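
The column numbers above suggest that the reformatted file splits each genotype into two allele columns (columns 4 and 5). Here is a rough Python sketch of that kind of conversion. It is an inference from the column layout, not a copy of 23andMe_to_SeattleSNP.pl, and skipping no-calls, indels, and single-letter genotypes is my simplifying assumption for the sketch.

```python
def to_custom_format(lines):
    """Turn raw 23andMe lines into rsid, chromosome, location, allele 1, allele 2."""
    converted = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the comment header and blank lines
        rsid, chromosome, position, genotype = line.rstrip("\r\n").split("\t")
        # Keep only two-letter genotypes made of A, C, G, or T.
        if len(genotype) != 2 or not set(genotype) <= set("ACGT"):
            continue
        converted.append("\t".join(
            [rsid, chromosome, position, genotype[0], genotype[1]]))
    return converted

# Toy raw lines (made-up identifiers and positions).
toy_lines = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs0000001\t1\t1000\tAG",
    "rs0000002\t1\t2000\t--",
    "rs0000003\tX\t3000\tA",
]
toy_converted = to_custom_format(toy_lines)
print(toy_converted)
```

With this layout, chromosome is column 2, location is column 3, and the two alleles are columns 4 and 5, matching the values entered in the "custom" format form.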

Friday, May 25, 2012

Shared Scripts for Genomic Analysis

I've recently added a section to my personal website that contains some scripts that I have used for bioinformatics analysis:

https://sites.google.com/site/cwarden45/scripts

As of right now, the page contains a handful of scripts for microarray, next-generation sequencing, and qPCR analysis. I plan on updating this page periodically.

Generally speaking, the scripts aren't organized in a carefully documented package (like what you can find from Bioconductor, etc.). However, I have found them to be very useful templates for routine analysis, so I thought it might be useful to share them with others.

Monday, January 30, 2012

Article Review: Accurate identification of A-to-I RNA editing in human by transcriptome sequencing

In this article, Bahn et al. develop a novel method to identify A-to-I RNA editing sites in next-generation sequencing data.

My favorite aspect of this paper was how the authors empirically estimated the false discovery rate of their algorithm using an ADAR siRNA knock-down in a cancer cell line that only showed normal expression levels for one member of the ADAR family (shown in Figure 2 of the paper). Experimental validation with Sanger sequencing also shows a low false positive rate for the A-to-G events (although not necessarily for non-A-to-G events).

Supplemental Table 3 is also worth checking out: it provides a good review of genome-wide RNA editing studies, including the contentious study in Science by Li et al. For example, only 34% of the RNA editing sites shared by Li et al. and this paper were A-to-G events, whereas 86-100% of the overlapping sites for all of the other studies were A-to-G events. Likewise, the differences in the histograms for RNA editing sites (Figure 2A in this paper, and Figure 1A in Li et al.) emphasize how different the analysis in Li et al. is from other similar studies in the literature.

The supplemental table also shows how few RNA editing sites overlap between studies. For example, the authors emphasize how their study recovers 854 A-to-G differences in the DARNED database, but I think it is worth keeping in mind that there were 42,045 sites in the DARNED database and 9,636 predicted RNA editing sites (using the threshold for comparison with other studies). This seems to be a common problem that isn't unique to this study (and the authors emphasize that the overlap between genes with RNA editing sites is greater than the overlap of individual RNA editing sites), but I think it is still an interesting observation that is worth keeping in mind for future analysis (which will hopefully have larger samples of paired DNA-Seq and RNA-Seq samples).
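
To put those numbers in perspective, the overlap can be expressed as a fraction of each set. The counts are the ones quoted above; the variable names are just for illustration.

```python
overlap = 854            # A-to-G differences recovered from DARNED
darned_sites = 42045     # sites in the DARNED database
predicted_sites = 9636   # predicted RNA editing sites in this study

frac_darned = overlap / darned_sites
frac_predicted = overlap / predicted_sites

print(f"{frac_darned:.1%} of DARNED sites were recovered")      # 2.0%
print(f"{frac_predicted:.1%} of predicted sites are in DARNED")  # 8.9%
```

Either way, the shared sites are a small minority of both lists.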

In general, I think this method does a good job of identifying and filtering likely causes of spurious RNA editing events (like those mentioned in Schrider et al. 2011). For example, the authors use a "double-filtering" strategy to focus on reads with unique alignments (where a conservative threshold is used to define alignments to potential RNA editing sites but more liberal criteria are used to search for homologous regions that could be causing inaccurate alignments). I also liked that most of the in-depth analysis focused on sites with an editing ratio greater than 0.2.

This study focused on analysis of the grade IV glioma cell line U87MG (RNA-Seq: GSE28040, DNA-Seq: GSE19986) and a primary breast cancer sample (EGAS00000000054). Although it probably allowed for more cost-effective analysis, I wonder if the results would have been even cleaner if the RNA-Seq and DNA-Seq data were both newly created for this study using similar technologies (for example, the RNA-Seq data is paired-end Illumina reads whereas the DNA-Seq data was from another study using SOLiD reads). However, I think the results were clean enough that this probably didn't matter too much (based upon the ADAR knock-down data).

The novel motif discovery (Figure 5) was interesting, but I had a hard time imagining the relevance of this motif that isn't found at a consistent distance from the A-to-I site (like those shown in Figure 4). That said, I would be interested in seeing any follow-up analysis that characterizes the mechanism by which this motif is involved with A-to-I editing.

I think this study only provides very limited analysis on A-to-I editing in cancer. To be fair, the sample size (one sample at a time) is probably not sufficient to make many general claims about A-to-I editing in cancer. However, I still think this aspect of the study was over-emphasized. For example, Supplemental Table 13 shows how sensitive the hypergeometric test (comparing RNA editing sites in the two samples) will be when dealing with such a large background set; all of the RNA editing events except G-to-C were statistically significant with a p-value < 0.05, even though the A-to-G overlap was the only category with more than 5 overlapping sites. In other words, I don't think statistical significance was a strong indicator of biological importance for this analysis. Likewise, it was nice that the enrichment analysis of the NCI Cancer Gene Index genes provided some candidate genes, but I don't think this study is useful in identifying a gene where A-to-I editing is highly likely to play an important role in oncogenesis.
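
The sensitivity of the hypergeometric test to a large background can be seen with a small sketch. The numbers below are hypothetical and are not taken from Supplemental Table 13: they only show that an overlap of two sites can be "significant" when the background is very large.

```python
from math import comb

def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1))
    return favorable / total

# Hypothetical scenario: 100 edited sites in each sample, drawn from a
# background of 1,000,000 candidate positions, with only 2 sites shared.
toy_p = hypergeom_sf(2, 1_000_000, 100, 100)
print(toy_p)
```

With an expected overlap of only 0.01 sites by chance, even two shared sites give a p-value far below 0.05, which is why I would not read much into significance alone here.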

Overall, I would recommend this article to anyone interested in RNA editing and next-generation sequencing analysis.
