ANNOVAR Performance Optimizations

Ion Flux contributed major performance improvements to the latest version (2011Sep11) of ANNOVAR.

From the ANNOVAR website:

2011Sep11: New Version of ANNOVAR is released with significant speedup of filter operation for certain databases (dbSNP, SIFT, PolyPhen, 1000G, etc). In previous version of ANNOVAR, filter-based annotation for ex1.human (12 variants) requires ~10 minutes for snp132, sift or polyphen. In the new version, it takes 1 second only! […]

We like to see open-source software projects thrive, and we’re happy to contribute where we’re able to do so. Thanks to ANNOVAR’s author, Kai Wang, for providing such a useful piece of software, and to Ion Flux engineer Marine Huang for optimizing the implementation.

Elastic Analysis Pipelines at Amazon AWS

Ion Flux founder and CEO Allen Day will be presenting the topic “Elastic Analysis Pipelines” at Amazon’s AWS Genomics Event on September 22.

UPDATE: Slides from Allen’s presentation are now available here.

Consequence Analysis of Ion Torrent’s Gordon Moore data

Now that Gordon Moore‘s genome variants are available as part of Ion Torrent’s publication in Nature this week, we’re announcing some work we’ve done analyzing an early, low-coverage version of these data.

Ion Flux has been building highly-scalable systems for analyzing personal genome data. One of these systems produces a gene panel consequence report. The report’s purpose is to indicate where variants have been detected in a group of genes and what, if any, are the consequences of the detected variants. We specifically chose to analyze a small panel of well-known cancer genes, for many of which treatment options are available or are in clinical trials.

You can have a look at some demo output of our gene panel consequence report. We emphasize that the information content of this report is based on pre-publication, low-coverage Ion Torrent data and is insufficient to make consequence claims with any acceptable level of confidence.

We also encourage you to have a look at the SNPedia-based Promethease report on the published Moore variants.

Cloud-based Human Genome Alignment

Motivation

The purpose of generating the data set described here is two-fold:

  1. We want to better understand the sensitivity and specificity of particular sequences/regions in the genome, ultimately leading to faster, higher quality genome alignments.
  2. We want to establish time and cost benchmarks for processing large volumes of Human genome data.

Methods

We processed a 1bp-step, 50bp sliding-window data set of human genome build 19 (HG19) using Ion Torrent’s TMAP aligner (written by Dr. Nils Homer). The specificity/sensitivity analysis is currently underway and will be published here in a follow-up blog post.

Results

Let’s look at the gross statistics of the operation:

How were the data generated?

Input was a 1bp-step, 50mer tiling data set generated from HG19. We’re not hosting these data; you can generate them yourself with this make_reads.pl script.
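For illustration only, here is a minimal R sketch of the tiling scheme. This is not the make_reads.pl script itself (that is a Perl script, and a real run over 3 gigabases would use it or another compiled tool for speed); chrom.name and chrom.seq are assumed to hold a chromosome label and its sequence as a single character string.

read.length = 50;

revcomp = function(s) {
  # reverse-complement a DNA string
  paste(rev(strsplit(chartr("ACGTacgt", "TGCAtgca", s), "")[[1]]), collapse="");
}

make.tiling.reads = function(chrom.name, chrom.seq, out.file) {
  # emit a 50mer starting at every position, on both strands, in FASTA form;
  # read names follow the [chrom]:[strand][position] convention used in the
  # output snippet below
  con = file(out.file, "w");
  n = nchar(chrom.seq);
  for (i in 1:(n - read.length + 1)) {
    fwd = substr(chrom.seq, i, i + read.length - 1);
    writeLines(sprintf(">%s:+%d\n%s", chrom.name, i, fwd), con);
    writeLines(sprintf(">%s:-%d\n%s", chrom.name, i, revcomp(fwd)), con);
  }
  close(con);
}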

Output was a set of SAM files from TMAP, which were then post-processed to a more compact form. See the Data section, below.

How much data?

3 gigabases * 2 strands * 50 bases/read (one 50mer starting at every position on each strand) = 300 gigabases, i.e. ~0.3 terabases of sequence input, which is a 50x uniform-coverage human genome on each strand.
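As a back-of-envelope check, the same arithmetic in R (assuming a ~3 gigabase haploid genome):

genome.size = 3e9;          # approximate haploid human genome size, in bases
read.length = 50;           # bases per read
strands = 2;                # a read starts at every position on each strand
total.bases = genome.size * strands * read.length;
total.bases / 1e12;         # ~0.3 terabases of raw sequence input
total.bases / genome.size;  # ~100x combined coverage, i.e. 50x per strand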

How did you do it?

We’re using Amazon’s cloud. Check out Amazon’s case study of Ion Flux.

How long did it take? How much did it cost?

It took less than one business day, and we can easily scale that down much further. Contact us [media] [business] if you want to know more about our pricing and methodology.

Data

Raw outputs are in the S3 bucket here: http://tmap-mapability.s3.amazonaws.com/. Here’s a snippet:

12:+111149052	12:+111149052	3	100
12:-111149052	12:-111149052	1	100
12:+111149053	12:+111149053	3	100
12:-111149053	12:-111149053	3	100
12:-111149054	12:-111149054	1	100
12:+111149054	12:+111149054	3	100

Format is like this:

[readChromosomeName]:[readStrand][readPosition] [targetChromosomeName]:[targetStrand][targetPosition] [mappingAlgorithm] [score]

Strand fields take the values + and -. Read lengths are always 50bp, so you can figure out the actual sequence aligned to the target. The mappingAlgorithm field can be one of {1, 2, 3}, corresponding to TMAP’s three mapping algorithms.
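To make the format concrete, here is a minimal R sketch that summarizes one of these files. The filename is a placeholder for a local copy of one raw output file, and counting a position as “mapping back home” when its read and target fields agree exactly is just one simple way to look at mapability, not necessarily the definition we’ll use in the full analysis.

# tmap_mapability_chunk.txt: hypothetical local copy of one raw output file
map = read.table("tmap_mapability_chunk.txt", header=FALSE,
                 col.names=c("read", "target", "algorithm", "score"),
                 stringsAsFactors=FALSE);
map$home = (map$read == map$target);   # read aligned back to its point of origin
mean(map$home);                        # fraction of positions mapping back home
table(map$algorithm);                  # how often each algorithm (1/2/3) placed the read
aggregate(score ~ algorithm, data=map, FUN=mean);  # mean mapping score per algorithm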

Where is the analysis?

We’ll be publishing data and some pretty pictures of the sensitivity/specificity analysis soon; stay tuned!

Human Genomics Crash Course 2

We’re posting a series of blog posts explaining genetics for the layperson in the context of their health.

This is part two of the series, and describes how de novo genome sequencing is done. Other posts in this series:

  1. Part 1, genotypes, phenotypes, and polymorphism

This video describes how DNA is extracted from a biological tissue sample and prepared to form a “library” of bacterial clones that contain small pieces of the sample’s DNA. These bacteria are then grown in culture and sequenced using the Sanger method:

After the DNA is read out, it needs to be assembled. This traditional method of sequencing and assembly focuses on reading out longer sequences, which can be assembled with fewer readouts of each position:

Contrast traditional sequencing with shotgun sequencing and assembly, shown here:

The main considerations when choosing between traditional and shotgun methods of sequencing are, roughly:

  • Price per base sequenced (physical materials)
  • Price per base assembled (computing costs)
  • Throughput: the shotgun method produces sequence at a higher rate, but with more errors and more assembly problems

Lower costs for sequencing each base favor the shotgun method over the traditional method. Shotgun sequencing is the method of choice today because of decreases in price-per-base sequenced, to be discussed in a later article.

Human Genomics Crash Course 1

We’re posting a series of blog posts explaining genetics for the layperson in the context of their health.

This is part one of the series, and introduces the fundamental genetics concepts of genotype, phenotype, polymorphism and a vignette of how genomic data can be applied in a clinical setting. We’ll use some of the excellent marketing material from the personal genotyping company 23andMe.

Genes: background.

SNPs: much of what we find through our data analysis consists of SNPs. A SNP is a specific type of polymorphism, or sequence variant. We’re looking for SNPs that have some effect (a functional consequence).

Heredity: background.

Genotypes, phenotypes and medicine.


AWS Ion Flux Case Study

Our case-study with Amazon AWS was published today: http://aws.amazon.com/solutions/case-studies/ion-flux/

Genome Coverage Irregularities

We’ve been recently working with a next-gen sequencing data set.

I want to point out a feature of the data that we didn’t expect to see, namely that genome coverage has very large inter-region variance.  Check out this z-score normalized distribution of read coverage per 1Mbase genomic region:

Not beautiful.  There is a ~600 sd outlier causing trouble.  This is interesting simply by virtue of being surprising.

It’s also interesting for practical reasons: some assumptions built into our software design aren’t optimized for this kind of very irregular, very high coverage.  The good news is that we can clean this up without too much effort.  The distribution of reads is presumed to be Poisson, which approaches normal at higher levels of coverage.  Let’s see what happens if we trim off the top decile of data:

Better.  What’s up with this blip at -10z?  Oh, and yes, it does start to look normal if we bandpass and chop off the bottom decile as well:

There’s some R code below for how I went about this. I’m also attaching a data file of the overall z-score coverage for all ~3K 1Mbase genomic regions.

The format of the file is like:

chr1-20-143 -0.410984633519499
chr13-20-100 -0.271009467956187
chr14-20-69 -0.190734553729517
[...]

Column 2 is the z-score. Column 1 is the genome region, named as [chromosome]-[bits]-[bin index]. The bits value sets the region size; here we use bits=20, so each region spans 2**20 bases, i.e. ~1 Mbase.

I might take a closer look at what portions of the genome are falling into this blip at -10z. It’s at half the coverage of the main data mode… maybe a copy number variant?


# Bin size and read length used to convert per-bin read counts to coverage.
bin.size = 2**20;    # 2^20 bases = ~1 Mbase per genomic region
read.length = 90;    # read length in bases for this data set

# bin_counts.dat: column 1 = read count per bin, column 2 = region name (e.g. chr1-20-143)
bc = read.table('~/Desktop/bin_counts.dat',header=F);

# Trim bins by read count: drop the bottom decile, the top decile, or both.
bc.hipass = bc[bc[,1]>quantile(bc[,1],0.1),];
bc.lopass = bc[bc[,1]<quantile(bc[,1],0.9),];
bc.bandpass = bc[bc[,1]<quantile(bc[,1],0.9)&bc[,1]>quantile(bc[,1],0.1),];

# Mean and standard deviation of per-bin coverage, estimated from the bandpassed bins.
coverage.mean = mean(read.length*bc.bandpass[,1]/bin.size);
coverage.sd = sd(read.length*bc.bandpass[,1]/bin.size);

# z-scores of per-bin coverage relative to the bandpassed mean and sd.
z.bandpass = ((read.length*bc.bandpass[,1]/bin.size) - coverage.mean) / coverage.sd;
z.hipass = ((read.length*bc.hipass[,1]/bin.size) - coverage.mean) / coverage.sd;
z.lopass = ((read.length*bc.lopass[,1]/bin.size) - coverage.mean) / coverage.sd;
z = ((read.length*bc[,1]/bin.size) - coverage.mean) / coverage.sd;

# Label the full set of z-scores with their region names.
z.label = as.matrix(z);
rownames(z.label) = bc[,2];

# Write out the extreme regions (z > 30 or z < -10) and the full table of z-scores.
write(names(z.label[z.label>30|z.label<(-10),]),"/Users/allenday/Desktop/genome_coverage_z10.txt");
write.table(cbind(as.character(bc[,2]),z),"/Users/allenday/Desktop/genome_coverage_zscores.txt",quote=F,row.names=F,col.names=F,sep="\t");

# Histograms of the low-pass, band-pass, and high-pass z-score distributions.
hist(z.lopass);
hist(z.bandpass);
hist(z.hipass);
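As a follow-up to the -10z blip question above, here is a minimal sketch that pulls the affected regions out of the attached z-score file. The -12 to -8 window is just eyeballed around the blip.

# two whitespace-separated columns: region name, z-score
zs = read.table("/Users/allenday/Desktop/genome_coverage_zscores.txt", header=FALSE,
                col.names=c("region", "z"), stringsAsFactors=FALSE);
blip = zs[zs$z > -12 & zs$z < -8, ];   # regions sitting in the blip near -10z
blip[order(blip$z), ];
# count blip regions per chromosome; clustering on one chromosome
# would support the copy-number-variant guess
table(sub("-.*", "", blip$region));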