The purpose of generating the data set described here is two-fold:

  1. We want to better understand the sensitivity and specificity of particular sequences/regions in the genome, ultimately leading to faster, higher quality genome alignments.
  2. We want to establish time and cost benchmarks for processing large volumes of Human genome data.


We processed a 50bp, 1bp-step sliding-window data set of Human Genome build 19 (HG19) using Ion Torrent’s TMAP algorithm (from author Dr. Nils Homer). The specificity/sensitivity analysis is underway right now and will be published here in a follow-up blog post.


Let’s look at the gross statistics of the operation:

How were the data generated?

Input was a 1bp-step, 50mer tiling data set generated from HG19. We’re not hosting these data, but you can generate them yourself with this script.
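
The generation script isn’t reproduced here, but the tiling itself can be sketched in a few lines of Python. This is an illustrative sketch, not the actual script: the function name, read-ID scheme, and N-handling are assumptions, and it takes a chromosome as a plain string rather than parsing FASTA.

```python
def tile_reads(chrom_name, sequence, k=50, step=1):
    """Yield (read_id, read_seq) for every k-mer window on both strands.

    `sequence` is one chromosome as an uppercase string; positions are 0-based.
    Read IDs follow a chrom:+pos / chrom:-pos scheme (an assumption for
    illustration, chosen to match the output snippet later in the post).
    """
    complement = str.maketrans("ACGTN", "TGCAN")
    for start in range(0, len(sequence) - k + 1, step):
        window = sequence[start:start + k]
        if "N" in window:  # skip assembly gaps / ambiguous bases
            continue
        # Forward-strand read.
        yield (f"{chrom_name}:+{start}", window)
        # Reverse-complement read for the minus strand.
        yield (f"{chrom_name}:-{start}", window.translate(complement)[::-1])

# Toy example with k=4: 5 window starts * 2 strands = 10 reads.
reads = list(tile_reads("12", "ACGTACGT", k=4))
```

With a 1bp step and 50mer windows, every non-gap base of the genome appears in 50 reads per strand, which is where the coverage figure below comes from.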

Output was a set of SAM files from TMAP, which were then post-processed to a more compact form. See the Data section, below.

How much data?

3 gigabases * 2 strands * 50 bases/read (one read starting at every basepair on each strand) = 300 gigabases of sequence input, i.e. 50x uniform coverage of the Human genome per strand.
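
That volume arithmetic can be checked in a couple of lines (the constants are the rounded figures used throughout this post):

```python
GENOME_BASES = 3_000_000_000  # ~haploid human genome length, rounded
READ_LENGTH = 50              # 50mer reads
STRANDS = 2                   # one read per position on each strand

# 1bp step -> one read starting at (almost) every position, per strand.
n_reads = GENOME_BASES * STRANDS
total_bases = n_reads * READ_LENGTH   # total sequence input, in bases

# Coverage per strand: each base sits inside READ_LENGTH overlapping 50mers.
coverage_per_strand = READ_LENGTH
```

This ignores edge effects at chromosome ends and reads skipped for assembly gaps, which are negligible at genome scale.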

How did you do it?

We’re using Amazon’s cloud. Check out Amazon’s case study of Ion Flux.

How long did it take? How much did it cost?

It took less than one business day, and we can easily scale that time down much further. Contact us [media] [business] if you want to know more about our pricing and methodology.


Raw outputs are in the S3 bucket here. Here’s a snippet:

12:+111149052	12:+111149052	3	100
12:-111149052	12:-111149052	1	100
12:+111149053	12:+111149053	3	100
12:-111149053	12:-111149053	3	100
12:-111149054	12:-111149054	1	100
12:+111149054	12:+111149054	3	100

Format is like this:

[readChromosomeName]:[readStrand][readPosition] [targetChromosomeName]:[targetStrand][targetPosition] [mappingAlgorithm] [score]

Strand fields take the values + and -. Read lengths are always 50bp, so you can figure out the actual sequence aligned to the target. The mappingAlgorithm can be one of {1, 2, 3}.
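
A few lines of Python suffice to parse records in this format. The field names follow the description above; tab separation is assumed from the snippet, and the function name is illustrative:

```python
def parse_record(line):
    """Parse one tab-separated alignment record into a dict.

    Field layout (from the format description above):
      read chrom:strand+position, target chrom:strand+position,
      mapping algorithm (1, 2, or 3), score.
    """
    read_field, target_field, algorithm, score = line.rstrip("\n").split("\t")

    def split_locus(field):
        chrom, rest = field.split(":")
        # rest looks like "+111149052": strand character, then position.
        return chrom, rest[0], int(rest[1:])

    return {
        "read": split_locus(read_field),       # (chromosome, strand, position)
        "target": split_locus(target_field),
        "algorithm": int(algorithm),
        "score": int(score),
    }

# First line of the snippet above:
record = parse_record("12:+111149052\t12:+111149052\t3\t100")
```

Since a read’s origin is encoded in its name, comparing the `read` and `target` tuples of a record tells you immediately whether the aligner placed the read back where it came from.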

Where is the analysis?

We’ll be publishing data and some pretty pictures of the sensitivity/specificity analysis soon; stay tuned!