We’ve been recently working with a next-gen sequencing data set.

I want to point out a feature of the data that we didn’t expect to see, namely that genome coverage has very large inter-region variance.  Check out this z-score normalized distribution of read coverage per 1Mbase genomic region:

Not beautiful.  There is some some ~600sd outlier causing trouble.  This is interesting simply on merit of being surprising.

It’s also intersting for practical considerations — some assumptions built into our software design aren’t optimized for this kind of very irregular, very high coverage.  The good news is that we can clean this up without too much effort.  The distribution of reads is presumed to be Poisson, for closer to normal at higher levels of coverage.  Let’s see what happens if we trim off the top decile of data:

Better.  What’s up with this blip at -10z?  Oh, and yes, it does start to look normal if we bandpass and chop off the bottom decile as well:

There’s some R code below for how I went about this. I’m also attaching a data file of the overall z-score coverage for all ~3K 1Mbase genomic regions.

The format of the file is like:

chr1-20-143 -0.410984633519499
chr13-20-100 -0.271009467956187
chr14-20-69 -0.190734553729517

Column 2 is the z-score. Column 1 is the genome region. . Bits is used to compute the region size. Here we use bits=20, so 2**20=1MBase regions.

I might take a closer look what portions of the genome are falling into this blip at -10z. It’s at half coverage of the main data mode… maybe a copy number variant?

bin.size = 2**20;
read.length = 90;
bc = read.table('~/Desktop/bin_counts.dat',header=F);
bc.hipass = bc[bc[,1]>quantile(bc[,1],0.1),];
bc.lopass = bc[bc[,1]<quantile(bc[,1],0.9),];
bc.bandpass = bc[bc[,1]<quantile(bc[,1],0.9)&bc[,1]>quantile(bc[,1],0.1),];
coverage.mean = mean(read.length*bc.bandpass[,1]/bin.size);
coverage.sd = sd(read.length*bc.bandpass[,1]/bin.size);
z.bandpass = ((read.length*bc.bandpass[,1]/bin.size) - coverage.mean) / coverage.sd;
z.hipass = ((read.length*bc.hipass[,1]/bin.size) - coverage.mean) / coverage.sd;
z.lopass = ((read.length*bc.lopass[,1]/bin.size) - coverage.mean) / coverage.sd;
z = ((read.length*bc[,1]/bin.size) - coverage.mean) / coverage.sd;
z.label = as.matrix(z);
rownames(z.label) = bc[,2];