Monthly Archives: September 2012

Sequestering the hand the feeds

ASBMB policy blog just posted a concise summary of what impact sequestration could have on NIH and NSF funded research.

Today, the administration’s Office of Budget Management (OMB) released its plan, and it paints a pretty grim picture for biomedical research. The NIH and NSF would each sustain an 8.2% cut to their budgets which would amount to a reduction of roughly $2.5 billion and $460 million, respectively.

An 8.2% budget cut doesn’t mean that we are “trimming the fat”, or keeping expenses under control. I’m also not sure if it means we’ll be losing many jobs either. The scientific community is closely knit, and funding has been stagnant for a few years. We’ll keep doing some research, but not as much, and the size of incoming graduate student cohorts will continue to dwindle as training grants become smaller. I’m not sure whether this last part is overwhelmingly good or bad. Continue reading

Spreadsheets for Science?

What place do spreadsheets have in biological sciences? Are they a useful format for data sharing with colleagues, or a useless blight on information superhighway? Your opinion of spreadsheets might directly correlate with your technical knowledge. For the scientist who understands that “everything is a file” and expects to read and write their data to straightforward and open formats, spreadsheets might be something of a nuisance. For the scientist whose first and last word in statistics is Microsoft Excel, spreadsheets are an essential and powerful window into a set of data. What if you more are interested in motif detection and DNA transcription than column sums and vlookups?  It would be convenient for most biologists if a spreadsheet could handle this type of task. Pierre Lindenbaum has written an excellent example of such a utility on his blog. Implemented in javascript, and utilizing Google Apps Script in Google Drive (Docs), his utility will translate a DNA sequence into a protein sequence. Pretty neat.

ENCODE embroglio

The most recent issue of Nature was all about ENCODE. Most of the ENCODE findings are not exactly revolutionary to anyone in life sciences, but the relatively massive amount of press attention has created some surprising misinterpretations of “junk” DNA and what, exactly, ENCODE has published. John Timmer at Ars Technica writes a response to the claims of “junk DNA is solved!”, and does a pretty good job tempering the enthusiasm of CNN. My favorite quotes from the CNN article:

When the Human Genome Project sequenced the human genome in 2003, it established the order of the 3 billion letters in the genome, which can be thought of as “the book of life.”

We are still establishing the order of the “letters in the genome”, so – strike one against Liz Landau. So Liz, what exactly did  ENCODE determine?

Continue reading

De-multiplexing paired-end sequencing data

After surveying the existing tools for de-multiplexing barcoded paired-end sequencing reads, I grew frustrated and rolled my own solution. The issue with most of the tools for de-multiplexing reads from massively parallel sequencing is that they operate on fastq files. Since paired-end sequencing typically generates two fastq files (one for forward reads and one for reverse reads), it becomes more difficult to apply existing single-end read tools without doing some extra matching and filtering. Most people seem to either:

  1. Merge the forward and reverse reads into one “megaread”, demultiplex these based on barcode sequence, and then split the resulting reads back into forward and reverse before mapping.
  2. Filter the forward reads based on barcodes, and take advantage of same sorting order of forward and reverse reads to match pairs from the reverse reads.
Both of these routes involve creating many individual fastq files that will be individually mapped. Depending on the aligner and amount of sequence, mapping something like 75 de-multiplexed sets of reads could be inefficient since you would be initializing the aligner and indexed reference genome 75 times. This does not seem like an optimal solution.
Since my immediate purpose is amplicon resequencing on a MiSeq, the number of reads I will be dealing with is fairly low, so I think I can design a more logical workflow. Ideally, I would like to map all the reads at once, moving the barcode from each read into the SAM BC tag. Then I can split the resulting mapped SAM file into de-multiplexed files for analysis.
First off, mapping my reads with BWA. I’m using Bpipe for pipeline management.
@Transform("sam")
bwa_aln_bc = {
    exec "bwa aln -t $threads -B $length $bwa_reference $input1 > ${input1}.sai"
    exec "bwa aln -t $threads  $bwa_reference $input2 > ${input2}.sai"
    exec "bwa sampe $bwa_reference ${input1}.sai ${input2}.sai $input1 $input2 > $output"
}

 The above function truncates $length number of bases from each forward read, and assigns that as a BC tag in the resulting SAM alignment e.g.: “BC:Z:TTAATGC”.

Next, we will split the barcoded SAM file into multiple files, based on the barcodes, while preserving the SAM header. All reads that do not match a barcode in the supplied table will be written to a separate file.

#!/bin/sh                                                                               
if [ $# -eq 0 ] ; then
    echo 'Usage: splitSam.sh input.sam barcodes.txt'
    echo ''
    echo 'barcodes.txt must be two columns with tab delimeter'
    echo 'column1 = barcode name, column2 = barcode sequence'
fi
SAM=$1
BC_FILE=$2

#Capture the SAM header                                                                 
SAM_H=`samtools view -SH $SAM`

#Read barcode table into array line by line                                             
#then grep barcoded SAM reads to file                                                   
while IFS=$'\t' read -r -a array
do
    sampleName=${array[0]} #barcode name                                                
    BC=${array[1]} #barcode sequence                                                    
    BC=${BC%"${BC##*[![:space:]]}"} #remove trailing whitespace                         
    if grep -q -m 1 -e "BC:Z:$BC" $SAM; then
        printf "$SAM_H\n" > ${sampleName}.sam #write SAM header to file                 
        grep -e "BC:Z:$BC" $SAM >> ${sampleName}.sam #write barcoded reads              
    fi
    printf "BC:Z:$BC\n" >> /tmp/bc #patterns for unmatched reads                        
done < $BC_FILE

#Write unmatched reads to file                                                          
grep -v -f /tmp/bc $SAM > unmatched.sam #write unmatched reads                          
rm /tmp/bc

This approach, while efficient for a small number of reads, may be inappropriate for larger projects.