Category Archives: linux

Automating Apple Time Machine backups to Amazon Glacier using Python, tar, and ZFS

In a previous post I played around with Amazon Glacier, using a tool called glacierFreezer. Since then, I’ve wanted to automate backups of my Time Machine archives, as well as my photos and home directories. Looking around for more current Glacier interfaces I noticed a project called glacier-cmd which looks promising. The core utilities are written in Python, and provide means to upload, download, and query Glacier vaults.

Adding AirPlay to an external USB audio interface

The Raspberry Pi is a fairly powerful $25 single-board computer targeted toward the educational market, though just because it’s for kids doesn’t mean it’s not fun for adults. I’ve been wanting to buy a Pi for a while now, but couldn’t justify purchasing hardware I have no use for, at least until I saw this blog post detailing how to use the Pi as an Apple AirPlay receiver. This is perfect. Let’s get started turning our little computer into a single-purpose appliance.

Here it is, after a few weeks of waiting!

Using Galaxy in the cloud

Recently I gave a presentation about Galaxy, and as part of the presentation I walked about 30 people through setting up a Galaxy cluster through Amazon Web Services (AWS). The AWS setup took most of an hour, and moving 30 people through each step was painful. From pain comes prosperity (apparently), because today I stumbled on a link from the main Galaxy public server that allows a user to automatically initialize a Galaxy cluster through AWS! Where were you last month? Anyway, I’ve updated the presentation with a link to the site. I’ve not tested this method of CloudMan Galaxy initialization, but I expect it to work well.

De-multiplexing paired-end sequencing data

After surveying the existing tools for de-multiplexing barcoded paired-end sequencing reads, I grew frustrated and rolled my own solution. The issue with most of the tools for de-multiplexing reads from massively parallel sequencing is that they operate on fastq files. Since paired-end sequencing typically generates two fastq files (one for forward reads and one for reverse reads), it becomes more difficult to apply existing single-end read tools without doing some extra matching and filtering. Most people seem to either:

  1. Merge the forward and reverse reads into one “megaread”, demultiplex these based on barcode sequence, and then split the resulting reads back into forward and reverse before mapping.
  2. Filter the forward reads based on barcodes, and take advantage of the same sorting order of forward and reverse reads to match pairs from the reverse reads.

Both of these routes involve creating many individual fastq files that will be individually mapped. Depending on the aligner and amount of sequence, mapping something like 75 de-multiplexed sets of reads could be inefficient, since you would be loading the aligner and its indexed reference genome 75 times. This does not seem like an optimal solution.

Since my immediate purpose is amplicon resequencing on a MiSeq, the number of reads I will be dealing with is fairly low, so I think I can design a more logical workflow. Ideally, I would like to map all the reads at once, moving the barcode from each read into the SAM BC tag. Then I can split the resulting mapped SAM file into de-multiplexed files for analysis.

First off, I map my reads with BWA. I’m using Bpipe for pipeline management.

@Transform("sam")
bwa_aln_bc = {
    exec "bwa aln -t $threads -B $length $bwa_reference $input1 > ${input1}.sai"
    exec "bwa aln -t $threads  $bwa_reference $input2 > ${input2}.sai"
    exec "bwa sampe $bwa_reference ${input1}.sai ${input2}.sai $input1 $input2 > $output"
}

The above function trims the first $length bases from each forward read and writes them to the BC tag of the resulting SAM alignment, e.g. “BC:Z:TTAATGC”.
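To make the barcode handling concrete, here is a small Python sketch of what trimming a barcode and building the BC tag amounts to. It is only an illustration of the idea, not BWA's implementation; the function names `trim_barcode` and `bc_tag` and the example read are made up for this post.

```python
def trim_barcode(read_seq, read_qual, length):
    """Split a read into (barcode, trimmed sequence, trimmed qualities).

    Hypothetical helper: the first `length` bases are treated as the barcode.
    """
    if length < 0 or length > len(read_seq):
        raise ValueError("barcode length must fit within the read")
    barcode = read_seq[:length]
    return barcode, read_seq[length:], read_qual[length:]


def bc_tag(barcode):
    """Format a barcode as a SAM optional field of type Z (string)."""
    return "BC:Z:" + barcode


# Invented example read: a 7-base barcode followed by 8 bases of insert.
barcode, seq, qual = trim_barcode("TTAATGCACGTACGT", "IIIIIIIHHHHHHHH", 7)
print(bc_tag(barcode), seq)  # BC:Z:TTAATGC ACGTACGT
```

Only the trimmed sequence is aligned, so the barcode never interferes with mapping but still travels with the read.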

Next, we will split the barcoded SAM file into multiple files, based on the barcodes, while preserving the SAM header. All reads that do not match a barcode in the supplied table will be written to a separate file.

#!/bin/bash
#Arrays and $'\t' are bash features, so this script needs bash rather than sh
if [ $# -ne 2 ] ; then
    echo 'Usage: splitSam.sh input.sam barcodes.txt'
    echo ''
    echo 'barcodes.txt must be two columns with tab delimiter'
    echo 'column1 = barcode name, column2 = barcode sequence'
    exit 1
fi
SAM=$1
BC_FILE=$2
PATTERNS=$(mktemp) #temporary file of barcode patterns

#Capture the SAM header
SAM_H=$(samtools view -SH "$SAM")

#Read barcode table into array line by line
#then grep barcoded SAM reads to file
while IFS=$'\t' read -r -a array
do
    sampleName=${array[0]} #barcode name
    BC=${array[1]} #barcode sequence
    BC=${BC%"${BC##*[![:space:]]}"} #remove trailing whitespace
    if grep -q -m 1 -e "BC:Z:$BC" "$SAM"; then
        printf '%s\n' "$SAM_H" > "${sampleName}.sam" #write SAM header to file
        grep -e "BC:Z:$BC" "$SAM" >> "${sampleName}.sam" #write barcoded reads
    fi
    printf 'BC:Z:%s\n' "$BC" >> "$PATTERNS" #patterns for unmatched reads
done < "$BC_FILE"

#Write unmatched reads (plus the header lines) to file
grep -v -f "$PATTERNS" "$SAM" > unmatched.sam
rm "$PATTERNS"

This approach works well for a small number of reads, but it scans the whole SAM file up to twice per barcode, so it may be inappropriate for larger projects.
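For larger inputs, one option is to read the SAM file once and route each read by its BC tag. Below is a minimal Python sketch of that idea, not a tested replacement for the script above. The names `load_barcodes`, `split_sam` and `write_groups` are hypothetical. It compares the whole BC value against the table, where grep matches substrings, and it holds all reads in memory, which is only reasonable for modest files.

```python
import re

# Capture the value of the BC:Z: optional field on a SAM alignment line.
BC_TAG = re.compile(r"\tBC:Z:([^\t\n]+)")


def load_barcodes(lines):
    """Map barcode sequence -> sample name from 'name<TAB>sequence' lines."""
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1].strip():
            table[fields[1].strip()] = fields[0]
    return table


def split_sam(sam_lines, barcodes):
    """Group SAM lines by sample in one pass over the input.

    Header lines (starting with '@') are copied to every group; reads with
    no BC tag or an unknown barcode go to the 'unmatched' group.
    """
    header = []
    groups = {}
    for line in sam_lines:
        if line.startswith("@"):
            header.append(line)
            continue
        match = BC_TAG.search(line)
        sample = barcodes.get(match.group(1)) if match else None
        groups.setdefault(sample or "unmatched", []).append(line)
    return {name: header + reads for name, reads in groups.items()}


def write_groups(groups):
    """Write each group to <name>.sam in the current directory."""
    for name, lines in groups.items():
        with open(name + ".sam", "w") as handle:
            handle.writelines(lines)
```

Typical use would be to open the barcode table and the SAM file, pass the two file handles to `load_barcodes` and `split_sam`, and hand the result to `write_groups`.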

Amazon Glacier update

It looks like after 24 hours, my data are all there:

And my bill will be $0.15 this month.

Overall, this was a smooth process. Now, I think we’ll start freezing monthly versions of my Time Machine backups in Glacier. There is one small catch that I managed to miss in my original post. Glacier charges a minimum of three months of data storage, and prorates that amount if you delete your archives earlier. That really just means I’ll be keeping the three most recent versions of my monthly backups, and deleting anything older than three months.
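The retention rule can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: “three months” is treated as 90 days, archives are plain (name, upload date) pairs, and `archives_to_delete` is a hypothetical helper that only picks candidates. It does not talk to Glacier.

```python
from datetime import date, timedelta

MIN_STORAGE_DAYS = 90  # assumption: "three months" approximated as 90 days


def archives_to_delete(archives, today, keep=3):
    """Return names of archives that are safe to delete.

    An archive qualifies only if it is not among the `keep` newest AND has
    been stored for at least MIN_STORAGE_DAYS.
    """
    newest_first = sorted(archives, key=lambda a: a[1], reverse=True)
    cutoff = today - timedelta(days=MIN_STORAGE_DAYS)
    return [name for name, uploaded in newest_first[keep:] if uploaded <= cutoff]


# Invented example: five monthly backups, checked on 1 January 2013.
backups = [
    ("2012-08", date(2012, 8, 25)),
    ("2012-09", date(2012, 9, 25)),
    ("2012-10", date(2012, 10, 25)),
    ("2012-11", date(2012, 11, 25)),
    ("2012-12", date(2012, 12, 25)),
]
print(archives_to_delete(backups, date(2013, 1, 1)))  # ['2012-09', '2012-08']
```

Checking both conditions means a backup is never removed before its minimum storage period has passed, even if newer backups have pushed it out of the newest three.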

Cold storage: freezing my backups in Amazon Glacier

A couple of days ago, Amazon sent me an email about a new AWS service called “Glacier”. What a boring name, huh? It’s not called something sexy like “FlexStore”, or “FireVault”, and that’s by design. The idea behind Glacier is long-term, low-power (and therefore lower-cost) offsite storage. Facebook recently announced that they are moving to a similar solution for backups. You see, spinning platter hard drives take a constant amount of energy to keep running. If you write data to an HDD, and then pull the plug, you cut out the cost of operating the drive until you need to retrieve your data once again. For long-term backups which may never be accessed, this is an ideal solution. Unfortunately, Glacier just exists as an API for the moment. Peter Binkley wrote an excellent account of sending some data to Glacier, and then retrieving it. I think I’ll do the same, using a Java application called glacierFreezer.

The first step for me is to create an access key pair (an access key ID and a secret access key) through AWS IAM. I just created a user named “glacier” with access to all AWS functions, except IAM. This should be fairly safe, assuming I don’t kick off all kinds of unwanted services and rack up a huge bill. glacierFreezer also needs an AWS SimpleDB domain to interact with, and seems capable of creating one for us, but I’ll create one anyway with boto.

$ export AWS_ACCESS_KEY_ID=Your_AWS_Access_Key_ID
$ export AWS_SECRET_ACCESS_KEY=Your_AWS_Secret_Access_Key

$ python2
Python 2.7.3 (default, Apr 24 2012, 00:00:54) 
[GCC 4.7.0 20120414 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import boto
>>> connection = boto.connect_sdb()
>>> domain = connection.create_domain('glacierPhotos')
>>> domains = connection.get_all_domains()
>>> domains
[Domain:glacierPhotos]
>>> quit()

Next, after downloading the glacierFreezer jar, I need to think about what to back up. This will be “fire insurance” for our digital valuables, so let’s pick something I would lose in a fire: wedding photos. In the Glacier management console, I create a “vault” named “photos”. Now, let’s test our new backup system with a script that will send my wedding photos to Glacier:

#!/bin/sh
dir=$1

#Upload regular files only; reading names line by line keeps spaces intact
find "$dir" -type f | while IFS= read -r file
do
    java -jar glacierFreezer.jar 'accessKey' 'secretAccessKey' glacierPhotos photos "$file"
done

Run the script, and wait while 15 GB of JPEGs are uploaded to Glacier. Theoretically, this will cost me $0.15 per month. The cost of retrieval is higher, and it seems difficult to predict the exact cost at this point in time. Since this is a sort of offsite data insurance plan, the cost of retrieval will be worth it.
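The storage estimate is simple arithmetic, sketched below in Python. The rate of $0.01 per GB per month is an assumption inferred from the figures in this post (15 GB for $0.15); `monthly_storage_cost` is a hypothetical helper, and retrieval and request charges are deliberately left out.

```python
from decimal import Decimal

# Assumption: rate inferred from the post's own numbers (15 GB -> $0.15).
PRICE_PER_GB_MONTH = Decimal("0.01")


def monthly_storage_cost(size_gb, price=PRICE_PER_GB_MONTH):
    """Return the monthly storage cost in dollars, rounded to cents."""
    return (Decimal(str(size_gb)) * price).quantize(Decimal("0.01"))


print(monthly_storage_cost(15))  # 0.15
```

Decimal is used instead of float so that small currency amounts round predictably.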