Overview

  • Importing modules
  • Reading and writing files
  • Functions
  • Classes
  • Writing scripts

Importing Modules

  • Python comes with a fairly large number of functions and data structures "out of the box".
  • Some things are missing though
  • We can add functionality by importing modules
In [40]:
import time
time.time()
Out[40]:
1383091943.203496
In [41]:
time.strftime('%c', time.localtime())
Out[41]:
'Wed Oct 30 00:12:23 2013'
In [42]:
import math
math.log10(1000)
Out[42]:
3.0
In [43]:
import this

Reading and writing files

Using the open function, we can open any file for reading or writing.

In [44]:
fasta_file = open('hemoglobin_beta.fasta', 'r')
fasta_file
Out[44]:
<open file 'hemoglobin_beta.fasta', mode 'r' at 0x19e7b70>

open takes one of several modes:

  • 'r' for reading
  • 'w' for writing
  • 'a' for appending writes to the end of a file
  • 'r+' for both read and write
  • 'rb' and 'wb' for binary files
  • mostly this is unnecessary when your files are text

We can read the entire contents of a file with open.readlines

In [45]:
fasta_contents = fasta_file.readlines()
fasta_file.close() ## close open file
len(fasta_contents)
Out[45]:
39

The readlines method reads the entire file as list of strings, including the line terminator \n.

In [46]:
print fasta_contents
['>gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA\n', 'TGCTTCTGACATAGTTGTGTTGACTCACAAACTCAGAAACAGACACCATGGTGCACCTGACTGATGCTGA\n', 'GAAGGCTGCTGTTAATGGCCTGTGGGGAAAGGTGAACCCTGATGATGTTGGTGGCGAGGCCCTGGGCAGG\n', 'CTGCTGGTTGTCTACCCTTGGACCCAGAGGTACTTTGATAGCTTTGGGGACCTGTCCTCTGCCTCTGCTA\n', 'TCATGGGTAACCCTAAGGTGAAGGCCCATGGCAAGAAGGTGATAAACGCCTTCAATGATGGCCTGAAACA\n', 'CTTGGACAACCTCAAGGGCACCTTTGCTCATCTGAGTGAACTCCACTGTGACAAGCTGCATGTGGATCCT\n', 'GAGAACTTCAGGCTCCTGGGCAATATGATTGTGATTGTGTTGGGCCACCACCTGGGCAAGGAATTCACCC\n', 'CCTGTGCACAGGCTGCCTTCCAGAAGGTGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTA\n', 'AACCTCTTTTCCTGCTCTTGTCTTTGTGCAATGGTCAATTGTTCCCAAGAGAGCATCTGTCAGTTGTTGT\n', 'CAAAATGACAAAGACCTTTGAAAATCTGTCCTACTAATAAAAGGCATTTACTTTCACTGC\n', '>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA\n', 'ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA\n', 'GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC\n', 'AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG\n', 'CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC\n', 'TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT\n', 'CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA\n', 'CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA\n', 'CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT\n', 'GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC\n', '>gi|160358323|ref|NM_173917.2| Bos taurus hemoglobin, beta (HBB), mRNA\n', 'ACACTTGCTTCTGACACAACCGTGTTCACTAGCAACTACACAAACAGACACCATGCTGACTGCTGAGGAG\n', 'AAGGCTGCCGTCACCGCCTTTTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGC\n', 'TGCTGGTTGTCTACCCCTGGACTCAGAGGTTCTTTGAGTCCTTTGGGGACTTGTCCACTGCTGATGCTGT\n', 'TATGAACAACCCTAAGGTGAAGGCCCATGGCAAGAAGGTGCTAGATTCCTTTAGTAATGGCATGAAGCAT\n', 'CTCGATGACCTCAAGGGCACCTTTGCTGCGCTGAGTGAGCTGCACTGTGATAAGCTGCATGTGGATCCTG\n', 'AGAACTTCAAGCTCCTGGGCAACGTGCTAGTGGTTGTGCTGGCTCGCAATTTTGGCAAGGAATTCACCCC\n', 'GGTGCTGCAGGCTGACTTTCAGAAGGTGGTGGCTGGTGTGGCCAATGCCCTGGCCCACAGATATCATTAA\n', 'GCTCCCTTTCCTGCTTTCCAGGAAAGGTTTTTTCATCCTCAGAGCCCAAAGATTGAATATGGAAAAATTA\n', 'TGAAGTGTTTTGAGCATCTGGCCTCTGCCTAATAAAGACATTTATTTTCATTGCAAAAAAAAAAAAAAAA\n', 'AAA\n', '>gi|256600241|ref|NM_001164428.1| Macaca mulatta hemoglobin, beta (HBB), mRNA\n', 'ACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGACTGCCGTTACCACCCTGTGGGGCAAGGTGAACG\n', 'TGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGA\n', 'TTCCTTTGGGGATCTGTCCTCTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAA\n', 'GTGCTTGGTGCCTTTAGTGATGGCCTGAATCACCTGGACAACCTCAAGGGCACCTTTGCCCAGCTCAGTG\n', 'AGCTGCACTGTGACAAGCTGCATGTGGATCCTGAGAACTTCAAGCTCCTGGGCAACGTGCTGGTGTGTGT\n', 'GCTGGCCCATCACTTTGGCAAAGAATTCACCCCGCAAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGT\n', 'GTGGCTAATGCCCTGGCCCACAAGTACCACTAAGCTCACTTTCTTGCTGTC\n']

In [47]:
[line.strip('\n') for line in fasta_contents][:9]
Out[47]:
['>gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA',
 'TGCTTCTGACATAGTTGTGTTGACTCACAAACTCAGAAACAGACACCATGGTGCACCTGACTGATGCTGA',
 'GAAGGCTGCTGTTAATGGCCTGTGGGGAAAGGTGAACCCTGATGATGTTGGTGGCGAGGCCCTGGGCAGG',
 'CTGCTGGTTGTCTACCCTTGGACCCAGAGGTACTTTGATAGCTTTGGGGACCTGTCCTCTGCCTCTGCTA',
 'TCATGGGTAACCCTAAGGTGAAGGCCCATGGCAAGAAGGTGATAAACGCCTTCAATGATGGCCTGAAACA',
 'CTTGGACAACCTCAAGGGCACCTTTGCTCATCTGAGTGAACTCCACTGTGACAAGCTGCATGTGGATCCT',
 'GAGAACTTCAGGCTCCTGGGCAATATGATTGTGATTGTGTTGGGCCACCACCTGGGCAAGGAATTCACCC',
 'CCTGTGCACAGGCTGCCTTCCAGAAGGTGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTA',
 'AACCTCTTTTCCTGCTCTTGTCTTTGTGCAATGGTCAATTGTTCCCAAGAGAGCATCTGTCAGTTGTTGT']
  • We can use the strip method for each string in our new list.
  • 'AGTC\n'.strip() == 'AGTC\n'.strip('\n\r\t ')
  • The method default is to strip all "whitespace" characters
In [48]:
for line in fasta_contents:
    line = line.strip()
    if line[0] == '>':
        print 'Header is %s' % (line)
    else:
        pass
Header is >gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA
Header is >gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
Header is >gi|160358323|ref|NM_173917.2| Bos taurus hemoglobin, beta (HBB), mRNA
Header is >gi|256600241|ref|NM_001164428.1| Macaca mulatta hemoglobin, beta (HBB), mRNA

  • We can loop through the list and identify the FASTA header lines by '>' at index 0

You can also iterate directly on an open file.

In [49]:
fasta_file = open('hemoglobin_beta.fasta', 'r')
for line in fasta_file:
    line = line.strip()
    if line[0] == '>':
        print 'Header is %s' % (line)
    else:
        pass ## do nothing
fasta_file.close()
Header is >gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA
Header is >gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
Header is >gi|160358323|ref|NM_173917.2| Bos taurus hemoglobin, beta (HBB), mRNA
Header is >gi|256600241|ref|NM_001164428.1| Macaca mulatta hemoglobin, beta (HBB), mRNA

Exercise 1: Identify the sequence lines in a FASTA file

How would you change the previous for loop to identify only the sequence lines from the FASTA file?

Exercise 2: Count the sequences in a FASTA file

How would you count the number of sequences in a FASTA file?

We can use open to write a file using a loop.

In [50]:
outfile = open('just_the_headers.fasta', 'w')
for line in fasta_contents:
    line = line.strip()
    if line[0] == '>':
        outfile.write(line + '\n') ## add back a line terminator
outfile.close()

Context handler

In [51]:
with open('hemoglobin_beta.fasta', 'r') as fasta_file:
    for line in fasta_file:
        if line[0] == '>':
            print "I found a header!"
I found a header!
I found a header!
I found a header!
I found a header!

The with context handler automatically closes a file after the subsequent code block finishes evaluation.

Functions

In [52]:
def get_fasta_headers(fasta):
    headers = [] ## an empty list to store found headers
    for line in fasta:
        if line[0] == '>':
            headers.append(line) ## add header to headers list
        else:
            pass 
    return headers
In [53]:
fasta_headers = get_fasta_headers(fasta_contents)
fasta_headers
Out[53]:
['>gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA\n',
 '>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA\n',
 '>gi|160358323|ref|NM_173917.2| Bos taurus hemoglobin, beta (HBB), mRNA\n',
 '>gi|256600241|ref|NM_001164428.1| Macaca mulatta hemoglobin, beta (HBB), mRNA\n']

Functions take arguments and return values.

def function(args):
    do_something
    return(something)
  • return can only be called once, and then the function ends
  • args can be any type of object
In [54]:
def get_fasta_headers(fasta, count=False):
    counter = 0 ## counter if we need it
    headers = [] ## an empty list to store found headers
    for line in fasta:
        if line[0] == '>':
            headers.append(line) ## add header to headers list
            if count:
                counter += 1
        else:
            pass 
    if count:
        return (headers, counter)
    return headers
In [55]:
get_fasta_headers(fasta_contents, count=True)
Out[55]:
(['>gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA\n',
  '>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA\n',
  '>gi|160358323|ref|NM_173917.2| Bos taurus hemoglobin, beta (HBB), mRNA\n',
  '>gi|256600241|ref|NM_001164428.1| Macaca mulatta hemoglobin, beta (HBB), mRNA\n'],
 4)

Exercise 3: complement DNA sequence

How can we write a function that takes a DNA string and returns the complementary sequence?

Classes

Classes are like a blueprint for an object in Python. They are defined similarly to functions.

In [56]:
class fasta:
    def __init__(self, name='', seq=''):
        self.name = name
        self.seq = seq
    def __len__(self):
        return len(self.seq)
    def complement(self):
        pairings = {'a':'t', 't':'a',
                    'g':'c', 'c':'g'}
        comp = ''
        for base in self.seq.lower():
            comp += pairings[base]
        return comp
In [57]:
seq1 = fasta(name='seq1', seq='ATGGCTATGTATGAT')
seq1.complement()
Out[57]:
'taccgatacatacta'

Writing scripts

So far we have been using Python interactively. This is fine for small, one-time tasks. However, if we need to run a longer or more complicated task we should write a script.

We just write the code in a text editor and save it with a .py extension.

with open('hemoglobin_beta.fasta', 'r') as fasta_file:
    for line in fasta_file:
        if line[0] == '>':
            print line

Then simply run the script from your shell:

$ python find_headers.py
>gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
>gi|160358323|ref|NM_173917.2| Bos taurus hemoglobin, beta (HBB), mRNA
>gi|256600241|ref|NM_001164428.1| Macaca mulatta hemoglobin, beta (HBB), mRNA

Homework!

Write a script that reads this file and converts each sequence from DNA to RNA. Email me the script as well as the resulting sequences!