importing
modulesimport time
time.time()
1383091943.203496
time.strftime('%c', time.localtime())
'Wed Oct 30 00:12:23 2013'
import math
math.log10(1000)
3.0
import this
Using the open
function, we can open any file for reading or writing.
fasta_file = open('hemoglobin_beta.fasta', 'r')
fasta_file
<open file 'hemoglobin_beta.fasta', mode 'r' at 0x19e7b70>
open
takes one of several modes:
We can read the entire contents of a file with open.readlines
fasta_contents = fasta_file.readlines()
fasta_file.close() ## close open file
len(fasta_contents)
39
The readlines
method reads the entire file as list of strings, including the line terminator \n
.
print fasta_contents
['>gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA\n', 'TGCTTCTGACATAGTTGTGTTGACTCACAAACTCAGAAACAGACACCATGGTGCACCTGACTGATGCTGA\n', 'GAAGGCTGCTGTTAATGGCCTGTGGGGAAAGGTGAACCCTGATGATGTTGGTGGCGAGGCCCTGGGCAGG\n', 'CTGCTGGTTGTCTACCCTTGGACCCAGAGGTACTTTGATAGCTTTGGGGACCTGTCCTCTGCCTCTGCTA\n', 'TCATGGGTAACCCTAAGGTGAAGGCCCATGGCAAGAAGGTGATAAACGCCTTCAATGATGGCCTGAAACA\n', 'CTTGGACAACCTCAAGGGCACCTTTGCTCATCTGAGTGAACTCCACTGTGACAAGCTGCATGTGGATCCT\n', 'GAGAACTTCAGGCTCCTGGGCAATATGATTGTGATTGTGTTGGGCCACCACCTGGGCAAGGAATTCACCC\n', 'CCTGTGCACAGGCTGCCTTCCAGAAGGTGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTA\n', 'AACCTCTTTTCCTGCTCTTGTCTTTGTGCAATGGTCAATTGTTCCCAAGAGAGCATCTGTCAGTTGTTGT\n', 'CAAAATGACAAAGACCTTTGAAAATCTGTCCTACTAATAAAAGGCATTTACTTTCACTGC\n', '>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA\n', 'ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA\n', 'GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC\n', 'AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG\n', 'CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC\n', 'TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT\n', 'CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA\n', 'CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA\n', 'CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT\n', 'GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC\n', '>gi|160358323|ref|NM_173917.2| Bos taurus hemoglobin, beta (HBB), mRNA\n', 'ACACTTGCTTCTGACACAACCGTGTTCACTAGCAACTACACAAACAGACACCATGCTGACTGCTGAGGAG\n', 'AAGGCTGCCGTCACCGCCTTTTGGGGCAAGGTGAAAGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGC\n', 'TGCTGGTTGTCTACCCCTGGACTCAGAGGTTCTTTGAGTCCTTTGGGGACTTGTCCACTGCTGATGCTGT\n', 'TATGAACAACCCTAAGGTGAAGGCCCATGGCAAGAAGGTGCTAGATTCCTTTAGTAATGGCATGAAGCAT\n', 'CTCGATGACCTCAAGGGCACCTTTGCTGCGCTGAGTGAGCTGCACTGTGATAAGCTGCATGTGGATCCTG\n', 'AGAACTTCAAGCTCCTGGGCAACGTGCTAGTGGTTGTGCTGGCTCGCAATTTTGGCAAGGAATTCACCCC\n', 'GGTGCTGCAGGCTGACTTTCAGAAGGTGGTGGCTGGTGTGGCCAATGCCCTGGCCCACAGATATCATTAA\n', 'GCTCCCTTTCCTGCTTTCCAGGAAAGGTTTTTTCATCCTCAGAGCCCAAAGATTGAATATGGAAAAATTA\n', 'TGAAGTGTTTTGAGCATCTGGCCTCTGCCTAATAAAGACATTTATTTTCATTGCAAAAAAAAAAAAAAAA\n', 'AAA\n', '>gi|256600241|ref|NM_001164428.1| Macaca mulatta hemoglobin, beta (HBB), mRNA\n', 'ACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGACTGCCGTTACCACCCTGTGGGGCAAGGTGAACG\n', 'TGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGA\n', 'TTCCTTTGGGGATCTGTCCTCTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAA\n', 'GTGCTTGGTGCCTTTAGTGATGGCCTGAATCACCTGGACAACCTCAAGGGCACCTTTGCCCAGCTCAGTG\n', 'AGCTGCACTGTGACAAGCTGCATGTGGATCCTGAGAACTTCAAGCTCCTGGGCAACGTGCTGGTGTGTGT\n', 'GCTGGCCCATCACTTTGGCAAAGAATTCACCCCGCAAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGT\n', 'GTGGCTAATGCCCTGGCCCACAAGTACCACTAAGCTCACTTTCTTGCTGTC\n']
[line.strip('\n') for line in fasta_contents][:9]
['>gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA', 'TGCTTCTGACATAGTTGTGTTGACTCACAAACTCAGAAACAGACACCATGGTGCACCTGACTGATGCTGA', 'GAAGGCTGCTGTTAATGGCCTGTGGGGAAAGGTGAACCCTGATGATGTTGGTGGCGAGGCCCTGGGCAGG', 'CTGCTGGTTGTCTACCCTTGGACCCAGAGGTACTTTGATAGCTTTGGGGACCTGTCCTCTGCCTCTGCTA', 'TCATGGGTAACCCTAAGGTGAAGGCCCATGGCAAGAAGGTGATAAACGCCTTCAATGATGGCCTGAAACA', 'CTTGGACAACCTCAAGGGCACCTTTGCTCATCTGAGTGAACTCCACTGTGACAAGCTGCATGTGGATCCT', 'GAGAACTTCAGGCTCCTGGGCAATATGATTGTGATTGTGTTGGGCCACCACCTGGGCAAGGAATTCACCC', 'CCTGTGCACAGGCTGCCTTCCAGAAGGTGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTA', 'AACCTCTTTTCCTGCTCTTGTCTTTGTGCAATGGTCAATTGTTCCCAAGAGAGCATCTGTCAGTTGTTGT']
strip
method for each string in our new list.'AGTC\n'.strip() == 'AGTC\n'.strip('\n\r\t ')
for line in fasta_contents:
line = line.strip()
if line[0] == '>':
print 'Header is %s' % (line)
else:
pass
Header is >gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA Header is >gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA Header is >gi|160358323|ref|NM_173917.2| Bos taurus hemoglobin, beta (HBB), mRNA Header is >gi|256600241|ref|NM_001164428.1| Macaca mulatta hemoglobin, beta (HBB), mRNA
'>'
at index 0You can also iterate directly on an open file.
fasta_file = open('hemoglobin_beta.fasta', 'r')
for line in fasta_file:
line = line.strip()
if line[0] == '>':
print 'Header is %s' % (line)
else:
pass ## do nothing
fasta_file.close()
Header is >gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA Header is >gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA Header is >gi|160358323|ref|NM_173917.2| Bos taurus hemoglobin, beta (HBB), mRNA Header is >gi|256600241|ref|NM_001164428.1| Macaca mulatta hemoglobin, beta (HBB), mRNA
How would you change the previous for
loop to identify only the sequence lines from the FASTA file?
How would you count the number of sequences in a FASTA file?
We can use open
to write a file using a loop.
outfile = open('just_the_headers.fasta', 'w')
for line in fasta_contents:
line = line.strip()
if line[0] == '>':
outfile.write(line + '\n') ## add back a line terminator
outfile.close()
with open('hemoglobin_beta.fasta', 'r') as fasta_file:
for line in fasta_file:
if line[0] == '>':
print "I found a header!"
I found a header! I found a header! I found a header! I found a header!
The with
context handler automatically closes a file after the subsequent code block finishes evaluation.
def get_fasta_headers(fasta):
headers = [] ## an empty list to store found headers
for line in fasta:
if line[0] == '>':
headers.append(line) ## add header to headers list
else:
pass
return headers
fasta_headers = get_fasta_headers(fasta_contents)
fasta_headers
['>gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA\n', '>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA\n', '>gi|160358323|ref|NM_173917.2| Bos taurus hemoglobin, beta (HBB), mRNA\n', '>gi|256600241|ref|NM_001164428.1| Macaca mulatta hemoglobin, beta (HBB), mRNA\n']
Functions take arguments and return values.
def function(args):
do_something
return(something)
return
can only be called once, and then the function endsargs
can be any type of objectdef get_fasta_headers(fasta, count=False):
counter = 0 ## counter if we need it
headers = [] ## an empty list to store found headers
for line in fasta:
if line[0] == '>':
headers.append(line) ## add header to headers list
if count:
counter += 1
else:
pass
if count:
return (headers, counter)
return headers
get_fasta_headers(fasta_contents, count=True)
(['>gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA\n', '>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA\n', '>gi|160358323|ref|NM_173917.2| Bos taurus hemoglobin, beta (HBB), mRNA\n', '>gi|256600241|ref|NM_001164428.1| Macaca mulatta hemoglobin, beta (HBB), mRNA\n'], 4)
How can we write a function that takes a DNA string and returns the complementary sequence?
Classes are like a blueprint for an object in Python. They are defined similarly to functions.
class fasta:
def __init__(self, name='', seq=''):
self.name = name
self.seq = seq
def __len__(self):
return len(self.seq)
def complement(self):
pairings = {'a':'t', 't':'a',
'g':'c', 'c':'g'}
comp = ''
for base in self.seq.lower():
comp += pairings[base]
return comp
seq1 = fasta(name='seq1', seq='ATGGCTATGTATGAT')
seq1.complement()
'taccgatacatacta'
So far we have been using Python interactively. This is fine for small, one-time tasks. However, if we need to run a longer or more complicated task we should write a script.
We just write the code in a text editor and save it with a .py
extension.
with open('hemoglobin_beta.fasta', 'r') as fasta_file:
for line in fasta_file:
if line[0] == '>':
print line
Then simply run the script from your shell:
$ python find_headers.py
>gi|17985948|ref|NM_033234.1| Rattus norvegicus hemoglobin, beta (Hbb), mRNA
>gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA
>gi|160358323|ref|NM_173917.2| Bos taurus hemoglobin, beta (HBB), mRNA
>gi|256600241|ref|NM_001164428.1| Macaca mulatta hemoglobin, beta (HBB), mRNA
Write a script that reads this file and converts each sequence from DNA to RNA. Email me the script as well as the resulting sequences!