DNA to Protein in Python 3


Translation Theory : DNA ⇒ RNA ⇒ Protein

Life depends on the ability of cells to store, retrieve, and translate genetic instructions. These instructions are needed to make and maintain living organisms. For a long time, it was not clear which molecules were able to copy and transmit genetic information. We now know that this information is carried by deoxyribonucleic acid, or DNA, in all living things.
DNA: DNA is a discrete code physically present in almost every cell of an organism. We can think of DNA as a one-dimensional string of characters drawn from a four-letter alphabet: A, C, G, and T. These are the first letters of the four nucleotides used to construct DNA: Adenine, Cytosine, Guanine, and Thymine. Each three-character sequence of nucleotides, called a nucleotide triplet or codon, corresponds to one amino acid (or to a stop signal), and several different triplets can code for the same amino acid. The sequence of amino acids is unique for each type of protein, and all proteins are built from the same set of just 20 amino acids in all living things.
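
To make the string view concrete, the following minimal sketch checks that a sequence uses only the four letters and splits it into triplets. The helper names `is_valid_dna` and `split_codons` are illustrative and not part of the original article, and the sample sequence is made up.

```python
def is_valid_dna(seq):
    """Return True if seq is non-empty and contains only A, C, G and T."""
    return len(seq) > 0 and set(seq) <= set("ACGT")


def split_codons(seq):
    """Split seq into consecutive three-letter chunks (the last may be shorter)."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


sample = "ATGGCCTAA"  # made-up sequence for illustration
print(is_valid_dna(sample))   # True
print(split_codons(sample))   # ['ATG', 'GCC', 'TAA']
```

A check like this is useful before translation, because a character outside A, C, G, T would not match any entry of a codon table.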

Instructions in the DNA are first transcribed into RNA and the RNA is then translated into proteins. We can think of DNA, when read as sequences of three letters, as a dictionary of life.
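
As a rough illustration of the transcription step, the RNA copy of a coding strand carries the same letters with U (uracil) in place of T. The function name `transcribe` below is hypothetical; the code in this article skips this step and translates DNA codons directly.

```python
def transcribe(dna):
    """Return the mRNA string for a DNA coding strand by replacing T with U."""
    return dna.replace("T", "U")


print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
```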
Aim: Convert a given sequence of DNA into its Protein equivalent.
Source: Download a DNA strand as a text file from NCBI, a public web-based repository of DNA sequences. The nucleotide sample used here is NM_207618.2. Save its sequence as DNA_sequence_original.txt, the file name used in the code that follows.
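
If you prefer to fetch the record from a script, NCBI exposes its databases through the E-utilities web service. The sketch below is an assumption, not the method used in this article: it builds an `efetch` URL for the accession and strips the FASTA header line (the line starting with `>`) from the response. The helper names are illustrative, and the download itself is left as a commented-out call because it needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build an E-utilities URL that returns the record as FASTA text."""
    params = {"db": "nuccore", "id": accession,
              "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def strip_fasta_header(text):
    """Drop FASTA header lines (starting with '>') and join the rest."""
    lines = [ln.strip() for ln in text.splitlines() if not ln.startswith(">")]
    return "".join(lines)


def download_sequence(accession):
    """Fetch the record over the network and return the bare sequence."""
    with urlopen(efetch_url(accession)) as response:
        return strip_fasta_header(response.read().decode("ascii"))


# Example (requires network access):
# dna = download_sequence("NM_207618.2")
```

The returned string could then be written to DNA_sequence_original.txt so that the rest of the article works unchanged.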


Steps: The steps required to convert a DNA sequence into a sequence of amino acids are:

1. Write code to translate the DNA sequence into a sequence of amino acids, where each amino acid is
   represented by a unique letter.
2. Download the amino acid sequence from NCBI to check our solution.

Coding Translation

The very first step is to put the original, unaltered DNA sequence text file into the working directory. Check your working directory in the Python shell (in IPython or Jupyter, typing pwd also works):

>>> import os
>>> os.getcwd()

Next, we need to open the file in Python and read it. The text file contains hidden line-break characters, such as "\n" or "\r", which must be removed before translation. We use the replace() method to strip them and obtain the altered (cleaned) DNA sequence from the original text file.

inputfile = "DNA_sequence_original.txt"
with open(inputfile, "r") as f:  # the file is closed automatically
    seq = f.read()

# remove line-break characters
seq = seq.replace("\n", "")
seq = seq.replace("\r", "")
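
Files saved on different systems may also contain spaces or tabs. As an alternative sketch (not the code used in this article), splitting on any whitespace and re-joining removes all of these in one pass. The helper name `clean_sequence` is hypothetical and the sample content is made up.

```python
def clean_sequence(text):
    """Remove every whitespace character, including line breaks, from text."""
    return "".join(text.split())


raw = "ATGGC\r\nCTAA \n"  # made-up file content
print(clean_sequence(raw))  # ATGGCCTAA
```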


Next, we build a function called translate() that converts the altered DNA sequence into its protein equivalent and returns it. The altered DNA sequence is passed to the function as its parameter.

def translate(seq):
     
    table = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                 
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }
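    protein = ""
    # translate only if the sequence splits evenly into codons
    if len(seq) % 3 == 0:
        for i in range(0, len(seq), 3):
            codon = seq[i:i + 3]       # next three nucleotides
            protein += table[codon]    # look up the amino acid letter
    return protein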

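The table in the code above is the standard genetic code and can be found in biology references. The underscore '_' marks the three stop codons (TAA, TAG, and TGA), which end translation rather than coding for an amino acid. Because each amino acid is formed from a triplet of nucleotides called a codon, we first check that the length of the altered DNA sequence is divisible by 3 (the if statement before the loop); if it is not, the function returns an empty string. The loop then steps through the sequence three characters at a time, forms each codon, and looks up the matching amino acid in the table. Finally, the amino acid letters are joined into the amino acid sequence, also called the protein, which is returned.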
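
The 64-entry table can also be generated instead of typed by hand. The sketch below is one possible variant, not the code used in this article: it relies on the conventional TCAG ordering of the standard genetic code, uses '_' for stop codons as above, and, as a design assumption, raises an error instead of silently returning an empty string when the input is malformed. The names `build_codon_table` and `translate_strict` are hypothetical.

```python
from itertools import product

# Amino acids of the standard genetic code, listed for codons in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...); '_' marks the three stop codons.
AMINO_ACIDS = ("FFLLSSSSYY__CC_W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")


def build_codon_table():
    """Map each of the 64 DNA codons to its one-letter amino acid code."""
    codons = ("".join(c) for c in product("TCAG", repeat=3))
    return dict(zip(codons, AMINO_ACIDS))


def translate_strict(seq, table=None):
    """Translate seq codon by codon; raise ValueError on malformed input."""
    table = table or build_codon_table()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    try:
        return "".join(table[seq[i:i + 3]] for i in range(0, len(seq), 3))
    except KeyError as err:
        raise ValueError(f"unknown codon: {err.args[0]}") from None


print(translate_strict("ATGGCCTAA"))  # MA_
```

Raising an error makes a wrong slice or a stray character visible immediately, whereas an empty result can be mistaken for a valid but empty protein.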
The last step is to match our amino acid sequence against the original one found on the NCBI website. We compare both amino acid sequences in Python, character by character, and get True if they are exactly the same.
First, download the unaltered amino acid sequence text file and open it in Python. We build a function called read_seq() that reads a file and removes the unwanted characters, giving the altered amino acid sequence.

def read_seq(inputfile):
    with open(inputfile, "r") as f:
        seq = f.read()
    seq = seq.replace("\n", "")
    seq = seq.replace("\r", "")
    return seq

Finally, compare the two sequences and check that they are the same. If the output is True, we have succeeded in translating DNA to protein.
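
A plain == comparison only says whether the two strings match. When it returns False, it helps to know where they diverge. The following sketch, with the hypothetical helper name `first_mismatch`, walks both strings character by character and reports the first differing position. The sample strings are made up.

```python
def first_mismatch(a, b):
    """Return the index of the first difference between a and b, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one string is a prefix of the other
    return None


print(first_mismatch("MAKL", "MAKL"))   # None
print(first_mismatch("MAKL", "MARL"))   # 2
print(first_mismatch("MAKL", "MAKL_"))  # 4
```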

Final Code

# Python program to convert
# altered DNA to protein

inputfile = "DNA_sequence_original.txt"
with open(inputfile, "r") as f:
    seq = f.read()

seq = seq.replace("\n", "")
seq = seq.replace("\r", "")

def translate(seq):
    
    table = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                 
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }
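    protein = ""
    # translate only if the sequence splits evenly into codons
    if len(seq) % 3 == 0:
        for i in range(0, len(seq), 3):
            codon = seq[i:i + 3]       # next three nucleotides
            protein += table[codon]    # look up the amino acid letter
    return protein


def read_seq(inputfile):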
    with open(inputfile, "r") as f:
        seq = f.read()
    seq = seq.replace("\n", "")
    seq = seq.replace("\r", "")
    return seq

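prt = read_seq("amino_acid_sequence_original.txt")
dna = read_seq("DNA_sequence_original.txt")

# Translate only the slice dna[20:935]: 915 bases, i.e. 305 codons.
# This is taken to be the protein-coding part of the record, without
# the final stop codon.
p = translate(dna[20:935])
print(p == prt)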

Output : True

This article is contributed by Amartya Ranjan Saikia.
