DNA to Protein in Python 3

Last Updated : 20 Mar, 2024

Translation Theory : DNA → RNA → Protein

Life depends on the ability of cells to store, retrieve, and translate genetic instructions. These instructions are needed to make and maintain living organisms. For a long time, it was not clear which molecules were able to copy and transmit genetic information. We now know that this information is carried by deoxyribonucleic acid, or DNA, in all living things.
DNA: DNA is a discrete code physically present in almost every cell of an organism. We can think of DNA as a one-dimensional string built from four characters: A, C, G, and T. They are the first letters of the four nucleotides used to construct DNA: Adenine, Cytosine, Guanine, and Thymine. Each three-character sequence of nucleotides, called a nucleotide triplet or codon, corresponds to one amino acid (or to a stop signal that ends the protein). The sequence of amino acids is unique for each type of protein, and all proteins are built from the same set of just 20 amino acids in all living things.

Instructions in the DNA are first transcribed into RNA and the RNA is then translated into proteins. We can think of DNA, when read as sequences of three letters, as a dictionary of life.
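The two-stage flow described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the input is treated as the coding strand, so transcription is modeled as replacing T with U, and the helper names transcribe() and codons() as well as the example sequence are made up for this sketch. The translate() function built later in this article skips the explicit RNA step and looks up DNA codons directly.

```python
def transcribe(dna):
    """Return the RNA written from a coding-strand DNA string (T becomes U)."""
    return dna.replace("T", "U")


def codons(seq):
    """Split a sequence into consecutive three-letter codons.

    Trailing letters that do not fill a whole codon are ignored.
    """
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


dna = "ATGGCCTAA"   # made-up example sequence, not taken from NCBI
rna = transcribe(dna)
print(rna)          # AUGGCCUAA
print(codons(dna))  # ['ATG', 'GCC', 'TAA']
```

Reading the string three letters at a time is what makes the "dictionary" view work: each triplet is one key to look up.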
Aim: Convert a given sequence of DNA into its Protein equivalent.
Source: Download a DNA strand as a text file from NCBI, a public web-based repository of DNA sequences. The nucleotide sample is NM_207618.2, which can be found on the NCBI website.

Steps: The steps required to convert a DNA sequence into a sequence of amino acids are:

1. Code to translate the DNA sequence to a sequence of amino acids where each amino acid is
   represented by a unique letter.
2. Download the amino acid sequence and compare it with the translated one.

Coding Translation

The very first step is to put the original, unaltered DNA sequence text file into the working directory. Check your working directory in the Python shell:

>>> import os
>>> os.getcwd()
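Beyond printing the directory, it can help to confirm that the file is actually there before trying to open it. The following is a minimal sketch using the standard pathlib module; the helper name file_in_working_dir() is hypothetical, and the file name is the one used in the code later in this article.

```python
from pathlib import Path


def file_in_working_dir(name):
    """Return True if a file called `name` exists in the current working directory."""
    return (Path.cwd() / name).is_file()


print("Working directory:", Path.cwd())
if file_in_working_dir("DNA_sequence_original.txt"):
    print("DNA file found")
else:
    print("DNA file not found; move it into the working directory first")
```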

Next, we need to open the file in Python and read it. The text file contains hidden line-break characters such as "\n" or "\r", and these need to be removed. So we use the replace() function to turn the contents of the original file into one continuous DNA sequence string.

Python

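inputfile = "DNA_sequence_original.txt"
# Read the whole file; the with-block closes it automatically
with open(inputfile, "r") as f:
    seq = f.read()

# Strip line-break characters so the sequence is one continuous string
seq = seq.replace("\n", "")
seq = seq.replace("\r", "")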
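If the downloaded text might also contain spaces or tabs, one possible alternative is to drop every whitespace character in a single pass. The sketch below is an assumption-laden variant, not the method used in this article: the helper name clean_sequence() is made up, and the upper-casing step is included only because the codon table used later has upper-case keys.

```python
def clean_sequence(raw):
    """Remove every whitespace character and upper-case the result."""
    # str.split() with no arguments splits on any run of whitespace,
    # including spaces, tabs, "\n" and "\r".
    return "".join(raw.split()).upper()


raw = "atgg cc\r\ntaa\n"    # made-up text with a space and mixed line endings
print(clean_sequence(raw))  # ATGGCCTAA
```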

Next, we will build a function called translate() which will convert the altered DNA sequence into its Protein equivalent and return it. We will feed the altered DNA sequence as a parameter to the function.

Python

def translate(seq):
    # Standard genetic code: DNA codon -> one-letter amino acid code.
    # '_' marks a stop codon.
    table = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', 
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                  
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', 
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', 
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', 
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W', 
    } 
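    protein = ""
    # Translate only if the sequence splits evenly into codons;
    # otherwise an empty string is returned.
    if len(seq) % 3 == 0:
        for i in range(0, len(seq), 3):
            codon = seq[i:i + 3]  # next three-nucleotide codon
            protein += table[codon]
    return protein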

The table in the code above maps each codon to its amino acid and can be found in biology references. Since each nucleotide triplet, called a codon, codes for a single amino acid, we first check that the length of the DNA sequence is divisible by 3 ( if len(seq)%3 == 0: ); if it is not, the function returns an empty string. Next, we read the sequence one codon at a time and look up each codon's amino acid in the table. Finally, we join the amino acids into the amino acid sequence, also called the protein, and return it.
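Because translate() returns an empty string for a badly sized input and raises a KeyError for an unexpected codon, problems can be hard to trace. The sketch below shows one possible stricter variant; it is not part of the original program, and the names BASES, AMINO_ACIDS, TABLE and translate_strict() are made up here. It builds the same 64-entry table compactly by listing the amino acids in T, C, A, G codon order, and it raises a ValueError with a message when the input is malformed.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...);
# '_' marks a stop codon, as in the table used in this article.
AMINO_ACIDS = ("FFLLSSSSYY__CC_W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_strict(seq):
    """Translate DNA to protein, raising ValueError on malformed input."""
    if len(seq) % 3 != 0:
        raise ValueError(f"length {len(seq)} is not a multiple of 3")
    protein = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        if codon not in TABLE:
            raise ValueError(f"unknown codon {codon!r} at index {i}")
        protein.append(TABLE[codon])
    return "".join(protein)


print(translate_strict("ATGGCCTAA"))  # MA_
```

Collecting the letters in a list and joining once at the end avoids repeated string concatenation, which matters little here but scales better for long sequences.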

The last step is to match our amino acid sequence against the original one found on the NCBI website. We will compare the two amino acid sequences in Python, character by character; the comparison returns True only if both are exactly the same.
First download the unaltered amino acid sequence text file and open it in Python. We will build a function called read_seq() that reads a file, removes the unwanted characters, and returns the cleaned sequence as a string.

Python

def read_seq(inputfile): 
    with open(inputfile, "r") as f: 
        seq = f.read() 
    seq = seq.replace("\n", "") 
    seq = seq.replace("\r", "") 
    return seq 

The last step is to compare both sequences and check that they are the same. If the output is True, we have succeeded in translating DNA to protein.
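When the comparison comes out False, it is useful to know where the two sequences first differ. The following is a small sketch of such a check; the helper name first_mismatch() and the short example strings are made up for illustration.

```python
def first_mismatch(a, b):
    """Return the index of the first differing character, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One string is a prefix of the other; they differ where the shorter ends.
        return min(len(a), len(b))
    return None


print(first_mismatch("MAKL", "MAKL"))   # None
print(first_mismatch("MAKL", "MARL"))   # 2
print(first_mismatch("MAKL", "MAKL_"))  # 4
```

The last case is the kind of difference a trailing stop-codon marker would produce.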

Final Code

Python

# Python program to convert 
# altered DNA to protein 
  
inputfile = "DNA_sequence_original.txt"
# The with-block closes the file automatically
with open(inputfile, "r") as f:
    seq = f.read()

seq = seq.replace("\n", "")
seq = seq.replace("\r", "")
  
def translate(seq):
    # Standard genetic code: DNA codon -> one-letter amino acid code.
    # '_' marks a stop codon.
    table = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', 
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                  
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', 
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', 
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', 
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W', 
    } 
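    protein = ""
    # Translate only if the sequence splits evenly into codons;
    # otherwise an empty string is returned.
    if len(seq) % 3 == 0:
        for i in range(0, len(seq), 3):
            codon = seq[i:i + 3]  # next three-nucleotide codon
            protein += table[codon]
    return protein


def read_seq(inputfile):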
    with open(inputfile, "r") as f: 
        seq = f.read() 
    seq = seq.replace("\n", "") 
    seq = seq.replace("\r", "") 
    return seq 
  
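prt = read_seq("amino_acid_sequence_original.txt")
dna = read_seq("DNA_sequence_original.txt")

# Translate the slice of the DNA sequence from index 20 up to,
# but not including, index 935
p = translate(dna[20:935])
# Prints True when the translation matches the downloaded protein
print(p == prt)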
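The slice dna[20:935] uses Python's zero-based, end-exclusive indexing, so it covers the 21st through the 935th character of the sequence, 915 characters in total, which is a multiple of 3. Sequence positions are often quoted counting from 1 with both ends included, so a tiny conversion helper can prevent off-by-one mistakes. The sketch below is illustrative only: the helper name region() and the example sequence are made up.

```python
def region(seq, start, end):
    """Return positions start..end of seq, counted from 1 and both inclusive."""
    return seq[start - 1:end]


example = "CCATGGCCTAAGG"        # made-up sequence
coding = region(example, 3, 11)  # same as example[2:11]
print(coding)                    # ATGGCCTAA
print(len(coding) % 3 == 0)      # True
```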