Biopython – Sequence Operations
Last Updated :
06 Aug, 2021
The Biopython module provides built-in methods for both basic and advanced operations on sequences. The basic operations closely resemble Python string methods such as slicing, concatenation, find, count, strip, and split. Some of the advanced operations are described below.
Complement and Reverse Complement: Biopython's Seq object provides the complement() and reverse_complement() methods. complement() returns a new sequence in which every nucleotide is replaced by its complementary base, and reverse_complement() returns that complement reversed. Note that applying reverse_complement() to an already complemented sequence yields the original sequence reversed, not the original sequence itself. Below is a simple example of both methods:
Syntax: complement(self)
Return Type: <class 'Bio.Seq.Seq'>
Python3
from Bio.Seq import Seq
# Bio.Alphabet was removed in Biopython 1.78; a Seq is now built from the string alone
seq = Seq('CTGACTGAAGCT')
# Complement: each base is replaced by its pairing partner
comp = seq.complement()
print(comp)
# Reverse complement of the complement: the original sequence reversed
rev_comp = comp.reverse_complement()
print(rev_comp)
Output:
GACTGACTTCGA
TCGAAGTCAGTC
In the above example, the complement() method creates the complement of the nucleotide sequence, while the reverse_complement() method creates the complement and then reverses it from left to right.
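To make the relationship between the two operations concrete, here is a minimal pure-Python sketch for unambiguous DNA. The helper names complement and reverse_complement below are plain functions written for illustration; they are not Biopython's implementation and handle only the four letters A, C, G and T.

```python
# Illustrative sketch only: plain-string versions of the two operations.
# Handles unambiguous, upper-case DNA (A, C, G, T) and nothing else.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna: str) -> str:
    """Replace every base with its pairing partner."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna: str) -> str:
    """Complement the sequence, then reverse it."""
    return complement(dna)[::-1]


seq = "CTGACTGAAGCT"
comp = complement(seq)
print(comp)                      # GACTGACTTCGA
print(reverse_complement(comp))  # TCGAAGTCAGTC

# Reverse-complementing the complement gives the original reversed ...
print(reverse_complement(comp) == seq[::-1])               # True
# ... while reverse-complementing twice gives the original back.
print(reverse_complement(reverse_complement(seq)) == seq)  # True
```

The last two lines show why the second output above is the reversed input rather than the input itself.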
The Bio.Data.IUPACData module of Biopython provides the ambiguous_dna_complement dictionary, which maps each IUPAC nucleotide code, including the ambiguity codes, to its complement.
Python3
from Bio.Data import IUPACData
import pprint
pprint.pprint(IUPACData.ambiguous_dna_complement)
Output:
{
'A': 'T',
'B': 'V',
'C': 'G',
'D': 'H',
'G': 'C',
'H': 'D',
'K': 'M',
'M': 'K',
'N': 'N',
'R': 'Y',
'S': 'S',
'T': 'A',
'V': 'B',
'W': 'W',
'X': 'X',
'Y': 'R'}
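As a rough illustration of how such a mapping can be applied, the sketch below copies the dictionary printed above into a plain Python dict and looks up each letter in turn. The function name complement_ambiguous is hypothetical and chosen for this example; it is not part of Biopython.

```python
# Mapping copied from the output shown above (IUPAC ambiguity codes included).
AMBIGUOUS_DNA_COMPLEMENT = {
    "A": "T", "B": "V", "C": "G", "D": "H", "G": "C", "H": "D",
    "K": "M", "M": "K", "N": "N", "R": "Y", "S": "S", "T": "A",
    "V": "B", "W": "W", "X": "X", "Y": "R",
}


def complement_ambiguous(dna: str) -> str:
    """Complement a DNA string letter by letter (hypothetical helper)."""
    return "".join(AMBIGUOUS_DNA_COMPLEMENT[base] for base in dna.upper())


# R means "A or G", so its complement is Y, "T or C"; N stays N.
print(complement_ambiguous("ACGTRYN"))  # TGCAYRN
```

Codes such as S (G or C) and W (A or T) map to themselves because the set of bases they stand for is closed under complementation.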
GC Content (guanine-cytosine content): GC content is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine or cytosine. It is calculated by dividing the number of G and C nucleotides by the total number of nucleotides. Recent Biopython releases provide gc_fraction() for this, which returns a fraction between 0 and 1; it replaces the older GC() function, which returned a percentage. Below is a basic example of calculating GC content:
Syntax: Bio.SeqUtils.gc_fraction(seq)
Return Type: <class 'float'>
Python3
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction
seq = Seq("CTGACTGAAGCT")
# gc_fraction returns a value in [0, 1]; multiply by 100 for a percentage
print(gc_fraction(seq) * 100)
Output:
50.0
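The calculation itself is simple enough to write by hand. The following is a minimal sketch, assuming an unambiguous sequence; the function name gc_percent is hypothetical, and unlike a library routine it does not account for ambiguity codes such as S.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C letters in seq (hypothetical helper).

    Assumes unambiguous nucleotides; returns 0.0 for an empty sequence
    to avoid dividing by zero.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)


# 3 G + 3 C out of 12 bases
print(gc_percent("CTGACTGAAGCT"))  # 50.0
```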
Transcription: Transcription is the process of converting a DNA sequence into an RNA sequence. In the cell, the mRNA is synthesized from the template strand, so it corresponds to the reverse complement of that strand (GACT -> AGUC). Biopython instead works from the coding strand and produces the mRNA simply by replacing every T with U. The back_transcribe() method reverses this by replacing every U with T. A simple example is given below:
Syntax: transcribe(self)
Return Type: <class 'Bio.Seq.Seq'>
Python3
from Bio.Seq import Seq
from Bio.Seq import transcribe
# Coding-strand DNA; Bio.Alphabet was removed in Biopython 1.78, so no alphabet is passed
dna_seq = Seq("CTGACTGAAGCT")
# Transcribe: replace T with U
rna_seq = transcribe(dna_seq)
print(rna_seq)
# Back-transcribe: replace U with T to recover the DNA
print(rna_seq.back_transcribe())
Output:
CUGACUGAAGCU
CTGACTGAAGCT
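The difference between the biological view (starting from the template strand) and the coding-strand shortcut can be sketched in plain Python. The helper names transcribe_coding and transcribe_template are hypothetical and written only for this illustration; they assume unambiguous, upper-case DNA.

```python
# Illustrative sketch only; assumes unambiguous upper-case DNA.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def transcribe_coding(coding_dna: str) -> str:
    """Coding strand -> mRNA: swap T for U."""
    return coding_dna.replace("T", "U")


def transcribe_template(template_dna: str) -> str:
    """Template strand -> mRNA: reverse complement, then swap T for U."""
    coding_dna = template_dna.translate(COMPLEMENT)[::-1]
    return coding_dna.replace("T", "U")


coding = "CTGACTGAAGCT"
print(transcribe_coding(coding))    # CUGACUGAAGCU
print(transcribe_template("GACT"))  # AGUC

# Both routes give the same mRNA when the template is the
# reverse complement of the coding strand.
template = coding.translate(COMPLEMENT)[::-1]
print(transcribe_template(template) == transcribe_coding(coding))  # True
```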
Translation: Translation is the process of converting an RNA sequence into a protein sequence. The Seq object has a built-in translate() method for this purpose. To stop translation at the first stop codon, pass to_stop=True to translate().
Biopython uses the translation tables provided by The Genetic Codes page of NCBI. The Standard table, which translate() uses by default, is shown below:
Syntax: translate(self, table='Standard', stop_symbol='*', to_stop=False, cds=False, gap='-')
Return Type: <class 'Bio.Seq.Seq'>
Python3
from Bio.Data import CodonTable
table = CodonTable.unambiguous_dna_by_name[ "Standard" ]
print (table)
Output:
Table 1 Standard, SGC0
  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
A simple example of translation is given below:
Python3
from Bio.Seq import Seq
# An RNA sequence with an internal stop codon (UGA); no alphabet argument is needed
rna = Seq('UACCGGAUUGUUUUCCCGGGCUGAUCCUGUGCCCGA')
print(rna)
# Full translation; stop codons appear as '*'
print(rna.translate())
# Stop at the first in-frame stop codon
print(rna.translate(to_stop=True))
Output:
UACCGGAUUGUUUUCCCGGGCUGAUCCUGUGCCCGA
YRIVFPG*SCAR
YRIVFPG
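To show how the codon table drives this result, here is a minimal pure-Python sketch of translation with the Standard table. The names CODON_TABLE and translate below are written for this illustration and only mimic the to_stop behaviour described above; Biopython's own method supports further options, such as alternative tables, that are not reproduced here.

```python
from itertools import product

# Standard genetic code in the same T, C, A, G order as the table above;
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str, to_stop: bool = False) -> str:
    """Translate RNA or DNA codon by codon (illustrative helper)."""
    dna = seq.upper().replace("U", "T")
    protein = []
    # Ignore any trailing bases that do not form a complete codon.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = "UACCGGAUUGUUUUCCCGGGCUGAUCCUGUGCCCGA"
print(translate(rna))                # YRIVFPG*SCAR
print(translate(rna, to_stop=True))  # YRIVFPG
```

The eighth codon, UGA, is a stop codon, which is why the to_stop=True result ends after seven amino acids.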