Biopython – Sequence Operations

Last Updated : 06 Aug, 2021

The Biopython module provides built-in methods for basic and advanced operations on sequences. The basic operations closely resemble string methods such as slicing, concatenation, find, count, strip, and split. Some of the advanced operations are listed below.
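As a rough illustration of the string-like behaviour mentioned above, the sketch below uses a plain Python str as a stand-in for a sequence. Seq objects offer similarly named operations; the variable names and the example motif strings are illustrative assumptions, not part of Biopython.

```python
# A plain str stands in for a Seq object here (illustrative assumption).
dna = "CTGACTGAAGCT"

first_codon = dna[:3]              # slicing
joined = dna + "AAA"               # concatenation
position = dna.find("GAA")         # index of the first match, -1 if absent
ct_count = dna.count("CT")         # number of non-overlapping occurrences
trimmed = "NNCTGANN".strip("N")    # strip flanking characters
pieces = dna.split("GA")           # split on a motif

print(first_codon, joined, position, ct_count, trimmed, pieces)
```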

Complement and Reverse Complement: Biopython provides the complement() and reverse_complement() methods. complement() returns a new sequence in which each nucleotide is replaced by its complementary base, while reverse_complement() returns that complement read in the opposite direction. Reverse complementing a sequence twice returns the original sequence. Below is a simple example of these methods:

Syntax: complement(self)

Return Type: <class 'Bio.Seq.Seq'>

Python3

# Import libraries
from Bio.Seq import Seq

# Create the sequence (Bio.Alphabet was removed in Biopython 1.78,
# so no alphabet argument is passed)
seq = Seq('CTGACTGAAGCT')

# Create the complement of the sequence and print it
comp = seq.complement()
print(comp)

# Reverse complement the complement and print it
rev_comp = comp.reverse_complement()
print(rev_comp)

Output:

GACTGACTTCGA
TCGAAGTCAGTC

In the above example, the complement() method creates the complement of the sequence, while the reverse_complement() method creates the complement and then reverses the result from left to right. Because comp is already the complement of seq, calling reverse_complement() on it returns seq in reverse order.

The Bio.Data.IUPACData module of Biopython provides the ambiguous_dna_complement dictionary, which maps each IUPAC nucleotide code to its complement and is used to perform the complement operations.

Python3

# Import libraries
from Bio.Data import IUPACData 
import pprint 
 
# Printing the dataset
pprint.pprint(IUPACData.ambiguous_dna_complement)

Output:

{
   'A': 'T',
   'B': 'V',
   'C': 'G',
   'D': 'H',
   'G': 'C',
   'H': 'D',
   'K': 'M',
   'M': 'K',
   'N': 'N',
   'R': 'Y',
   'S': 'S',
   'T': 'A',
   'V': 'B',
   'W': 'W',
   'X': 'X',
   'Y': 'R'}
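To make the role of this mapping concrete, here is a minimal pure-Python sketch of how a complement and a reverse complement could be computed from such a dictionary. It is not Biopython's implementation; the function names are hypothetical and only a subset of the mapping printed above is used.

```python
# Subset of the complement mapping printed above (unambiguous bases plus N).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def complement(seq):
    """Return the base-by-base complement of an uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in seq)

def reverse_complement(seq):
    """Return the complement read in the opposite direction."""
    return complement(seq)[::-1]

seq = "CTGACTGAAGCT"
print(complement(seq))                      # GACTGACTTCGA
print(reverse_complement(seq))              # AGCTTCAGTCAG
print(reverse_complement(complement(seq)))  # TCGAAGTCAGTC (seq reversed)
```

The last call mirrors the earlier example: reverse complementing a complement gives the original sequence in reverse order.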

GC Content (guanine-cytosine content): GC content is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine or cytosine. It is calculated by dividing the number of G and C nucleotides by the total number of nucleotides and multiplying by 100. Below is a basic example of calculating GC content:

Syntax: Bio.SeqUtils.GC(seq)

Return Type: <class 'float'>

Python3

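# Import libraries
from Bio.Seq import Seq
from Bio.SeqUtils import GC

# Create the sequence (Bio.Alphabet was removed in Biopython 1.78,
# so no alphabet argument is passed)
seq = Seq("CTGACTGAAGCT")

# Get the GC content as a percentage
# (recent Biopython releases deprecate GC() in favor of gc_fraction(),
# which returns a fraction between 0 and 1 instead)
print(GC(seq))

Output:

50.0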
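The calculation described above can also be sketched by hand. The helper below is a hypothetical pure-Python function, not part of Biopython, and it assumes that only the letters G and C count toward the total (ambiguous codes such as S are ignored).

```python
def gc_percent(seq):
    """Return the percentage of G and C bases in a nucleotide string."""
    if not seq:
        return 0.0  # avoid dividing by zero for an empty sequence
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)

# 3 G + 3 C out of 12 bases
print(gc_percent("CTGACTGAAGCT"))  # 50.0
```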

Transcription: Transcription is the process of converting a DNA sequence into an RNA sequence. Biological transcription uses one DNA strand as the template and produces an mRNA that is its reverse complement (template GACT -> mRNA AGUC). In Biopython, transcribe() instead treats the given DNA as the coding strand and converts it to mRNA simply by replacing each T with U. A simple example is given below:

Syntax: transcribe(self)

Return Type: <class 'Bio.Seq.Seq'>

Python3

# Import libraries
from Bio.Seq import Seq
from Bio.Seq import transcribe

# Create the DNA sequence (Bio.Alphabet was removed in Biopython 1.78,
# so no alphabet argument is passed)
dna_seq = Seq("CTGACTGAAGCT")

# Transcribe DNA to RNA
rna_seq = transcribe(dna_seq)
print(rna_seq)

# Back-transcribe RNA to DNA
print(rna_seq.back_transcribe())

Output:

CUGACUGAAGCU
CTGACTGAAGCT
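The difference between the coding-strand shortcut and the template-strand route can be sketched in plain Python. The function names below are hypothetical and the code is only an illustration of the two ideas, not Biopython's implementation.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def transcribe_coding(dna):
    """Coding-strand shortcut: replace every T with U."""
    return dna.replace("T", "U")

def transcribe_template(template):
    """Template-strand route: reverse complement, then replace T with U."""
    rev_comp = "".join(COMPLEMENT[base] for base in reversed(template))
    return rev_comp.replace("T", "U")

def back_transcribe(rna):
    """Reverse of the shortcut: replace every U with T."""
    return rna.replace("U", "T")

print(transcribe_coding("CTGACTGAAGCT"))   # CUGACUGAAGCU
print(transcribe_template("GACT"))         # AGUC
print(back_transcribe("CUGACUGAAGCU"))     # CTGACTGAAGCT
```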

Translation: Translation is the process of converting an RNA sequence into a protein sequence. The Seq class has a built-in translate() method for this purpose. To stop translation at the first stop codon, pass to_stop=True to the translate() method.

Biopython uses the translation tables provided by The Genetic Codes page of NCBI. The standard translation table is shown below:

Syntax: translate(self, table='Standard', stop_symbol='*', to_stop=False, cds=False, gap='-')

Return Type: <class 'Bio.Seq.Seq'>

Python3

# import libraries
from Bio.Data import CodonTable
 
# Creating table
table = CodonTable.unambiguous_dna_by_name["Standard"] 
 
# Print table
print(table)

Output:

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

A simple example of translation is given below :

Python3

# Import libraries
from Bio.Seq import Seq

# Create the RNA sequence (Bio.Alphabet was removed in Biopython 1.78,
# so no alphabet argument is passed)
rna = Seq('UACCGGAUUGUUUUCCCGGGCUGAUCCUGUGCCCGA')
print(rna)

# Translate the RNA (the asterisk '*' marks a stop codon)
print(rna.translate())

# Stop translation at the first stop codon
print(rna.translate(to_stop=True))

Output:

UACCGGAUUGUUUUCCCGGGCUGAUCCUGUGCCCGA
YRIVFPG*SCAR
YRIVFPG
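To show how the standard table drives translation, here is a minimal pure-Python sketch. It rebuilds the standard code from the compact amino-acid string used on the NCBI Genetic Codes page (RNA base order U, C, A, G) and reads the sequence three bases at a time. The translate() function here is a hypothetical helper, not Biopython's method; it assumes uppercase, unambiguous RNA and silently drops any trailing partial codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(rna, to_stop=False):
    """Translate an RNA string codon by codon; '*' marks a stop codon."""
    protein = []
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)

rna = "UACCGGAUUGUUUUCCCGGGCUGAUCCUGUGCCCGA"
print(translate(rna))                # YRIVFPG*SCAR
print(translate(rna, to_stop=True))  # YRIVFPG
```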
