Biopython – Sequence input/output

  • Last Updated : 22 Oct, 2020

Biopython's built-in Bio.SeqIO module provides functions for reading sequences from files and writing them back out. Bio.SeqIO supports most of the sequence file formats commonly used in bioinformatics. Whatever the input format, Biopython presents each parsed sequence to the user in a single, uniform way: as a SeqRecord object.

SeqRecord

The SeqRecord class, provided by the Bio.SeqRecord module, holds a sequence together with its metadata. Its main attributes are listed below:

Attribute     Description
seq           The sequence itself, usually a Seq object.
id            Primary identifier of the sequence; a string.
name          Name of the sequence; a string.
description   Human-readable description of the sequence.
annotations   Dictionary of additional information about the sequence.
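
To make the attributes above concrete, here is a minimal sketch that builds a SeqRecord by hand and reads each field back. The sequence, identifier, name, and description are made-up values chosen only for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical record: all values below are invented for this example
record = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGC"),
    id="demo_001",
    name="demo",
    description="hypothetical example sequence",
    annotations={"molecule_type": "DNA"},
)

# Each attribute from the table is a plain Python attribute
print("seq:", record.seq)
print("id:", record.id)
print("name:", record.name)
print("description:", record.description)
print("annotations:", record.annotations)
```

Records returned by Bio.SeqIO expose the same attributes, so code written against a hand-built record should also work on parsed ones.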

Reading Sequence:

The Bio.SeqIO module has a read() function that takes a sequence file and turns it into a single SeqRecord according to the given file format. It only handles files containing exactly one record; if the file has no records or more than one record, a ValueError is raised. The syntax and arguments of read() are given below:

Bio.SeqIO.read(handle, format, alphabet=None)
Argument   Description
handle     File handle, or a filename as a string (older versions accept only a handle).
format     File format as a lowercase string.
alphabet   Optional; historically used to specify the sequence type when it could not be inferred from the file (e.g. format = "fasta"). Since Biopython 1.78 it is no longer supported and must be left as None.
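
The exactly-one-record rule can be checked without any files on disk. The sketch below feeds small in-memory FASTA strings to read() through io.StringIO; the record names and sequences are invented for the demonstration.

```python
from io import StringIO
from Bio import SeqIO

# Hypothetical in-memory FASTA data
empty_text = ""
two_records = ">seq1\nACGT\n>seq2\nGGCC\n"
one_record = ">seq1\nACGT\n"

# Zero records and two records both make read() raise ValueError
errors = []
for text in (empty_text, two_records):
    try:
        SeqIO.read(StringIO(text), "fasta")
    except ValueError as err:
        errors.append(str(err))
        print("read() failed:", err)

# Exactly one record succeeds and returns a SeqRecord
single = SeqIO.read(StringIO(one_record), "fasta")
print("read() succeeded:", single.id, single.seq)
```

When the number of records is not known in advance, parse() is the safer choice because it accepts any number of records.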

Python3




# Import libraries
from Bio import SeqIO
  
# Reading file
record = SeqIO.read("sequence.gb", "genbank")
  
# Showing records
print("ID: %s" % record.id)
print("Sequence length: %i" % len(record))
print("Sequence description: %s" % record.description)

Parsing Sequence:

The parse() function provided by the Bio.SeqIO module is used to read multiple records from a handle. It turns the sequence file into an iterator that yields SeqRecord objects one at a time. If the data is held in a string rather than a file, it must first be wrapped in a handle (for example with io.StringIO) before it can be parsed. For file formats where the alphabet cannot be determined from the file (e.g. FASTA), older Biopython versions allowed it to be specified explicitly. The syntax and arguments of parse() are given below:

Bio.SeqIO.parse(handle, format, alphabet=None)
Argument   Description
handle     File handle, or a filename as a string (older versions accept only a handle).
format     File format as a lowercase string.
alphabet   Optional; historically used to specify the sequence type when it could not be inferred from the file (e.g. format = "fasta"). Since Biopython 1.78 it is no longer supported and must be left as None.
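
As a small illustration of parsing string data, the sketch below wraps a FASTA string in io.StringIO so that parse() can treat it as a handle. The two records are invented, and list() is used only to pull every record out of the iterator at once.

```python
from io import StringIO
from Bio import SeqIO

# Hypothetical FASTA text held in a string rather than a file
fasta_text = (
    ">alpha first demo record\nACGTACGT\n"
    ">beta second demo record\nGGCCTTAA\n"
)

# Wrap the string in a handle, then parse it like a file
handle = StringIO(fasta_text)
records = list(SeqIO.parse(handle, "fasta"))

ids = [record.id for record in records]
for record in records:
    print("ID: %s" % record.id)
    print("Sequence length: %i" % len(record))
    print("Sequence description: %s" % record.description)
```

For large files it is usually better to loop over the iterator directly instead of calling list(), so that only one record is held in memory at a time.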

Python3




# Import libraries
from Bio import SeqIO
  
# Parsing file
filename = "sequence.fasta"
for record in SeqIO.parse(filename, "fasta"):
  
    # Showing records
    print("ID: %s" % record.id)
    print("Sequence length: %i" % len(record))
    print("Sequence description: %s" % record.description)

Writing Sequence:

For writing to a file, the Bio.SeqIO module has a write() function, which writes a set of sequences to the file and returns an integer giving the number of records written. If you pass a handle rather than a filename, be sure to close it after calling write() so that the data is flushed to disk. The syntax and arguments of write() are given below:

Bio.SeqIO.write(sequences, handle, format)
Argument    Description
sequences   List or iterator of SeqRecord objects (or a single SeqRecord in Biopython 1.54 or later).
handle      File handle, or a filename as a string (older versions accept only a handle).
format      File format to write, as a lowercase string.
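
The return value of write() is easy to inspect with an in-memory handle. The sketch below writes two invented records to an io.StringIO buffer instead of a file, so nothing is created on disk; the record ids and descriptions are hypothetical.

```python
from io import StringIO
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical records used only for this demonstration
records = [
    SeqRecord(Seq("ACGTACGT"), id="rec1", description="hypothetical record one"),
    SeqRecord(Seq("GGCCTTAA"), id="rec2", description="hypothetical record two"),
]

# Write to an in-memory handle; write() returns the number of records written
buffer = StringIO()
count = SeqIO.write(records, buffer, "fasta")
text = buffer.getvalue()

print("Records written: %i" % count)
print(text)
```

Checking the returned count against the expected number of records is a cheap way to catch an input iterator that was empty or already exhausted.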

Python3




# Import libraries
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
  
rec1 = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD"
                     + "GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK"
                     + "NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM"),
                 id="gi|14150838|gb|AAK54648.1|AF376133_1",
                 description="chalcone synthase [Cucumis sativus]")
  
rec2 = SeqRecord(Seq("MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC"
                     + "EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP"
                     + "KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN"
                     + "NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV"
                     + "SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW"
                     + "IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT"
                     + "TGEGLEWGVLFGFGPGLTVETVVLHSVAT"),
                 id="gi|13925890|gb|AAK49457.1|",
                 description="chalcone synthase [Nicotiana tabacum]")
sequences = [rec1, rec2]
  
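# Writing to file; write() returns the number of records written
with open("example.fasta", "w") as output_handle:
    count = SeqIO.write(sequences, output_handle, "fasta")
print("Records written: %i" % count)
  
# Reading the file back to check its contents
for record in SeqIO.parse("example.fasta", "fasta"):
    print("ID %s" % record.id)
    print("Sequence length %i" % len(record))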
