Compound but not complex
The Biopython manual informs the alert reader that ‘join’ locations of EMBL/GenBank files can be handled by CompoundLocation objects. A CompoundLocation bundles several FeatureLocation objects into a single location and is very straightforward to operate.
Assume, for example, the following DNA sequence:
>>> from Bio.Seq import Seq
>>> s = Seq("AAATGAAATCAATAAAA")
>>> s
Seq('AAATGAAATCAATAAAA', Alphabet())
This example sequence contains three exons (each 3 bp long: “ATG”, “ATC” and “TAA”), each of which is flanked on both sides by a 2-bp spacer with the sequence “AA”. Taken together (i.e., ‘joined’), the exons translate to a protein of two amino acids followed by a stop: “M I *” (where the asterisk indicates the stop codon). How can you extract the exons from the above sequence?
First, you set up three FeatureLocation objects:
>>> from Bio.SeqFeature import FeatureLocation, CompoundLocation
>>> f1 = FeatureLocation(2, 5)
>>> f2 = FeatureLocation(7, 10)
>>> f3 = FeatureLocation(12, 15)
>>> f1
FeatureLocation(ExactPosition(2), ExactPosition(5))
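As an aside, the start and end values follow Python's slice convention (0-based, end-exclusive), so the coordinates can be sanity-checked with plain string slicing before involving Biopython at all. The following is a minimal sketch; the variable names (dna, exon_coords, spacer_coords) are chosen here for illustration only:

```python
# Plain-Python cross-check of the coordinates (no Biopython required).
dna = "AAATGAAATCAATAAAA"

# (start, end) pairs as passed to FeatureLocation: 0-based, end-exclusive.
exon_coords = [(2, 5), (7, 10), (12, 15)]
exons = [dna[start:end] for start, end in exon_coords]
print(exons)  # ['ATG', 'ATC', 'TAA']

# Everything outside the exons should be a 2-bp "AA" spacer.
spacer_coords = [(0, 2), (5, 7), (10, 12), (15, 17)]
spacers = [dna[start:end] for start, end in spacer_coords]
print(spacers)  # ['AA', 'AA', 'AA', 'AA']
```

If the slices do not come out as expected, the coordinates are most likely off by one, which is the usual pitfall when copying 1-based positions from a GenBank record.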
Second, you combine the FeatureLocation objects into a single CompoundLocation object:
>>> f = CompoundLocation([f1,f2,f3])
>>> f
CompoundLocation([FeatureLocation(ExactPosition(2), ExactPosition(5)), FeatureLocation(ExactPosition(7), ExactPosition(10)), FeatureLocation(ExactPosition(12), ExactPosition(15))], 'join')
Third, you extract the exons from the sequence via the CompoundLocation object:
>>> s2 = f.extract(s)
>>> s2
Seq('ATGATCTAA', Alphabet())
Finally, you translate the extracted DNA sequence:
>>> s2.translate()
Seq('MI*', HasStopCodon(ExtendedIUPACProtein(), '*'))
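For the sceptical reader, the whole chain can also be cross-checked by hand. The sketch below assumes forward-strand exons only and uses a deliberately partial codon table containing just the three codons that occur in this example; it illustrates the idea of 'join, then translate' and is not how Biopython implements extract() or translate():

```python
# Hand-rolled cross-check of the join-and-translate result (no Biopython).
dna = "AAATGAAATCAATAAAA"
exon_coords = [(2, 5), (7, 10), (12, 15)]  # 0-based, end-exclusive

# Forward-strand 'join': concatenate the exon slices in the order given.
cds = "".join(dna[start:end] for start, end in exon_coords)

# Partial codon table (assumption: only the codons of this example).
codon_table = {"ATG": "M", "ATC": "I", "TAA": "*"}

# Walk the coding sequence codon by codon and look up each amino acid.
protein = "".join(codon_table[cds[i:i + 3]] for i in range(0, len(cds), 3))

print(cds)      # ATGATCTAA
print(protein)  # MI*
```

Both strings match what the CompoundLocation route returns above.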
QED.