Michael Grünstäudl (Gruenstaeudl), PhD

Improved sorting of numbered DNA sequences

Keeping things orderly

Like many other molecular phylogeneticists, I often work with massive numbers of FASTA-formatted DNA sequences. Occasionally, the names of these sequences are numbered in a simplistic fashion (i.e., 1, 2, …, 48, 49), which has the unfortunate side effect of messing up the intended order of the sequences when they are sorted alphabetically rather than numerically: with underscore-delimited names such as 1_ and 10_, the sequences 10, 11, …, 19 are followed (and not preceded!) by 1.
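
The effect is easy to reproduce with a minimal sketch. The sequence names below are made up for illustration, and the underscore after the number is an assumption about the naming scheme. It matters here because the underscore sorts after all digits in ASCII.

```python
# Hypothetical sequence names; the underscore delimiter is an assumption.
names = ['1_taxonA', '2_taxonB', '10_taxonC', '11_taxonD']

# sorted() compares strings character by character, not as numbers,
# so '10_...' and '11_...' end up in front of '1_...'.
alphabetical = sorted(names)
print(alphabetical)
# ['10_taxonC', '11_taxonD', '1_taxonA', '2_taxonB']
```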

To overcome this issue, every number that has fewer digits than the largest number must be padded with one or more leading zeros. The resulting sequence names would thus be: 01, 02, …, 48, 49.
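
As a rough sketch of the idea, the padding width can be derived from the longest number in the set. The helper `pad_name`, the example names, and the underscore delimiter are illustrative assumptions and not part of the original script.

```python
# Hypothetical sequence names; the underscore delimiter is an assumption.
names = ['1_taxonA', '2_taxonB', '10_taxonC', '11_taxonD']

def pad_name(name, width, delim='_'):
    """Pad the number in front of the first delimiter with leading zeros."""
    number, rest = name.split(delim, 1)
    return number.zfill(width) + delim + rest

# Use the digit count of the longest number as the target width.
width = max(len(name.split('_', 1)[0]) for name in names)
padded = [pad_name(name, width) for name in names]

# With equal-width numbers, alphabetical order matches numerical order.
print(sorted(padded))
# ['01_taxonA', '02_taxonB', '10_taxonC', '11_taxonD']
```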

Here is a quick Python script that prepends the leading zeros (assuming that the FASTA-specific greater-than signs are on different lines than the sequence names):

#!/usr/bin/env python3
import sys

in_fn = sys.argv[1]  # path to the input FASTA file
delim = '_'          # separates the leading number from the rest of the name
width = 3            # number of digits after padding

with open(in_fn, 'r') as reader, \
     open(in_fn + '.adjusted', 'w') as writer:
    for line in reader:
        # Only adjust name lines: they start with a digit and contain the delimiter
        if delim in line and line[0].isdigit():
            number, rest = line.split(delim, 1)
            # Pad the leading number with zeros, e.g. 7_abc becomes 007_abc
            line = number.zfill(width) + delim + rest
        # All other lines (and numbers that are already wide enough) pass through unchanged
        writer.write(line)
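
To illustrate the input layout the script expects, here is a small in-memory sketch. The function `adjust_lines` and the FASTA content are made up for this example. The function mirrors what appears to be the script's rule of padding the leading number to three digits.

```python
import io

def adjust_lines(lines, delim='_', width=3):
    """Yield lines with the leading number of each name line zero-padded."""
    for line in lines:
        if delim in line and line[0].isdigit():
            number, rest = line.split(delim, 1)
            line = number.zfill(width) + delim + rest
        yield line

# Made-up input in the assumed layout: '>' on its own line, name on the next.
fasta = io.StringIO('>\n1_taxonA\nACGT\n>\n10_taxonB\nTTGA\n')
adjusted = ''.join(adjust_lines(fasta))
print(adjusted)
```

Only the two name lines change (to 001_taxonA and 010_taxonB). The greater-than signs and the sequence lines pass through untouched.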

This post was published on Thursday, 12 September 2019 at 15:42 by Michael Grünstäudl and filed under bioinformatics.

Freie Universität Berlin


Postdoctoral Researcher at the Freie Universität Berlin
