Keeping things orderly
Like many other molecular phylogeneticists, I often work with large numbers of FASTA-formatted DNA sequences. Occasionally, the names of these sequences are numbered in a simplistic fashion (i.e., 1, 2, …, 48, 49), which has the unfortunate side effect of messing up the intended order of the sequences when the names are sorted as text rather than as numbers: the sequences 10, 11, …, 19 are then followed (and not preceded!) by 1.
To overcome this issue, every number with fewer digits than the largest number must be padded with one or more leading zeros. The resulting sequence names would thus be: 01, 02, …, 48, 49.
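To make the problem and the fix concrete, here is a minimal sketch in Python. The sequence names and the helper `pad_name` are made up for illustration, and the underscore after the number is an assumption about the name layout; `str.zfill` and `str.partition` are standard string methods.

```python
# Hypothetical sequence names, numbered without leading zeros
names = ["1_seq", "2_seq", "10_seq", "11_seq"]

# Text sorting compares character by character, so 10 and 11 land before 1
print(sorted(names))


def pad_name(name, width, delim="_"):
    """Zero-pad the leading number of a name to the given width."""
    number, sep, rest = name.partition(delim)
    return number.zfill(width) + sep + rest


# Pad every number to the digit count of the largest one
width = max(len(n.partition("_")[0]) for n in names)
padded = [pad_name(n, width) for n in names]

# With equal-width numbers, text order and numeric order agree
print(sorted(padded))
```

Once all numbers have the same width, comparing the names character by character gives the same order as comparing the numbers themselves.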
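Here is a quick Python script that prepends the leading zeros (assuming that the FASTA-specific greater-than signs are on different lines than the sequence names). As written, it expects an underscore after the number and pads one- and two-digit numbers to three digits: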
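#!/usr/bin/env python2.7
import sys

inFn = sys.argv[1]  # input FASTA file
delim = '_'         # separates the leading number from the rest of the name

with open(inFn, 'r') as reader, \
     open(inFn + '.adjusted', 'w') as writer:
    for line in reader:
        # Only adjust lines that start with a digit and contain the delimiter
        if delim in line and line[0].isdigit():
            # Split the name and re-attach the delimiter to every part
            lparts = [e + delim for e in line.split(delim) if e]
            lparts[-1] = lparts[-1].rstrip(delim)
            # One- or two-digit numbers (plus delimiter) are padded to three digits
            if len(lparts[0]) <= 3:
                line = lparts[0].rjust(4, '0') + ''.join(lparts[1:])
        # All other lines are written out unchanged
        writer.write(line)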
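If the target width should not be hard-coded, one possible variant is to derive it from the largest number in the file. The following is only a sketch under assumptions that go beyond the script above: the function name `pad_fasta_names` is hypothetical, names are assumed to begin with the number (optionally preceded by `>` on the same line), and sequence lines are assumed never to start with a digit.

```python
import re

# An optional FASTA '>' followed by a leading number (assumed name layout)
LEADING_NUMBER = re.compile(r"^(>?)(\d+)")


def pad_fasta_names(lines):
    """Zero-pad leading numbers in name lines to the width of the largest one."""
    # Assumption: sequence lines never start with a digit, so only names match
    matches = [LEADING_NUMBER.match(line) for line in lines]
    widths = [len(m.group(2)) for m in matches if m]
    if not widths:
        return list(lines)  # nothing to pad
    width = max(widths)
    out = []
    for line, m in zip(lines, matches):
        if m:
            # Keep the optional '>', pad the number, keep the rest of the line
            line = m.group(1) + m.group(2).zfill(width) + line[m.end():]
        out.append(line)
    return out


example = [">1_a\n", "ACGT\n", ">10_b\n", "GGCC\n", "100_c\n"]
print("".join(pad_fasta_names(example)), end="")
```

Unlike the script above, this sketch needs two passes over the lines (one to find the width, one to pad), so it holds the whole file in memory.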