Coding the insertions and deletions (indels) of a multi-species DNA alignment so that their presence can serve as additional information in a phylogenetic investigation is typically referred to as “indel coding”. Multiple indel coding schemes exist; the one proposed by Simmons and Ochoterena (2000), often referred to as “simple indel coding”, is the most widely employed.
Phylogenomic datasets require users to conduct indel coding for dozens, if not hundreds, of DNA alignments. Fortunately, automating indel coding under the simple indel coding scheme is fairly straightforward:
We use FHCRC’s Seqmagick to convert NEXUS-formatted alignments to FASTA format, the Perl script 2matrix (Salinas and Little 2014) to conduct the indel coding, and then finish with some file hygiene.
INF=atpF_intron_aligned_b4manAdj_BETB_exhot.nex
# Convert NEXUS to FASTA
seqmagick convert $INF ${INF%.nex*}.fasta
# Conduct indel coding
perl /path_to_git/2matrix/2matrix.pl -i ${INF%.nex*}.fasta \
-o n,p -n ${INF%.nex*}__SIC
# File hygiene
mv ${INF%.nex*}__SIC.part ${INF%.nex*}__SIC.part.txt
rm ${INF%.nex*}__SIC.garli.nex ${INF%.nex*}__SIC.conf
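Since the point is to process dozens or hundreds of alignments, the same three steps can be wrapped in a loop. The following is only a minimal sketch under a few assumptions: the variable names TWOMATRIX and ALN_DIR and the function name sic_one are hypothetical, all input files are assumed to end in .nex, and 2matrix is assumed to write the same set of output files as in the single-file example above. The expansion ${inf%.nex*} removes the shortest trailing match of ".nex*", which leaves the file stem.

```shell
#!/usr/bin/env bash
# Sketch: run the conversion, indel coding and file hygiene on every
# NEXUS alignment in a directory.
# TWOMATRIX and ALN_DIR are hypothetical names; adjust both to your setup.
TWOMATRIX=${TWOMATRIX:-/path_to_git/2matrix/2matrix.pl}
ALN_DIR=${ALN_DIR:-.}

sic_one() {
    local inf="$1"
    local stem="${inf%.nex*}"   # strip ".nex" (and anything after it)

    # Convert NEXUS to FASTA
    seqmagick convert "$inf" "$stem.fasta" || return 1
    # Conduct indel coding (same options as in the single-file example)
    perl "$TWOMATRIX" -i "$stem.fasta" -o n,p -n "${stem}__SIC" || return 1
    # File hygiene
    mv "${stem}__SIC.part" "${stem}__SIC.part.txt"
    rm -f "${stem}__SIC.garli.nex" "${stem}__SIC.conf"
}

for f in "$ALN_DIR"/*.nex; do
    [ -e "$f" ] || continue               # the glob matched nothing
    # Assumption: earlier results may sit in the same directory; skip them
    # so that a rerun does not code already-coded files a second time.
    case "$f" in *__SIC*) continue ;; esac
    sic_one "$f" || echo "indel coding failed for: $f" >&2
done
```

Quoting the variables keeps the loop safe for file names containing spaces, and reporting failures on stderr lets the run continue with the remaining alignments instead of stopping at the first problematic file.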