Michael Grünstäudl (Gruenstaeudl), PhD

Researcher at the Freie Universität Berlin

Improved sorting of numbered DNA sequences

Keeping things orderly

Like many other molecular phylogeneticits, I often work with massive numbers of FASTA-formatted DNA sequences. Occasionally, the names of these sequences are numbered in a simplistic fashion (i.e., 1,2,…,48,49), which has the unfortunate side-effect of messing up the intended order of the sequences when sorted numerically, as the sequences 10, 11, …, 19 are followed (and not preceded!) by 1.

To overcome this issue, it is necessary to prepend all sequence names of numbers shorter than the maximum number of digits with one or more leading zeros. The resulting sequence names would, thus, be: 01, 02, …, 48, 49.

Here is a quick Python script that performs the prepending of leading zeros (assuming the FASTA-specific larger-signs are in different lines than the sequence names):

#!/usr/bin/env python2.7
import sys
inFn = sys.argv[1]
delim = '_'
with open(inFn, 'r') as reader, \
open(inFn+'.adjusted', 'w') as writer:
    lines = reader.readlines()
    for line in lines:
        if delim in line and line[0].isdigit():
            lparts = [e+delim for e in \
                      line.split(delim) if e]
            lparts[-1] = lparts[-1].rstrip(delim)
            if len(lparts[0]) <= 3:
                line = lparts[0].rjust(4, '0') + \
                       ''.join(lparts[1:])
            writer.write(line)
        else:
            writer.write(line)

 

On the issue of file formats during DNA sequence submissions to GenBank

Ramblings on an important topic

A series of software tools exist that allow users to conduct submissions of DNA sequences to NCBI GenBank, but file conversion represents a recurring challenge for those submissions. Similar to DNA sequence submissions to ENA, GenBank provides a wide range of options to upload annotated DNA sequences in a custom format, including an interactive submission portal, a web-based tool (BankIt) and two stand-alone software solutions (Sequin, tbl2asn), the latter of which operates on a command-line basis.

However, for several of these tools (e.g., BankIt, Sequin, tbl2asn), the user must provide tab-delimited data tables (“five-column tables”) in order to provide information regarding gene annotations. Five-column tables display a complex syntax and, thus, require considerable bioinformatic knowledge to generate. The methodological gap between GenBank-formatted flatfiles and Sequin-formatted submission files is partially bridged by the perl-script gbf2tbl.pl, which takes GenBank-formatted flatfiles as input, extracts the DNA sequence and saves it in FASTA-format, as well as extracts gene annotations saves them as five-column tables.

Upon conversion to five-column tables via gbf2tbl.pl, the data can be read by tbl2asn to create Sequin files for direct submission. However, gbf2tbl.pl has considerable limitations, not least in user-friendliness and ability to communicate problems to the user, which is why Lehwark and Greiner (2018) developed a web-based online tool to directly convert GenBank-formated flatfiles into Sequin-formatted submission files.

Teaching in Spring 2019

Teaching.

Usus artium magister, dicendo dicere discunt.“ (Eucharium Eyering, Copiae Proverbium)

Teaching in spring 2019

Teaching in spring 2019

 

 

 

 

 

 

 

 

Extracting mapped R1 and R2 reads from SAM file

Extracting those that mapped

Recently, I found myself in need of extracting only those reads from a sequence alignment map (SAM) file that actually mapped to the reference genome, while maintaining the separation into the paired-end read design. By using a combination of samtools, bedtools and awk, this can be done very efficiently:

INFL=MySamples.sam
STEM=${INF%.sam*}
TMP1=${STEM}.extracted.sam
TMP2=${STEM}.sorted.sam
OTFL=${STEM}.fastq

# Extracting only paired mapped reads
samtools view -b -F12 $INFL > $TMP1
#samtools view -b -F4 $INFL > $TMP1

# Sorting SAM file by read header
# (necessary for the command `bedtools bamtofastq`)
samtools sort -n $TMP1 > $TMP2

# Extracting fastq sequences
bedtools bamtofastq -i $TMP2 -fq $OTFL

# Separating successfully mapped paired reads 
# into an R1 file and an R2 file
cat $OTFL | awk -v n=4 'BEGIN{f=2} { 
  if((NR-1)%n==0){f=1-f}; print > "out_R" f ".fastq"
}'
mv out_R-1.fastq ${STEM}_R1.fastq
mv out_R2.fastq ${STEM}_R2.fastq

 

Setting burn-in and combining posterior tree distributions using awk and sed

Efficiency on the UNIX shell

I often find myself manually removing a set of phylogenetic trees from a posterior tree distribution in order to set a burn-in and then combining the post-burnin trees of the individual runs. This action can be done very efficiently using awk on a UNIX shell:

inf1=Mrbayes_test.run1.t
inf2=Mrbayes_test.run2.t

tmpf1=${inf1%.t*}_postBurnin.tre
tmpf2=${inf2%.t*}_postBurnin.tre
outf=${inf1%.run1.t*}_combined_postBurnin.tre

tac $inf1 | awk '$1 =="tree" && ++counter<=500 
  {print; next} $1 !="tree"' | tac > $tmpf1
tac $inf2 | awk '$1 =="tree" && ++counter<=500 
  {print; next} $1 !="tree"' | tac > $tmpf2
sed '/end;/r'<(grep -vFf "$tmpf1" "$tmpf2") $tmpf1 | 
  grep -v "end;" | sed '$ a\end;\' > $outf

 

New paper – A Python package for DNA sequence submissions

Facilitating DNA sequence submissions to ENA

I just published a new paper on a Python package to facilitate the user-friendly submission of plant and fungal DNA barcoding sequences to ENA. Co-author of the software is an undergraduate student, who (as part of his bachelor thesis) designed the graphical user interface for the software.

Teaching botany at night

Get out your cellphones.

Teaching botany at night makes plant identification an interesting challenge. Luckily, today’s students all come equipped with brand-new cellphones (with integrated flashlights). Thus, plant morphology does not need to wait until daylight or the next blooming season.

Teaching botany - Jan 2019

Teaching botany – Jan 2019

 

 

 

 

 

 

 

 

 

 

Morone chrysops at Inks Lake State Park, TX

Notes from the outdoors

Morone chrysops (Moronidae) at Inks Lake State Park, Burnet County, Texas

Morone chrysops (Moronidae) at Inks Lake State Park, Burnet County, Texas (30.7310°, -98.3706°)

 

 

 

 

 

 

 

 

 

 

 

Commandline search interface for ENA

Digging through ENA records.

By using the Python library enasearch, which provides a Python-based interaction with ENA’s API, commandline searches of the ENA databases have become a delight.

Take, for example, the aim to infer the number of individual accessions of the common plant DNA barcoding marker trnK-matK that have become available over the course of the year 2017. This very specific aim can be achieved via the following command:

enasearch search_data \
--query 'organelle=plastid \
AND (first_public>2017-01-01 AND first_public<2018-01-01) \
AND (description="*matK*" OR description="*maturase K*") \
AND (description="*trnK*" OR description="*tRNA-Lys*") \
AND tax_division=PLN AND topology=LINEAR' \
--result sequence_release --display report | wc -l

 

Quick info parsing from GenBank accessions

Taking the essence.

Have you ever found yourself browsing through individual sequence records of the NCBI GenBank database and wishing that you could extract only the metadata information of a record (e.g., authors, publication status, taxonomy), but not the feature table of a record or the sequence itself? With the help of Entrez Direct and awk this is easy.

Take, for example, two complete plastid genomes of Cabomba, which are saved as GenBank accessions MG720558 and MG720559. You can easily extract the metadata in Bash via the following command:

efetch -db nucleotide -format gb -id MG720558,MG720559 | 
awk '/FEATURES/{flag=1} /LOCUS/{flag=0} !flag'