## Massive plastid genome sequencing coincides with incomplete annotations of the inverted repeats

High numbers do not equate with high quality

Over the past 10 years, the number of complete plastid genome sequences available on NCBI GenBank has skyrocketed, especially for flowering plants. Whereas in December 2009, approximately 120 complete plastid genomes of flowering plants were present on GenBank, there are 5,838 complete plastid genomes of flowering plants available at the end of November 2019. That is an increase of 4,765% in 10 years (or roughly a quinquaginta-nupling of all plastid genomes available since 2009 [yes, I had Latin in school]).

However, it appears that the annotation quality of many of the genomes submitted to GenBank over the past decade remained less than stellar during this increase. The inverted repeat (IR) regions of a plastid genome are integral parts of virtually any flowering plant plastome, and their presence in the sequence annotations are typically a good proxy for the annotation quality. While in December 2009, approximately 95 (or approx. 80%) of the available plastid genomes of flowering plants did not contain unambiguous annotations for the IRs, that number has increased to 3,190 (or approx. 55%) at the end of November 2019.

Example of IR annotation (highlighted in grey) in GenBank flat file of a complete plastid genome

In short, despite a dramatic increase in complete plastid genomes of flowering plants available on GenBank, more than half of all those genomes still do not have appropriate sequence annotations of the IR regions of the plastid genome. This is quite concerning, not least because IR annotations are very simple to include in a genome file (see figure). Clearly, improved strategies for annotating (or correcting) complete plastid genomes are necessary.

## Talks at public events

One of the by-products of being a scientist: Talks at public events

Michael Gruenstaeudl at VictoriaTalk2018

## Improved sorting of numbered DNA sequences

Keeping things orderly

Like many other molecular phylogeneticits, I often work with massive numbers of FASTA-formatted DNA sequences. Occasionally, the names of these sequences are numbered in a simplistic fashion (i.e., 1,2,…,48,49), which has the unfortunate side-effect of messing up the intended order of the sequences when sorted numerically, as the sequences 10, 11, …, 19 are followed (and not preceded!) by 1.

To overcome this issue, it is necessary to prepend all sequence names of numbers shorter than the maximum number of digits with one or more leading zeros. The resulting sequence names would, thus, be: 01, 02, …, 48, 49.

Here is a quick Python script that performs the prepending of leading zeros (assuming the FASTA-specific larger-signs are in different lines than the sequence names):

#!/usr/bin/env python2.7
import sys
inFn = sys.argv[1]
delim = '_'
with open(inFn, 'r') as reader, \
for line in lines:
if delim in line and line[0].isdigit():
lparts = [e+delim for e in \
line.split(delim) if e]
lparts[-1] = lparts[-1].rstrip(delim)
if len(lparts[0]) <= 3:
line = lparts[0].rjust(4, '0') + \
''.join(lparts[1:])
writer.write(line)
else:
writer.write(line)


## On the issue of file formats during DNA sequence submissions to GenBank

Ramblings on an important topic

A series of software tools exist that allow users to conduct submissions of DNA sequences to NCBI GenBank, but file conversion represents a recurring challenge for those submissions. Similar to DNA sequence submissions to ENA, GenBank provides a wide range of options to upload annotated DNA sequences in a custom format, including an interactive submission portal, a web-based tool (BankIt) and two stand-alone software solutions (Sequin, tbl2asn), the latter of which operates on a command-line basis.

However, for several of these tools (e.g., BankIt, Sequin, tbl2asn), the user must provide tab-delimited data tables (“five-column tables”) in order to provide information regarding gene annotations. Five-column tables display a complex syntax and, thus, require considerable bioinformatic knowledge to generate. The methodological gap between GenBank-formatted flatfiles and Sequin-formatted submission files is partially bridged by the perl-script gbf2tbl.pl, which takes GenBank-formatted flatfiles as input, extracts the DNA sequence and saves it in FASTA-format, as well as extracts gene annotations saves them as five-column tables.

Upon conversion to five-column tables via gbf2tbl.pl, the data can be read by tbl2asn to create Sequin files for direct submission. However, gbf2tbl.pl has considerable limitations, not least in user-friendliness and ability to communicate problems to the user, which is why Lehwark and Greiner (2018) developed a web-based online tool to directly convert GenBank-formated flatfiles into Sequin-formatted submission files.

## Teaching in Spring 2019

Teaching.

Usus artium magister, dicendo dicere discunt.“ (Eucharium Eyering, Copiae Proverbium)

Teaching in spring 2019

## Extracting mapped R1 and R2 reads from SAM file

Extracting those that mapped

Recently, I found myself in need of extracting only those reads from a sequence alignment map (SAM) file that actually mapped to the reference genome, while maintaining the separation into the paired-end read design. By using a combination of samtools, bedtools and awk, this can be done very efficiently:

INFL=MySamples.sam
STEM=${INF%.sam*} TMP1=${STEM}.extracted.sam
TMP2=${STEM}.sorted.sam OTFL=${STEM}.fastq

# Extracting only paired mapped reads
samtools view -b -F12 $INFL >$TMP1
#samtools view -b -F4 $INFL >$TMP1

# (necessary for the command bedtools bamtofastq)
samtools sort -n $TMP1 >$TMP2

# Extracting fastq sequences
bedtools bamtofastq -i $TMP2 -fq$OTFL

# Separating successfully mapped paired reads
# into an R1 file and an R2 file
cat $OTFL | awk -v n=4 'BEGIN{f=2} { if((NR-1)%n==0){f=1-f}; print > "out_R" f ".fastq" }' mv out_R-1.fastq${STEM}_R1.fastq
mv out_R2.fastq ${STEM}_R2.fastq  ## Setting burn-in and combining posterior tree distributions using awk and sed Efficiency on the UNIX shell I often find myself manually removing a set of phylogenetic trees from a posterior tree distribution in order to set a burn-in and then combining the post-burnin trees of the individual runs. This action can be done very efficiently using awk on a UNIX shell: inf1=Mrbayes_test.run1.t inf2=Mrbayes_test.run2.t tmpf1=${inf1%.t*}_postBurnin.tre
tmpf2=${inf2%.t*}_postBurnin.tre outf=${inf1%.run1.t*}_combined_postBurnin.tre

tac $inf1 | awk '$1 =="tree" && ++counter<=500
{print; next} $1 !="tree"' | tac >$tmpf1
tac $inf2 | awk '$1 =="tree" && ++counter<=500
{print; next} $1 !="tree"' | tac >$tmpf2
sed '/end;/r'<(grep -vFf "$tmpf1" "$tmpf2") $tmpf1 | grep -v "end;" | sed '$ a\end;\' > \$outf


## New paper – A Python package for DNA sequence submissions

Facilitating DNA sequence submissions to ENA

I just published a new paper on a Python package to facilitate the user-friendly submission of plant and fungal DNA barcoding sequences to ENA. Co-author of the software is an undergraduate student, who (as part of his bachelor thesis) designed the graphical user interface for the software.

## Teaching botany at night

Teaching botany at night makes plant identification an interesting challenge. Luckily, today’s students all come equipped with brand-new cellphones (with integrated flashlights). Thus, plant morphology does not need to wait until daylight or the next blooming season.

Teaching botany – Jan 2019

## Morone chrysops at Inks Lake State Park, TX

Notes from the outdoors

Morone chrysops (Moronidae) at Inks Lake State Park, Burnet County, Texas (30.7310°, -98.3706°)