Michael Grünstäudl (Gruenstaeudl), PhD

Postdoctoral Researcher at the Freie Universität Berlin

New Paper – Plastid phylogenomic study on South American sunflowers

DNA sequence alignment matters.

Together with a doctoral student and two other co-authors, I just published a new paper on a plant lineage of South American sunflowers. We demonstrate – among other aspects – that a correct DNA sequence alignment remains a critical aspect in plastid phylogenomic studies, even if the amount of sequence data outweighs classic molecular phylogenetic studies by far.

New Paper – A Python package to evaluate archived plastid genomes

Automated database assessment.

Together with an undergraduate co-author, I just published a new paper on a Python package that automates the survey of thousands of plastid genomes archived on NCBI Nucleotide. Using our new software, we demonstrate – among other aspects – that almost half of all plastid genomes archived on NCBI Nucleotide lack sequence annotations for their inverted repeat regions.

Talk and workshop at Botany 2021

Summer time is conference time.

As part of the joint conference (“Botany 2021 – Virtual!“) of several North American botanical and mycological societies, I am giving two scientific presentations: a regular talk in the topic section on comparative genomics and a workshop presentation.

The workshop has been recorded and can be viewed online here (login to the conference website necessary).

Michael Gruenstaeudl at Botany 2021 – Virtual!

Life can be so easy if you speak grep

Habla usted grep?

In the phylogenetic analyses of a manuscript draft, I accidentally named a species with the specific epithet “violaceae” where it should have been “violacea”. I had consistently used the wrong epithet for years, and countless subdirectories of subdirectories of subdirectories now contain analysis files that include the incorrect name.

How can this snafu be rectified in a reasonable time?

Recursive grep to the rescue …

grep -rl violaceae . | xargs sed -i 's/violaceae/violacea/g'

… and the world is fine again.

Buena vista con mVISTA

For an upcoming publication, a doctoral student and I want to visualize the sequence variability among several plastid genomes via the tool mVISTA. This tool is often employed in plastid phylogenomic studies and generally simple to use. However, if a user wishes to input custom annotations, it can be quite tricky to generate the correct input for the software. Hence, I automated the generation of the input files as described below.

An effective visualization of plastid genomes in mVISTA requires the x input genomes the user wants to visualize (“input genomes” hereafter; e.g., x=2) as well as a reference genome. For simplicity, a user could employ x1 as the reference genome. The input genomes should be in FASTA format, the reference genome in GenBank format.

To generate custom annotations based on the reference genome, I employ the following Bash code, which (a) converts the reference genome from GenBank format to a cleaned GFF3 format (and incidentally also saves the genome in FASTA format), and (b) converts the cleaned GFF3 file to the mVISTA input.

INF=NC_000932.gb # Just an example
## Converting input file from GenBank format to a cleaned GFF3 format
# Note: This step also generates a FASTA file
grep -vE "codon_start|db_xref|exception" $INF > ${INF}2
to-gff --getfasta ${INF}2 ${INF%.gb*}.gff
grep "gene=" ${INF%.gb*}.gff | \
    awk -F';' '{print $1}' | \
    grep -vE "remark|intron|misc_feature|repeat_region" > ${INF%.gb*}.gff.clean
rm ${INF}2 ${INF%.gb*}.gff
## Converting input file from GFF3 format to VISTA format
grep -v "^#" ${INF%.gb*}.gff.clean | \
    grep -v "rps12" | \
    awk '{if ($3 ~ /gene/) {print $7" "$4" "$5" "$3" "$9} else {print $7" "$4" "$5" "$3} }' | \
    awk '{if ($4 ~ /gene/) {gsub(/\+/, ">", $1); gsub(/\-/, "<", $1); print $0} else {$1=""; print $0} }' | \
    sed 's/gene=//' | \
    sed 's/gene/agene/' | \
    sort -n -k2 | \
    sed 's/agene/gene/' | \
    awk '{$1=$1}1' | \
    sed 's/CDS/exon/' | \
    sed 's/tRNA/utr/' | \
    sed 's/rRNA/utr/' > ${INF%.gb*}.mvista
mVISTA example

mVISTA example

Quick Illumina read statistics in Bash

Recently, a Master’s student of mine asked me to re-calculate some data statistics of an Illumina sequencing run. Among the desired statistics were (a) the total number of read bases (in bp), (b) the total number of reads, (c) the GC content (in %), and (d) the AT content (in %).

Since the average file size of this run was almost a dozen GB per sample, I wrote some Bash code to can calculate these data statistics within minutes.


#Total read bases (bp):
AllBases=$(grep -A1 "^@" --no-group-separator ${SAMPLE}_*.fastq | \
  grep -v "@" | sed "s|$SAMPLE\_.\.fastq-||g" | wc -m)
echo "$AllBases"

#Total reads:
grep "^@" ${SAMPLE}_*.fastq | wc -l

GandCs=$(grep -A1 "^@" --no-group-separator ${SAMPLE}_*.fastq | \
  grep -v "@" | sed "s|$SAMPLE\_.\.fastq-||g" | tr -cd GC | wc -m)
echo "scale=4;  $GandCs / $AllBases" | bc

AandTs=$(grep -A1 "^@" --no-group-separator ${SAMPLE}_*.fastq | \
  grep -v "@" | sed "s|$SAMPLE\_.\.fastq-||g" | tr -cd AT | wc -m)
echo "scale=4;  $AandTs / $AllBases" | bc

Simple indel coding “for to go”

The process of coding insertions and deletions in a multi-species DNA alignment such that the presence of these indels can be used as additional information in a phylogenetic investigation is typically referred to as “indel coding”. Multiple indel coding schemes exist, with a scheme proposed by Simmons and Ochoterena (2000; often referred to as “simple indel coding scheme”) representing the most widely employed scheme.

Phylogenomic datasets require users to conduct indel coding for dozens, if not hundreds of DNA alignments. However, automating the process of indel coding based on the simple indel coding scheme is fairly straightforward:

We use FHCRC’s Seqmagick to convert NEXUS-formatted alignments to FASTA-formatted alignments, the perl script 2matrix (Salinas and Little 2014) to conduct the indel coding, and then simply conduct some file hygiene afterward.

# Convert NEXUS to FASTA
seqmagick convert $INF ${INF%.nex*}.fasta
# Conduct indel coding
perl /path_to_git/2matrix/2matrix.pl -i ${INF%.nex*}.fasta \
-o n,p -n ${INF%.nex*}__SIC # File hygiene mv ${INF%.nex*}__SIC.part ${INF%.nex*}__SIC.part.txt rm ${INF%.nex*}__SIC.garli.nex ${INF%.nex*}__SIC.conf

Automatically renaming contigs of assembly results

The genome assembly process often generates FASTA-formatted contig files, in which the contigs have cryptic sequence names. By using specific Bash commands, one can automatically rename these contigs based on the name of the file they are contained in.

If your contig file contains only a single contig:

for i in *__contig.fasta; do 
  sed -i "1s/.*/>$VAR/" $i; 

If your contig file contains multiple contigs:

for i in *__contigs.fasta; do 
  sed -i "s/>.*/>$VAR/" $i;
  awk '$0=$0' $i | awk '!(NR%2){print prev "__Length" length($0) ORS $0} {prev=$0}' > ${i}.new

New Paper – An R package for visualizing plastome coverage depth

Visualizing the depth

I just published a new paper on an R package that assists users in visualizing the coverage depth of a plastid genome in relation to its circular, quadripartite genome structure. Co-author of the software is an undergraduate student, who (as part of his bachelor thesis) designed the visualization engine of the software.

Teaching botany during the Coronavirus pandemic

Time for innovative methods.

Teaching botany and evolutionary biology to college students during spring 2020 certainly required the application of innovative teaching methods. Classes are held entirely online, and access to suitable plant material is limited by the closure of university greenhouses. Still, with the proper adjustment of the course structure, a meaningful learning experience is possible.

Teaching in May 2020

Teaching in May 2020