Michael Grünstäudl (Gruenstaeudl), PhD

Researcher at the Freie Universität Berlin

Setting burn-in and combining posterior tree distributions using awk and sed

Efficiency on the UNIX shell

I often find myself manually removing a set of phylogenetic trees from a posterior tree distribution in order to set a burn-in and then combining the post-burn-in trees of the individual runs. Both steps can be done very efficiently with awk and sed on a UNIX shell. The following commands keep the last 500 trees of each of two MrBayes runs and merge them into a single tree file:

inf1=Mrbayes_test.run1.t
inf2=Mrbayes_test.run2.t

tmpf1=${inf1%.t*}_postBurnin.tre
tmpf2=${inf2%.t*}_postBurnin.tre
outf=${inf1%.run1.t*}_combined_postBurnin.tre

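# Reverse each file with tac so that awk can count trees from the end,
# keep the last 500 trees plus all non-tree lines, then restore the order.
tac "$inf1" | awk '$1=="tree" && ++counter<=500 {print; next}
  $1!="tree"' | tac > "$tmpf1"
tac "$inf2" | awk '$1=="tree" && ++counter<=500 {print; next}
  $1!="tree"' | tac > "$tmpf2"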
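# Insert the lines of run 2 that do not occur in run 1 (i.e. its trees) after the
# "end;" line of run 1, drop every "end;" line, and append one closing "end;".
sed '/end;/r'<(grep -vFf "$tmpf1" "$tmpf2") "$tmpf1" |
  grep -v "end;" | sed '$ a end;' > "$outf"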
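
To check that the burn-in step keeps the number of trees you expect, the awk filter can be wrapped in a small function and tried on a toy file. This is only a sketch: the function name keep_last_trees and the four made-up trees are my own assumptions for illustration, and tac requires GNU coreutils.

```shell
# Hypothetical helper (not part of the commands above):
# print a NEXUS tree file with only its last N trees retained.
keep_last_trees() {
  # $1 = input tree file, $2 = number of trees to keep
  tac "$1" |
    awk -v n="$2" '$1=="tree" && ++counter<=n {print; next} $1!="tree"' |
    tac
}

# Toy tree file with made-up content, just to exercise the filter
demo=$(mktemp)
printf '%s\n' '#NEXUS' 'begin trees;' \
  '   tree gen.1 = (A,(B,C));' \
  '   tree gen.2 = ((A,B),C);' \
  '   tree gen.3 = ((A,C),B);' \
  '   tree gen.4 = (A,(B,C));' \
  'end;' > "$demo"

# Count the trees that survive when only the last two are kept
kept=$(keep_last_trees "$demo" 2 | awk '$1=="tree"' | wc -l)
first_kept=$(keep_last_trees "$demo" 2 | awk '$1=="tree" {print $2; exit}')
echo "trees kept: $kept (first retained tree: $first_kept)"
rm -f "$demo"
```

Passing the number of trees as an awk variable makes it easy to try different burn-in settings without editing the awk program itself.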

New paper – A Python package for DNA sequence submissions

Facilitating DNA sequence submissions to ENA

I just published a new paper on a Python package that facilitates the user-friendly submission of plant and fungal DNA barcoding sequences to ENA. The software's co-author is an undergraduate student who designed its graphical user interface as part of his bachelor's thesis.

Teaching botany at night

Get out your cellphones.

Teaching botany at night makes plant identification an interesting challenge. Luckily, today’s students all come equipped with brand-new cellphones (with integrated flashlights). Thus, plant morphology does not need to wait until daylight or the next blooming season.

Teaching botany - Jan 2019

Morone chrysops at Inks Lake State Park, TX

Notes from the outdoors

Morone chrysops (Moronidae) at Inks Lake State Park, Burnet County, Texas (30.7310°, -98.3706°)

Command-line search interface for ENA

Digging through ENA records.

With the Python library enasearch, which provides a Python interface to ENA’s API, command-line searches of the ENA databases have become a delight.

Suppose, for example, that you want to count the individual accessions of the common plant DNA barcoding marker trnK-matK that became publicly available during 2017. This very specific aim can be achieved as follows:

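# Assemble the query piece by piece to keep stray backslashes and line breaks out of it
query='organelle=plastid'
query+=' AND (first_public>=2017-01-01 AND first_public<2018-01-01)'
query+=' AND (description="*matK*" OR description="*maturase K*")'
query+=' AND (description="*trnK*" OR description="*tRNA-Lys*")'
query+=' AND tax_division=PLN AND topology=LINEAR'
# wc -l counts every line of the report, including a header line if one is printed
enasearch search_data --query "$query" \
  --result sequence_release --display report | wc -l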
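
The same search can be repeated for several years by generating the query from a template. The following is a minimal sketch under a few assumptions of mine: the function name build_query and the choice of years are hypothetical, and the enasearch call only runs if the tool is installed.

```shell
# Hypothetical helper: build the trnK-matK query for one calendar year
build_query() {
  local year=$1
  # printf '%s' with several arguments simply concatenates them
  printf '%s' \
    'organelle=plastid' \
    " AND (first_public>=${year}-01-01 AND first_public<$((year + 1))-01-01)" \
    ' AND (description="*matK*" OR description="*maturase K*")' \
    ' AND (description="*trnK*" OR description="*tRNA-Lys*")' \
    ' AND tax_division=PLN AND topology=LINEAR'
}

# Example years chosen for illustration only
for year in 2015 2016 2017; do
  query=$(build_query "$year")
  if command -v enasearch >/dev/null 2>&1; then
    n=$(enasearch search_data --query "$query" \
          --result sequence_release --display report | wc -l)
    echo "$year: $n report lines"
  else
    echo "$year: enasearch not found; the query would be: $query"
  fi
done
```

Keeping the query in a function separates the search logic from the call itself, so the date window can be changed in one place.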

Quick info parsing from GenBank accessions

Taking the essence.

Have you ever found yourself browsing through individual sequence records of the NCBI GenBank database and wishing that you could extract only the metadata information of a record (e.g., authors, publication status, taxonomy), but not the feature table of a record or the sequence itself? With the help of Entrez Direct and awk this is easy.

Take, for example, two complete plastid genomes of Cabomba, which are saved as GenBank accessions MG720558 and MG720559. You can easily extract the metadata in Bash via the following command:

# Print each record from its LOCUS line up to, but not including, the FEATURES line
efetch -db nucleotide -format gb -id MG720558,MG720559 |
  awk '/^FEATURES/{flag=1} /^LOCUS/{flag=0} !flag'
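
If you want to see how the awk flag works without querying NCBI, you can feed it a toy record. The sketch below is only an illustration: the function name gb_metadata and both records are made up, and the records are heavily shortened compared to real GenBank entries.

```shell
# Hypothetical wrapper around the awk filter: reads GenBank flat-file text
# on stdin and prints everything from LOCUS up to (not including) FEATURES.
gb_metadata() {
  awk '/^FEATURES/{flag=1} /^LOCUS/{flag=0} !flag'
}

# Two made-up, heavily shortened records
demo_record='LOCUS       DEMO0001   12 bp    DNA
DEFINITION  Made-up record for illustration.
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 acgtacgtac gt
//
LOCUS       DEMO0002   8 bp    DNA
DEFINITION  Second made-up record.
FEATURES             Location/Qualifiers
ORIGIN
        1 acgtacgt
//'

metadata=$(printf '%s\n' "$demo_record" | gb_metadata)
printf '%s\n' "$metadata"
```

The flag is switched on at every FEATURES line and off again at the next LOCUS line, so the filter works for any number of concatenated records.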

Botanical field trip in the Carinthian Alps

Alpine plants in the mist

An alpine field trip in late September can mean cold temperatures and lots of foggy weather. Interesting plants abound nonetheless.

Rote Wand at Mt. Dobratsch

Carduus defloratus (Asteraceae)

Michael Gruenstaeudl in September 2018

Talk at evolutionary plant biology conference

Talking about novel bioinformatic tools for DNA sequence submissions

This Thursday, I gave a talk at the 24th International Symposium on Biodiversity and Evolutionary Biology of the German Botanical Society. I introduced the participants to some of my newly developed tools for streamlining and automating the submission of plant DNA barcoding sequences to public sequence repositories. This conference was a wonderful example of how small conferences can both meet high scientific standards and be an enjoyable reprieve for the participants. Lots of interesting talks and a great social programme amid the gorgeous Carinthian scenery!

Talk at DBG Sektionstagung 2018 in Klagenfurt, Kaernten

New paper – Bioinformatic Workflows for Generating Complete Plastid Genome Sequences

In science, standardization and repeatability are a must.

Together with two other scientists, I just published a new paper on bioinformatic workflows for generating complete plastid genome sequences in the context of plastid phylogenomics of the water-lily clade. We demonstrate that standardization and repeatability are essential for modern plant phylogenomics and show how both can be achieved efficiently during plastid genome assembly, annotation and alignment.

One-liner: Splitting multi-sequence FASTA into single-sequence FASTA

Quick, split it!
There are one-liners that never get old. Here is another one of them.

# -z (GNU csplit) suppresses empty output files, such as the piece before the first ">"
$ csplit -z multisequence.fasta '/^>/' '{*}'
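
To try the split without touching your own data, you can run it on a toy FASTA file in a scratch directory. This is a sketch that assumes GNU csplit. The output prefix seq_, the suffix format and the three made-up sequences are my own choices for illustration.

```shell
# Scratch directory with a made-up three-sequence FASTA file
workdir=$(mktemp -d)
printf '%s\n' '>seq1' 'ACGT' '>seq2' 'GGCC' 'TTAA' '>seq3' 'ATAT' \
  > "$workdir/multisequence.fasta"

# Run csplit inside the scratch directory so that the pieces land there.
# -s: do not print piece sizes, -z: skip empty pieces,
# -f/-b: name the pieces seq_00.fasta, seq_01.fasta, ...
(
  cd "$workdir" &&
  csplit -s -z -f seq_ -b '%02d.fasta' multisequence.fasta '/^>/' '{*}'
)

# Inspect the result before cleaning up
n_files=$(find "$workdir" -name 'seq_*.fasta' | wc -l)
second_header=$(head -n 1 "$workdir/seq_01.fasta")
echo "$n_files files written; the second file starts with $second_header"
rm -r "$workdir"
```

Giving the pieces a prefix and a .fasta suffix makes them easier to tell apart from other files than the default names xx00, xx01 and so on.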