Michael Grünstäudl (Gruenstaeudl), PhD

Researcher at the Freie Universität Berlin

New paper – Bioinformatic Workflows for Generating Complete Plastid Genome Sequences

In science, standardization and repeatability are a must.

Together with two other scientists, I just published a new paper on bioinformatic workflows for generating complete plastid genome sequences in the context of plastid phylogenomics of the water-lily clade. We demonstrate that standardization and repeatability are essential to modern plant phylogenomics, and we show how both can be achieved efficiently during plastid genome assembly, annotation and alignment.


One-liner: Splitting multi-sequence FASTA into single-sequence FASTA

Quick, split it!
There are one-liners that never get old. Here is another one of them.

$ csplit multisequence.fasta '/>/' '{*}' &&
find . -size 0 -print0 | xargs -0 rm --
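
If you would rather skip the cleanup of empty files, a small awk function along the following lines may do the job. This is only a sketch: the function name split_fasta and the record_ file prefix are placeholders of my own, and it assumes that every record starts with a line beginning with ">".

```shell
#!/usr/bin/env bash
# Sketch: write each FASTA record to its own numbered file.
# split_fasta and the "record_" prefix are illustrative names (assumptions).
split_fasta() {
    local infile=$1 outdir=$2
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        # A header line starts a new record: close the previous file, open the next.
        /^>/ { if (out) close(out); n++; out = sprintf("%s/record_%03d.fasta", dir, n) }
        # Every line after the first header goes into the current record file.
        out  { print > out }
    ' "$infile"
}
```

Called as split_fasta multisequence.fasta split_out, it writes one file per record into split_out and never creates an empty file, so nothing has to be deleted afterwards.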

Teaching in spring 2018 – Part II

Teaching in spring 2018

Teaching in spring 2018 – Part I

In the lab with teachers-to-be

One of the master-level courses I teach at the Freie Universität Berlin during the spring semester, the NatLab Evolution, is geared towards the comprehensive training of future high-school teachers.

Teaching in spring 2018

One-liner: Interleaved to deinterleaved FASTA

Quick, de-interleave it!
There are one-liners that never get old. This is one of them.

$ perl -MBio::SeqIO -e \
'my $seqin = Bio::SeqIO->new(-fh => \*STDIN,
 -format => "fasta");
 while (my $seq = $seqin->next_seq)
 { print ">", $seq->id, "\n", $seq->seq, "\n"; }' \
< interleaved.fasta > deinterleaved.fasta


Update 28-Jan-2019:
Over time, I came to find working with BioPerl uncomfortable, as its clean installation is just not well-supported on Linux. Thus, I have found myself relying on this method more and more, assuming that the line breaks of the input file are LF (and not CRLF):

$ awk '/^>/ {printf("\n%s\n",$0);next; }
{ printf("%s",$0);}  END {printf("\n");}' \
< interleaved.fasta | tail -n +2 \
> deinterleaved.fasta
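
If the input might contain CRLF line breaks, one possible workaround is to strip the carriage returns before awk sees the data. The following is a minimal sketch, not a tested replacement for the command above; the function name deinterleave_fasta is a placeholder of my own.

```shell
#!/usr/bin/env bash
# Sketch: de-interleave a FASTA file regardless of LF or CRLF line breaks.
# deinterleave_fasta is an illustrative name (assumption).
deinterleave_fasta() {
    # Drop carriage returns first so that CRLF input behaves like LF input.
    tr -d '\r' < "$1" |
    awk '
        # Header: finish the previous sequence line (if any), then print the header.
        /^>/ { if (inseq) printf("\n"); print; inseq = 0; next }
        # Sequence line: append it without a line break.
             { printf("%s", $0); inseq = 1 }
        # Terminate the last sequence line.
        END  { if (inseq) printf("\n") }
    '
}
```

Usage would be deinterleave_fasta interleaved.fasta > deinterleaved.fasta. Because the function only prints a line break once sequence data has been written, no leading empty line appears and the tail step is not needed.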

Few-liner: Batch download of DNA sequences from NCBI

The wonders of entrez

Today I found myself in need of a script to download dozens of DNA sequences submitted to NCBI Nucleotide. The sequences in question were listed in the file input.txt.

$ cat input.txt

Here is how I did it:

$ INF=input.txt
$ for line in $(cat "$INF"); do
    SEQNAME=$(echo "$line" | awk -F',' '{print $1}')
    ACCNUM=$(echo "$line" | awk -F',' '{print $2}')
    FULLNAM=">${SEQNAME}_${ACCNUM}"
    SEQ=$(esearch -db nucleotide -query "$ACCNUM" | efetch -format fasta | tail -n +2)
    echo -e "$FULLNAM\n$SEQ" >> out.txt
  done
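
Before sending dozens of queries to NCBI, it may be worth checking that the input file parses as expected. The sketch below does a dry run that only prints the FASTA headers the loop would generate, without any network access. It assumes one "name,accession" pair per line, which is what the awk -F',' calls above imply; the function name preview_headers is a placeholder of my own.

```shell
#!/usr/bin/env bash
# Sketch: dry run that prints the FASTA header each input line would produce.
# Assumes one "name,accession" pair per line; preview_headers is an
# illustrative name (assumption).
preview_headers() {
    local seqname accnum _
    # The "|| [ -n ... ]" part also handles a last line without a line break.
    while IFS=, read -r seqname accnum _ || [ -n "$seqname" ]; do
        # Skip blank or incomplete lines instead of building a broken header.
        [ -n "$seqname" ] && [ -n "$accnum" ] || continue
        printf '>%s_%s\n' "$seqname" "$accnum"
    done < "$1"
}
```

Running preview_headers input.txt should print one header per usable line; any line that is missing from the output is one that the download loop would probably mishandle.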

Bioinformatic spring cleaning – Part II

An improved few-liner to keep the data compressed

If you wish to recursively loop through a folder and its nested subfolders and automatically gzip all files greater than 1 GB, the following few-liner is for you:

for file in $(LANG=C find . -size +1G -type f -print); do
    if [[ $file != *.gz ]]; then
        gzip "$file"
    fi
done
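
A loop over the output of find splits file names at spaces. If your folders contain such names, letting find call gzip directly may be the safer route. The following is a minimal sketch under that assumption; the function name gzip_larger_than and its two parameters are placeholders of my own.

```shell
#!/usr/bin/env bash
# Sketch: gzip every regular file above a size threshold, skipping *.gz files.
# gzip_larger_than and its parameters are illustrative names (assumptions).
gzip_larger_than() {
    local dir=$1 size=$2
    # find passes the file names to gzip itself, so spaces in names are safe.
    find "$dir" -type f -size "+$size" ! -name '*.gz' -exec gzip -- {} +
}
```

For the case described above, the call would be gzip_larger_than . 1G. The size argument is handed to find unchanged, so any suffix your find understands (such as k, M or G) should work.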

Bioinformatic spring cleaning – Part I

A short one-liner to keep the data compressed

One of the bash one-liners that I use after every successful project, yet never remember when needed, is for the simple task of looping through your folders, tar-zipping them and then removing the original folders.

for i in $(ls -d */); do
    tar czf "${i%%/}.tar.gz" "$i" && rm -r "$i";
done
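
Since the original folders are deleted, you may want to confirm that each archive can actually be read back before removing anything. The sketch below adds such a check with tar tzf; the function name archive_dirs is a placeholder of my own, and it assumes that the subfolders of a single parent folder are to be archived.

```shell
#!/usr/bin/env bash
# Sketch: tar-gzip each subfolder, verify the archive, then remove the folder.
# archive_dirs is an illustrative name (assumption).
archive_dirs() {
    local parent=$1 d name
    for d in "$parent"/*/; do
        # If there are no subfolders, the glob stays literal and is skipped.
        [ -d "$d" ] || continue
        name=${d%/}
        # Remove the folder only if the archive was written and lists cleanly.
        tar czf "$name.tar.gz" -C "$parent" "$(basename "$name")" &&
            tar tzf "$name.tar.gz" > /dev/null &&
            rm -r "$name"
    done
}
```

Calling archive_dirs . mirrors the one-liner above. The glob handles folder names with spaces, and a folder is kept whenever creating or listing its archive fails.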

First signs of spring 2018

Galanthus nivalis (Amaryllidaceae) in Berlin in March 2018

Workshop at the GfBS 2018

Talking about efficient data partitioning strategies

Today, I held a workshop at the 19th annual meeting of the Gesellschaft für Biologische Systematik (GfBS). I introduced the participants to computational strategies to automate the selection of data partitioning schemes and nucleotide substitution models for phylogenomic datasets.
