20. August 2019 von Michael Grünstäudl
Ramblings on an important topic
A series of software tools exist that allow users to conduct submissions of DNA sequences to NCBI GenBank, but file conversion represents a recurring challenge for those submissions. Similar to DNA sequence submissions to ENA, GenBank provides a wide range of options to upload annotated DNA sequences in a custom format, including an interactive submission portal, a web-based tool (BankIt) and two stand-alone software solutions (Sequin, tbl2asn), the latter of which operates on a command-line basis.
However, for several of these tools (e.g., BankIt, Sequin, tbl2asn), the user must provide tab-delimited data tables (“five-column tables”) in order to provide information regarding gene annotations. Five-column tables display a complex syntax and, thus, require considerable bioinformatic knowledge to generate. The methodological gap between GenBank-formatted flatfiles and Sequin-formatted submission files is partially bridged by the perl-script gbf2tbl.pl, which takes GenBank-formatted flatfiles as input, extracts the DNA sequence and saves it in FASTA-format, as well as extracts gene annotations saves them as five-column tables.
Upon conversion to five-column tables via gbf2tbl.pl, the data can be read by tbl2asn to create Sequin files for direct submission. However, gbf2tbl.pl has considerable limitations, not least in user-friendliness and ability to communicate problems to the user, which is why Lehwark and Greiner (2018) developed a web-based online tool to directly convert GenBank-formated flatfiles into Sequin-formatted submission files.
Kategorie bioinformatics, writing | 0 Kommentare »
5. June 2019 von Michael Grünstäudl
Teaching.
“Usus artium magister, dicendo dicere discunt.” (Eucharium Eyering, Copiae Proverbium)
Teaching in spring 2019
Kategorie Allgemein | 0 Kommentare »
5. June 2019 von Michael Grünstäudl
Extracting those that mapped
Recently, I found myself in need of extracting only those reads from a sequence alignment map (SAM) file that actually mapped to the reference genome, while maintaining the separation into the paired-end read design. By using a combination of samtools, bedtools and awk, this can be done very efficiently:
INFL=MySamples.sam
STEM=${INF%.sam*}
TMP1=${STEM}.extracted.sam
TMP2=${STEM}.sorted.sam
OTFL=${STEM}.fastq
# Extracting only paired mapped reads
samtools view -b -F12 $INFL > $TMP1
#samtools view -b -F4 $INFL > $TMP1
# Sorting SAM file by read header
# (necessary for the command `bedtools bamtofastq`)
samtools sort -n $TMP1 > $TMP2
# Extracting fastq sequences
bedtools bamtofastq -i $TMP2 -fq $OTFL
# Separating successfully mapped paired reads
# into an R1 file and an R2 file
cat $OTFL | awk -v n=4 'BEGIN{f=2} {
if((NR-1)%n==0){f=1-f}; print > "out_R" f ".fastq"
}'
mv out_R-1.fastq ${STEM}_R1.fastq
mv out_R2.fastq ${STEM}_R2.fastq
Kategorie bioinformatics | 0 Kommentare »
7. February 2019 von Michael Grünstäudl
Efficiency on the UNIX shell
I often find myself manually removing a set of phylogenetic trees from a posterior tree distribution in order to set a burn-in and then combining the post-burnin trees of the individual runs. This action can be done very efficiently using awk on a UNIX shell:
inf1=Mrbayes_test.run1.t
inf2=Mrbayes_test.run2.t
tmpf1=${inf1%.t*}_postBurnin.tre
tmpf2=${inf2%.t*}_postBurnin.tre
outf=${inf1%.run1.t*}_combined_postBurnin.tre
tac $inf1 | awk '$1 =="tree" && ++counter<=500
{print; next} $1 !="tree"' | tac > $tmpf1
tac $inf2 | awk '$1 =="tree" && ++counter<=500
{print; next} $1 !="tree"' | tac > $tmpf2
sed '/end;/r'<(grep -vFf "$tmpf1" "$tmpf2") $tmpf1 |
grep -v "end;" | sed '$ a\end;\' > $outf
Kategorie bioinformatics, one-liners | 0 Kommentare »
11. January 2019 von Michael Grünstäudl
Facilitating DNA sequence submissions to ENA
I just published a new paper on a Python package to facilitate the user-friendly submission of plant and fungal DNA barcoding sequences to ENA. Co-author of the software is an undergraduate student, who (as part of his bachelor thesis) designed the graphical user interface for the software.
Kategorie scientific papers | 0 Kommentare »
8. January 2019 von Michael Grünstäudl
Get out your cellphones.
Teaching botany at night makes plant identification an interesting challenge. Luckily, today’s students all come equipped with brand-new cellphones (with integrated flashlights). Thus, plant morphology does not need to wait until daylight or the next blooming season.
Teaching botany – Jan 2019
Kategorie teaching | 0 Kommentare »
30. December 2018 von Michael Grünstäudl
Notes from the outdoors
Morone chrysops (Moronidae) at Inks Lake State Park, Burnet County, Texas (30.7310°, -98.3706°)
Kategorie myself | 0 Kommentare »
8. November 2018 von Michael Grünstäudl
Digging through ENA records.
By using the Python library enasearch, which provides a Python-based interaction with ENA’s API, commandline searches of the ENA databases have become a delight.
Take, for example, the aim to infer the number of individual accessions of the common plant DNA barcoding marker trnK-matK that have become available over the course of the year 2017. This very specific aim can be achieved via the following command:
enasearch search_data \
--query 'organelle=plastid \
AND (first_public>2017-01-01 AND first_public<2018-01-01) \
AND (description="*matK*" OR description="*maturase K*") \
AND (description="*trnK*" OR description="*tRNA-Lys*") \
AND tax_division=PLN AND topology=LINEAR' \
--result sequence_release – display report | wc -l
Kategorie Allgemein | 0 Kommentare »
1. November 2018 von Michael Grünstäudl
Taking the essence.
Have you ever found yourself browsing through individual sequence records of the NCBI GenBank database and wishing that you could extract only the metadata information of a record (e.g., authors, publication status, taxonomy), but not the feature table of a record or the sequence itself? With the help of Entrez Direct and awk this is easy.
Take, for example, two complete plastid genomes of Cabomba, which are saved as GenBank accessions MG720558 and MG720559. You can easily extract the metadata in Bash via the following command:
efetch -db nucleotide -format gb -id MG720558,MG720559 |
awk '/FEATURES/{flag=1} /LOCUS/{flag=0} !flag'
Kategorie bioinformatics, one-liners | 0 Kommentare »
1. October 2018 von Michael Grünstäudl
Alpine plants in the mist
An alpine field trip in late September can mean cold temperatures and lots of foggy weather. Interesting plants abound nonetheless.
Rote Wand at Mt. Dobratsch |
Carduus defloratus (Asteraceae) |
Michael Gruenstaeudl in September 2018 |
Kategorie Allgemein | 1 Kommentar »