Alumnus of Freie Universität Berlin – Michael Grünstäudl, PhD

Successful habilitation in botany and bioinformatics

Buena vista con mVISTA

For an upcoming publication, a doctoral student and I want to visualize the sequence variability among several plastid genomes via the tool mVISTA. This tool is often employed in plastid phylogenomic studies and generally simple to use. However, if a user wishes to input custom annotations, it can be quite tricky to generate the correct input for the software. Hence, I automated the generation of the input files as described below.

An effective visualization of plastid genomes in mVISTA requires the x input genomes the user wants to visualize (“input genomes” hereafter; e.g., x=2) as well as a reference genome. For simplicity, a user could employ x1 as the reference genome. The input genomes should be in FASTA format, the reference genome in GenBank format.

To generate custom annotations based on the reference genome, I employ the following Bash code, which (a) converts the reference genome from GenBank format to a cleaned GFF3 format (and incidentally also saves the genome in FASTA format), and (b) converts the cleaned GFF3 file to the mVISTA input.

INF=NC_000932.gb # Just an example
## Converting input file from GenBank format to a cleaned GFF3 format
# Note: This step also generates a FASTA file
grep -vE "codon_start|db_xref|exception" $INF > ${INF}2
to-gff – getfasta ${INF}2 ${INF%.gb*}.gff
grep "gene=" ${INF%.gb*}.gff | \
    awk -F';' '{print $1}' | \
    grep -vE "remark|intron|misc_feature|repeat_region" > ${INF%.gb*}.gff.clean
rm ${INF}2 ${INF%.gb*}.gff
## Converting input file from GFF3 format to VISTA format
grep -v "^#" ${INF%.gb*}.gff.clean | \
    grep -v "rps12" | \
    awk '{if ($3 ~ /gene/) {print $7" "$4" "$5" "$3" "$9} else {print $7" "$4" "$5" "$3} }' | \
    awk '{if ($4 ~ /gene/) {gsub(/\+/, ">", $1); gsub(/\-/, "<", $1); print $0} else {$1=""; print $0} }' | \
    sed 's/gene=//' | \
    sed 's/gene/agene/' | \
    sort -n -k2 | \
    sed 's/agene/gene/' | \
    awk '{$1=$1}1' | \
    sed 's/CDS/exon/' | \
    sed 's/tRNA/utr/' | \
    sed 's/rRNA/utr/' > ${INF%.gb*}.mvista
mVISTA example

mVISTA example

Der Beitrag wurde am Monday, den 12. October 2020 um 17:41 Uhr von Michael Grünstäudl veröffentlicht und wurde unter bioinformatics abgelegt. Sie können die Kommentare zu diesem Eintrag durch den RSS 2.0 Feed verfolgen. Sie können einen Kommentar schreiben, oder einen Trackback auf Ihrer Seite einrichten.

7 Reaktionen zu “Buena vista con mVISTA”

  1. Mergi Dinka

    I was trying to use your script to convert my gb file to gff for mvista format but it didn’t work for me especially this script gives error continuously. what was the case? could you explain me the reason and how should i use please?
    Best
    Mergi

  2. Mergi Dinka

    I was trying to use your script to convert my gb file to gff for mvista format but it didn’t work for me especially this script “to-gff – getfasta ${INF}2 ${INF%.gb*}.gff” gives error continuously. what was the case? could you explain me the reason and how should i use please?
    Best
    Mergi

  3. Hassan Rasouli

    I have a similar problem. May I kindly ask you to let me know how I can solve this error?

    to-gff: command not found

  4. Andres Villavicencio

    To anyone getting the error with to-gff, I think this might help:

    You have to get the to-gff tool from github:
    https://github.com/mscook/to-gff/tree/master

    Then you have to install it using python 2, it does not work with python 3.
    Since python 2 is deprecated, you can use virtualenv or any other tool to use python 2 instead of python 3.

    The line in the example has to be corrected as well, you need to use:
    to-gff –getfasta ${INF}2 ${INF%.gb*}.gff

    This should solve the problem.

  5. Arief Priyadi

    I got the same problem that to-gff: command not found. I substituted this step by converting gb to gff by using NCBI Genome Workbench. Finally, I succeeded in generating the mvista file.

  6. Hassan Rasouli

    Thank you, Arief. Your method works well.

    In simple terms, I converted “.gb” files into “.gff” format using the NCBI Genome Workbench. You can download this software here: https://www.ncbi.nlm.nih.gov/tools/gbench/downloads/

    To do this, first, download your target sequences using GenBank accession IDs in NGW and then export your data in “.gff” format.

    Secondly, disregard the first part of this script and modify the content as shown below:

    ## Converting input file from GFF3 format to VISTA format
    INF=ID.gb # Just an example

    grep -v “^#” “${INF%.gb*}.gff.clean” | \
    grep -v “rps12″ | \
    awk ‘{if ($3 ~ /gene/) {print $7” “$4” “$5” “$3” “$9} else {print $7” “$4” “$5” “$3} }’ | \
    awk ‘{if ($4 ~ /gene/) {gsub(/\+/, “>”, $1); gsub(/\-/, ” “${INF%.gb*}.mvista”

    Now you can use the generated outputs from this script to annotate your mvista alignment. Please note that you should manually modify gene names in the generated outputs to align with your gene names above the related locus in the final annotation represented in the output of mvista.

  7. Alan Thomas S

    # create new environment and install bcbio-gff
    conda create -n mvista
    conda activate mvista
    conda install bioconda::bcbio-gff

    # install github and to-gff
    sudo apt install git
    git clone https://github.com/mscook/to-gff.git
    cd to-gff
    pip install .

    you should give two hyphen before getfasta. to-gff –getfasta ${INF}2 ${INF%.gb*}.gff

Leave a Reply

Captcha
Refresh
Hilfe
Hinweis / Hint
Das Captcha kann Kleinbuchstaben, Ziffern und die Sonderzeichzeichen »?!#%&« enthalten.
The captcha could contain lower case, numeric characters and special characters as »!#%&«.