A post without a spello in the title
Today, I attempted to blast entire chloroplast genomes against NCBI’s nucleotide database via the BLASTn command-line tool. Since typical plastomes are between 150,000 and 160,000 bp in length, remotely conducted BLASTn searches take approximately 20 minutes on average.
time blastn -db nt -query myinputseq.fasta -remote -out results.txt
real 21m25.189s
user 0m0.070s
sys 0m0.010s
Can we speed up such searches by splitting the input query sequence into equally-sized, smaller pieces and blasting each piece separately?
# Splitting the input query sequence into ten equally-sized, smaller pieces
# (assumes the FASTA sequence sits on a single line, which 'tail -n1' uses for the byte count)
INF=myinputseq.fasta
split -d -b $(bc <<< $(tail -n1 "$INF" | wc -c)/10) "$INF" prt
# Blasting each piece against NCBI’s nucleotide database
for i in prt*; do
    echo "$i" >> results.txt
    # 'time' writes to stderr, so redirect stderr to capture the timings in time.txt
    { time blastn -db nt -query "$i" -remote -outfmt '7 length pident sscinames' -max_target_seqs 10 -out "$i.result" ; } 2>> time.txt
    rm "$i"
done
real 0m15.161s
user 0m0.060s
sys 0m0.010s
real 0m28.901s
user 0m0.060s
sys 0m0.013s
real 0m27.328s
user 0m0.057s
sys 0m0.013s
real 0m14.909s
user 0m0.067s
sys 0m0.010s
real 0m13.689s
user 0m0.043s
sys 0m0.023s
etc.
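For completeness, the per-piece reports written by the loop above (prt00.result, prt01.result, …) can be stitched back together into a single file for inspection. A minimal sketch, assuming the .result files are still in the working directory; the file name combined_results.txt is my own choice, not part of the run above:
# Concatenating the per-piece BLAST reports into one overview file
for f in prt*.result; do
    echo "## $f" >> combined_results.txt   # mark which piece the following hits belong to
    cat "$f" >> combined_results.txt
done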
Yes, we can! (However, I bet that the above scenario was a lucky instance, and that the speed-up gained by reducing the query sequence size is not always this pronounced.)
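One way to test that hunch would be to repeat the split-and-blast procedure with different numbers of pieces and compare the wall-clock times. A rough sketch, reusing the single-line FASTA trick from above; the piece counts and the file name benchmark.txt are arbitrary choices of mine, not part of the original run:
INF=myinputseq.fasta
for n in 2 5 10 20; do
    echo "Number of pieces: $n" >> benchmark.txt
    # Same single-line FASTA assumption as above: piece size = sequence length / n
    split -d -b $(bc <<< $(tail -n1 "$INF" | wc -c)/$n) "$INF" prt
    for i in prt[0-9][0-9]; do   # match only the pieces, not any earlier *.result files
        { time blastn -db nt -query "$i" -remote -outfmt '7 length pident sscinames' -max_target_seqs 10 -out "$i.n$n.result" ; } 2>> benchmark.txt
        rm "$i"
    done
done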