Priming for success
It has been several years since I last developed a pair of customized oligonucleotide primers for DNA sequencing. At that time I tended to operate software via GUIs, not knowing that an entire toolkit of command-line tools exists that can get the job done faster and more efficiently. Today I have an opportunity to explore this toolkit.
Assume that you have (a) a tab-delimited database of oligonucleotide primers (e.g., primer_database.tsv), and (b) a FASTA-formatted DNA alignment of a target organism (e.g., target_alignment.fas). Your primer database contains hundreds of primers, each occupying one line and followed (on the same line) by information such as the target gene (e.g., matK) and the strand orientation (e.g., F for forward, R for reverse). How can you identify those primers within your database that will likely bind to one or more sequences of the target organism?
First, you extract your primers of interest via grep and awk.
# Keep forward (F) matK primers; write columns 2 and 3 to a new file.
grep matK primer_database.tsv | grep $'\tF\t' |
    awk -F '\t' '{print $2, $3}' > matK_forw.tsv
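The grep chain above matches "matK" anywhere on the line, so a primer that merely mentions matK in a comment column would also be selected. A minimal sketch of a stricter filter that compares whole columns in awk follows. The column layout (id, name, sequence, gene, orientation, note), the file names, and the primer entries are made up for illustration; adjust the column numbers to your own database.

```shell
# Made-up demo database; assumed columns: id, name, sequence, gene, orientation, note.
printf '1\tprimerA\tACGTTGCAACGTTGCAACGT\tmatK\tF\tdemo\n'  > demo_primers.tsv
printf '2\tprimerB\tTTGGCCAATTGGCCAATTGG\tmatK\tR\tdemo\n' >> demo_primers.tsv
printf '3\tprimerC\tGGATCCGGATCCGGATCCGG\trbcL\tF\tdemo\n' >> demo_primers.tsv

# Select rows whose gene column is exactly matK and whose orientation is exactly F,
# then keep name and sequence, tab-separated.
awk -F '\t' -v OFS='\t' '$4 == "matK" && $5 == "F" {print $2, $3}' \
    demo_primers.tsv > demo_matK_forw.tsv
```

Because the comparison is made per column, neither the gene name nor the orientation letter can match by accident elsewhere on the line.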
Second, you loop through your primers of interest and match them to the target alignment via fqgrep, allowing a specified number of mismatches.
for i in $(awk '{print $2}' matK_forw.tsv); do
    fqgrep -m 1 -p "$i" -r target_alignment.fas >> forward_oneMismatch.txt
done
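Before allowing mismatches, it can help to have an exact-match baseline to compare against. The sketch below uses only awk and grep, not fqgrep: it joins wrapped sequence lines, strips alignment gap characters (which would otherwise interrupt a binding site), and counts how many sequences contain each primer verbatim. The file names and the tiny alignment are made up for illustration, and the primer file is assumed to hold a name and a sequence per line.

```shell
# Made-up demo input: a wrapped, gapped two-sequence alignment and one primer.
printf '>seq1\nAAACGT-TGCA\nACGTTT\n>seq2\nGGGGCCCC\nTTTT\n' > demo_alignment.fas
printf 'primerA\tACGTTGCAACGT\n' > demo_primers_forw.tsv

# Put each sequence on a single line and remove gap characters.
awk '/^>/ {if (seq) print seq; print; seq = ""; next}
     {gsub(/-/, ""); seq = seq $0}
     END {if (seq) print seq}' demo_alignment.fas > demo_degapped.fas

# For each primer, count the sequences that contain it exactly (case-insensitive).
while read -r name primer; do
    printf '%s\t%s\n' "$name" "$(grep -c -F -i "$primer" demo_degapped.fas)"
done < demo_primers_forw.tsv > demo_exact_counts.tsv
```

A primer with zero exact hits but several one-mismatch hits is then easy to spot. Note that this check only looks at the strand as written in the alignment.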
Third, you reduce the resulting list of matches so that extraneous information is filtered out and all matches are combined into a single text file (e.g., combined.txt).
for i in *Mismatch.txt; do
    echo "$i" >> combined.txt
    awk -F '\t' '{print $1, $3, $7, $8, $9}' "$i" >> combined.txt
    printf '\n' >> combined.txt
done
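The same summary can also be produced by a single awk call, which avoids appending to combined.txt on every iteration (and thus avoids stale content from an earlier run). This is a sketch under assumptions: the demo directory, file names, and nine placeholder columns are invented, and the blank line is written between files rather than after each one.

```shell
# Made-up demo match files with nine tab-separated placeholder columns each.
mkdir -p demo_hits
printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\n' > demo_hits/forward_oneMismatch.txt
printf 'd1\td2\td3\td4\td5\td6\td7\td8\td9\n' > demo_hits/reverse_oneMismatch.txt

# Print each file name as a header, then the selected columns;
# a blank line separates consecutive files.
(
    cd demo_hits &&
    awk -F '\t' 'FNR == 1 {if (NR > 1) print ""; print FILENAME}
                 {print $1, $3, $7, $8, $9}' *Mismatch.txt > combined.txt
)
```

Empty match files produce no header in this variant, since awk never reads a first line from them.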
You now have a list of primers that match the target sequences and can evaluate other factors relevant to the design of oligonucleotide primers, such as amplicon length, melting and annealing temperatures, and hairpin and dimer formation.
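As a first pass on melting temperature, the Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius, gives a rough estimate for short oligonucleotides. The sketch below applies it with awk to a made-up primer file (name and sequence per line); it ignores degenerate bases, salt, and primer concentration, so treat the numbers as a coarse screen rather than a substitute for dedicated primer-design software.

```shell
# Made-up demo primer file: name and sequence per line.
printf 'primerA\tACGTTGCAACGT\n' > demo_primers_forw.tsv

# Output columns: name, length, Wallace-rule Tm estimate in degrees Celsius.
awk -v OFS='\t' '{
    seq = toupper($2)
    gc = gsub(/[GC]/, "", seq)   # gsub returns the number of G/C bases removed
    at = gsub(/[AT]/, "", seq)   # likewise for A/T bases
    print $1, length($2), 2 * at + 4 * gc
}' demo_primers_forw.tsv > demo_tm.tsv
```

For the demo primer (six A/T and six G/C bases) this yields an estimate of 36 degrees Celsius.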
P.S. Special thanks go to my colleague Nadja K. for reminding me of relevant issues involved in primer design.