How to call the anticodon?
Over the past few days I have been correcting genomic annotations using custom bash and Python code. One of the more interesting exercises has been the homogenization of the “product” tags for transfer RNAs, which provide information about the respective anticodon sequences.
In most of the databases familiar to me, anticodons are indicated sometimes by their DNA sequence (e.g., transfer RNA-Ile (GAT)) and sometimes by their RNA sequence (e.g., transfer RNA-Ile (GAU)). I have yet to see a rule as to which version ought to be used.
In order to homogenize the spelling, I wrote a few lines of bash code. Interestingly, this coding problem is not a sed one-liner, but requires a fairly intricate awk command (please see this Stack Overflow discussion for details).
Here is the solution I eventually adopted:
echo -e "CompleteAssembly maker gene 1859 4482 . - . Name=trnK-UUU" > tmp
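# Find the keyword 'trn', skip it and the amino-acid letter, then replace U with T in the next pos characters only
awk -v kw='trn' -v pos=5 'p=index($0, kw) {n=p+length(kw)+1; s=substr($0, n, pos); gsub(/U/, "T", s); $0=substr($0, 1, n-1) s substr($0, n+pos)} 1' tmp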
CompleteAssembly maker gene 1859 4482 . - . Name=trnK-TTT
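For comparison, the same substitution can be sketched in Python with a regular expression. This is only a minimal sketch, not the code I actually used: the function name anticodon_to_dna and the pattern TRNA_NAME are hypothetical, and the pattern assumes that gene names follow the form "trn", an amino-acid code, a hyphen, and a three-letter anticodon, as in the example above.

```python
import re

# Assumed naming scheme (hypothetical pattern): "trn", one or more letters for
# the amino acid (e.g. K, or fM), a hyphen, then a three-base anticodon.
TRNA_NAME = re.compile(r"(trn[A-Za-z]+-)([ACGTU]{3})\b")


def anticodon_to_dna(line: str) -> str:
    """Replace U with T inside the anticodon only; the rest of the line is untouched."""
    return TRNA_NAME.sub(
        lambda m: m.group(1) + m.group(2).replace("U", "T"),
        line,
    )


if __name__ == "__main__":
    example = "CompleteAssembly maker gene 1859 4482 . - . Name=trnK-UUU"
    print(anticodon_to_dna(example))
```

Because the replacement is restricted to the matched anticodon, any other U in the line (for instance in a free-text note) stays as it is, which mirrors what the awk command achieves with its substr window.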