Recently, a Master’s student of mine asked me to recalculate some summary statistics for an Illumina sequencing run. Among the desired statistics were (a) the total number of read bases (in bp), (b) the total number of reads, (c) the GC content (in %), and (d) the AT content (in %).
Since the average file size of this run was almost a dozen GB per sample, I wrote some Bash code that can calculate these statistics within minutes.
SAMPLE=mySampleName

# Total read bases (bp): -h drops the file-name prefix; tr removes the newlines that wc -m would otherwise count
AllBases=$(grep -h -A1 --no-group-separator "^@" "${SAMPLE}"_*.fastq | \
  grep -v "@" | tr -d '\n' | wc -m)
echo "$AllBases"

# Total reads: line count / 4, because quality lines may also start with "@" (assumes four-line records)
echo $(( $(cat "${SAMPLE}"_*.fastq | wc -l) / 4 ))

# GC(%):
GandCs=$(grep -h -A1 --no-group-separator "^@" "${SAMPLE}"_*.fastq | \
  grep -v "@" | tr -cd GC | wc -m)
echo "scale=2; 100 * $GandCs / $AllBases" | bc

# AT(%):
AandTs=$(grep -h -A1 --no-group-separator "^@" "${SAMPLE}"_*.fastq | \
  grep -v "@" | tr -cd AT | wc -m)
echo "scale=2; 100 * $AandTs / $AllBases" | bc
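The four statistics can also be gathered in a single pass over the files, which avoids re-reading several GB of data for every number. The following is only a minimal sketch under stated assumptions: it presumes strict four-line FASTQ records (no wrapped sequence lines) and uppercase bases, and the function name `fastq_stats` is a hypothetical helper introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: compute all four statistics in one pass with awk.
# Assumption: every record spans exactly four lines, so the
# sequence is always line 2 of each group of four.
# fastq_stats is a hypothetical helper name.
fastq_stats() {
    cat "$@" | awk '
        NR % 4 == 2 {
            reads++
            bases += length($0)        # count bases before gsub edits the line
            gc += gsub(/[GC]/, "")     # gsub returns the number of substitutions
            at += gsub(/[AT]/, "")
        }
        END {
            if (bases == 0) { print "no bases found"; exit 1 }
            # %.0f rather than %d, since totals can exceed 32-bit integers
            printf "Total read bases (bp): %.0f\n", bases
            printf "Total reads: %.0f\n", reads
            printf "GC(%%): %.2f\n", 100 * gc / bases
            printf "AT(%%): %.2f\n", 100 * at / bases
        }'
}
```

It could be called as `fastq_stats "${SAMPLE}"_*.fastq`. Bases other than A, C, G, and T (for example N) count toward the total, so the GC and AT percentages need not add up to 100.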