Working at the intersection
I have always wondered why an Illumina machine occasionally generates “unpaired” paired-end reads (i.e., when you receive an R1 but no corresponding R2, or vice versa). While I am waiting for a satisfactory answer, I would like to remove any unpaired reads from my data set in the meantime.
Superficially, this aim may appear to be achieved through joining paired-end Illumina reads, which can be done through several software tools (e.g., FastqJoin). In reality, however, all I wish to do is to filter out unpaired paired-end reads. Such a task is best completed by some shell scripting.
First, I define my infiles.
INF1=infile_R1.fastq
INF2=infile_R2.fastq
Second, I check if any unpaired paired reads are present.
c1=$(wc -l $INF1 | awk '{print $1}')
c2=$(wc -l $INF2 | awk '{print $1}')
if (($c1 == $c2)); then echo "TRUE"; else echo "FALSE"; fi
Third, I extract the sequence names (but not the actual sequences, nor the quality scores).
cat $INF1 | grep '^@' | grep -v ',' > $INF1.headers
cat $INF2 | grep '^@' | grep -v ',' > $INF2.headers
Fourth, I remove the unique pair-identifiers from the end of all sequence names. (This may vary depending on your data.)
sed -i 's/\_1\:N\:0\:3//g' *.headers
sed -i 's/\_2\:N\:0\:3//g' *.headers
Fifth, I list all intersecting elements.
comm -12 $INF1.headers $INF2.headers > intersect.headers
Sixth, I extract those reads from the input files, whose sequence names appear in the intersection. (Unfortunately, this step is slow. If anyone can come up with faster code, please write me.)
for i in $(cat intersect.headers); do cat $INF1 | grep -A3 ^$i >> $INF1.intersect; done &
for i in $(cat intersect.headers); do cat $INF2 | grep -A3 ^$i >> $INF2.intersect; done &
Seventh, I perform some file hygiene.
rm *.headers
You then continue to work with the filtered files $INF1.intersect and $INF2.intersect, respectively.
EDIT: Upon bouncing some ideas off of Stackoverflow, I found a much faster code to conduct step 6 of the above pipeline (i.e., extracting those reads from the input files, whose sequence names appear in the intersection). Hence, the following code should be used instead:
grep -A3 -Ff intersect.headers $INF1 >> $INF1.intersect &
grep -A3 -Ff intersect.headers $INF2 >> $INF2.intersect &