Alumnus of Freie Universität Berlin – Michael Grünstäudl, PhD

Successful habilitation in botany and bioinformatics

Filtering out unpaired raw reads from Illumina data

Working at the intersection

I have always wondered why an Illumina machine occasionally generates “unpaired” paired-end reads (i.e., when you receive an R1 but no corresponding R2, or vice versa). While I am waiting for a satisfactory answer, I would like to remove any unpaired reads from my data set in the meantime.

Superficially, this aim may appear to be achieved through joining paired-end Illumina reads, which can be done through several software tools (e.g., FastqJoin). In reality, however, all I wish to do is to filter out unpaired paired-end reads. Such a task is best completed by some shell scripting.

First, I define my infiles.

INF1=infile_R1.fastq
INF2=infile_R2.fastq

Second, I check if any unpaired paired reads are present.

c1=$(wc -l $INF1 | awk '{print $1}')
c2=$(wc -l $INF2 | awk '{print $1}')
if (($c1 == $c2)); then echo "TRUE"; else echo "FALSE"; fi

Third, I extract the sequence names (but not the actual sequences, nor the quality scores).

cat $INF1 | grep '^@' | grep -v ',' > $INF1.headers
cat $INF2 | grep '^@' | grep -v ',' > $INF2.headers

Fourth, I remove the unique pair-identifiers from the end of all sequence names. (This may vary depending on your data.)

sed -i 's/\_1\:N\:0\:3//g' *.headers
sed -i 's/\_2\:N\:0\:3//g' *.headers

Fifth, I list all intersecting elements.

comm -12 $INF1.headers $INF2.headers > intersect.headers

Sixth, I extract those reads from the input files, whose sequence names appear in the intersection. (Unfortunately, this step is slow. If anyone can come up with faster code, please write me.)

for i in $(cat intersect.headers); do cat $INF1 | grep -A3 ^$i >> $INF1.intersect; done &
for i in $(cat intersect.headers); do cat $INF2 | grep -A3 ^$i >> $INF2.intersect; done &

Seventh, I perform some file hygiene.

rm *.headers

You then continue to work with the filtered files $INF1.intersect and $INF2.intersect, respectively.

 

EDIT: Upon bouncing some ideas off of Stackoverflow, I found a much faster code to conduct step 6 of the above pipeline (i.e., extracting those reads from the input files, whose sequence names appear in the intersection). Hence, the following code should be used instead:


grep -A3 -Ff intersect.headers $INF1 >> $INF1.intersect &
grep -A3 -Ff intersect.headers $INF2 >> $INF2.intersect &

 

 

Der Beitrag wurde am Monday, den 3. August 2015 um 16:24 Uhr von Michael Grünstäudl veröffentlicht und wurde unter bioinformatics abgelegt. Sie können die Kommentare zu diesem Eintrag durch den RSS 2.0 Feed verfolgen. Sie können einen Kommentar schreiben, oder einen Trackback auf Ihrer Seite einrichten.

Leave a Reply

Captcha
Refresh
Hilfe
Hinweis / Hint
Das Captcha kann Kleinbuchstaben, Ziffern und die Sonderzeichzeichen »?!#%&« enthalten.
The captcha could contain lower case, numeric characters and special characters as »!#%&«.