Michael Grünstäudl (Gruenstaeudl), PhD

Filtering out unpaired raw reads from Illumina data

Working at the intersection

I have always wondered why an Illumina machine occasionally generates “unpaired” paired-end reads (i.e., when you receive an R1 but no corresponding R2, or vice versa). While I am waiting for a satisfactory answer, I would like to remove any unpaired reads from my data set in the meantime.

Superficially, this aim may appear to be achieved through joining paired-end Illumina reads, which can be done through several software tools (e.g., FastqJoin). In reality, however, all I wish to do is to filter out unpaired paired-end reads. Such a task is best completed by some shell scripting.

First, I define my infiles.

INF1=infile_R1.fastq INF2=infile_R2.fastq

Second, I check if any unpaired paired reads are present.

c1=$(wc -l $INF1 | awk '{print $1}') c2=$(wc -l $INF2 | awk '{print $1}') if (($c1 == $c2)); then echo "TRUE"; else echo "FALSE"; fi

Third, I extract the sequence names (but not the actual sequences, nor the quality scores).

cat $INF1 | grep '^@' | grep -v ',' > $INF1.headers cat $INF2 | grep '^@' | grep -v ',' > $INF2.headers

Fourth, I remove the unique pair-identifiers from the end of all sequence names. (This may vary depending on your data.)

sed -i 's/\_1\:N\:0\:3//g' *.headers sed -i 's/\_2\:N\:0\:3//g' *.headers

Fifth, I list all intersecting elements.

comm -12 $INF1.headers $INF2.headers > intersect.headers

Sixth, I extract those reads from the input files, whose sequence names appear in the intersection. (Unfortunately, this step is slow. If anyone can come up with faster code, please write me.)

for i in $(cat intersect.headers); do cat $INF1 | grep -A3 ^$i >> $INF1.intersect; done & for i in $(cat intersect.headers); do cat $INF2 | grep -A3 ^$i >> $INF2.intersect; done &

Seventh, I perform some file hygiene.

rm *.headers

You then continue to work with the filtered files $INF1.intersect and $INF2.intersect, respectively.

EDIT: Upon bouncing some ideas off of Stackoverflow, I found a much faster code to conduct step 6 of the above pipeline (i.e., extracting those reads from the input files, whose sequence names appear in the intersection). Hence, the following code should be used instead:

grep -A3 -Ff intersect.headers $INF1 >> $INF1.intersect & grep -A3 -Ff intersect.headers $INF2 >> $INF2.intersect &

Der Beitrag wurde am Monday, den 3. August 2015 um 16:24 Uhr von Michael Grünstäudl veröffentlicht und wurde unter bioinformatics abgelegt. Sie können die Kommentare zu diesem Eintrag durch den RSS 2.0 Feed verfolgen. Sie können einen Kommentar schreiben, oder einen Trackback auf Ihrer Seite einrichten.

Freie Universität Berlin

Service-Navigation

Postdoctoral Researcher at the Freie Universität Berlin

Filtering out unpaired raw reads from Illumina data

Leave a Reply

Archives