{"id":57,"date":"2015-08-03T16:24:40","date_gmt":"2015-08-03T14:24:40","guid":{"rendered":"https:\/\/blogs.fu-berlin.de\/gruenstaeudl\/?p=57"},"modified":"2015-08-25T17:36:27","modified_gmt":"2015-08-25T15:36:27","slug":"filtering-out-unpaired-raw-reads-from-illumina-data","status":"publish","type":"post","link":"https:\/\/blogs.fu-berlin.de\/gruenstaeudl\/2015\/08\/03\/filtering-out-unpaired-raw-reads-from-illumina-data\/","title":{"rendered":"Filtering out unpaired raw reads from Illumina data"},"content":{"rendered":"<p><strong>Working at the intersection<\/strong><\/p>\n<p>I have always wondered why an Illumina machine occasionally generates &#8220;unpaired&#8221; paired-end reads (i.e., when you receive an R1 but no corresponding R2, or vice versa). While I am waiting for a satisfactory answer, I would like to remove any unpaired reads from my data set in the meantime.<\/p>\n<p>Superficially, this aim may appear to be achieved through joining paired-end Illumina reads, which can be done through several software tools (e.g., <a href=\"https:\/\/code.google.com\/p\/ea-utils\/wiki\/FastqJoin\">FastqJoin<\/a>). In reality, however, all I wish to do is to filter out unpaired paired-end reads. Such a task is best completed by some shell scripting.<\/p>\n<p>First, I define my infiles.<\/p>\n<p style=\"padding-left: 30px\"><code>INF1=infile_R1.fastq<br \/>\nINF2=infile_R2.fastq<br \/>\n<\/code><\/p>\n<p>Second, I check if any unpaired paired reads are present.<\/p>\n<p style=\"padding-left: 30px\"><code>c1=$(wc -l $INF1 | awk '{print $1}')<br \/>\nc2=$(wc -l $INF2 | awk '{print $1}')<br \/>\nif (($c1 == $c2)); then echo \"TRUE\"; else echo \"FALSE\"; fi<br \/>\n<\/code><\/p>\n<p>Third, I extract the sequence names (but not the actual sequences, nor the quality scores).<\/p>\n<p style=\"padding-left: 30px\"><code>cat $INF1 | grep '^@' | grep -v ',' &gt; $INF1.headers<br \/>\ncat $INF2 | grep '^@' | grep -v ',' &gt; $INF2.headers<br \/>\n<\/code><\/p>\n<p>Fourth, I remove the unique pair-identifiers from the end of all sequence names. (This may vary depending on your data.)<\/p>\n<p style=\"padding-left: 30px\"><code>sed -i 's\/\\_1\\:N\\:0\\:3\/\/g' *.headers<br \/>\nsed -i 's\/\\_2\\:N\\:0\\:3\/\/g' *.headers<br \/>\n<\/code><\/p>\n<p>Fifth, I list all intersecting elements.<\/p>\n<p style=\"padding-left: 30px\"><code>comm -12 $INF1.headers $INF2.headers &gt; intersect.headers<br \/>\n<\/code><\/p>\n<p>Sixth, I extract those reads from the input files, whose sequence names appear in the intersection. (Unfortunately, this step is slow. If anyone can come up with faster code, please write me.)<\/p>\n<p style=\"padding-left: 30px\"><code>for i in $(cat intersect.headers); do cat $INF1 | grep -A3 ^$i &gt;&gt; $INF1.intersect; done &amp;<br \/>\nfor i in $(cat intersect.headers); do cat $INF2 | grep -A3 ^$i &gt;&gt; $INF2.intersect; done &amp;<\/code><\/p>\n<p>Seventh, I perform some file hygiene.<\/p>\n<p style=\"padding-left: 30px\"><code>rm *.headers<\/code><\/p>\n<p>You then continue to work with the filtered files <em>$INF1.intersect<\/em> and <em>$INF2.intersect<\/em>, respectively.<\/p>\n<p>&nbsp;<\/p>\n<p><em>EDIT:<\/em> Upon bouncing some ideas off of <a href=\"https:\/\/stackoverflow.com\/questions\/31795202\/faster-alternative-for-grepping-within-a-loop\" target=\"_blank\">Stackoverflow<\/a>, I found a much faster code to conduct step 6 of the above pipeline (i.e., extracting those reads from the input files, whose sequence names appear in the intersection). Hence, the following code should be used instead:<\/p>\n<p style=\"padding-left: 30px\"><code><br \/>\ngrep -A3 -Ff intersect.headers $INF1 &gt;&gt; $INF1.intersect &amp;<br \/>\ngrep -A3 -Ff intersect.headers $INF2 &gt;&gt; $INF2.intersect &amp;<br \/>\n<\/code><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Working at the intersection I have always wondered why an Illumina machine occasionally generates &#8220;unpaired&#8221; paired-end reads (i.e., when you receive an R1 but no corresponding R2, or vice versa). While I am waiting for a satisfactory answer, I would like to remove any unpaired reads from my data set in the meantime. Superficially, this [&hellip;]<\/p>\n","protected":false},"author":2306,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[57598],"tags":[],"class_list":["post-57","post","type-post","status-publish","format-standard","hentry","category-bioinformatics"],"_links":{"self":[{"href":"https:\/\/blogs.fu-berlin.de\/gruenstaeudl\/wp-json\/wp\/v2\/posts\/57","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.fu-berlin.de\/gruenstaeudl\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.fu-berlin.de\/gruenstaeudl\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.fu-berlin.de\/gruenstaeudl\/wp-json\/wp\/v2\/users\/2306"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.fu-berlin.de\/gruenstaeudl\/wp-json\/wp\/v2\/comments?post=57"}],"version-history":[{"count":10,"href":"https:\/\/blogs.fu-berlin.de\/gruenstaeudl\/wp-json\/wp\/v2\/posts\/57\/revisions"}],"predecessor-version":[{"id":72,"href":"https:\/\/blogs.fu-berlin.de\/gruenstaeudl\/wp-json\/wp\/v2\/posts\/57\/revisions\/72"}],"wp:attachment":[{"href":"https:\/\/blogs.fu-berlin.de\/gruenstaeudl\/wp-json\/wp\/v2\/media?parent=57"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.fu-berlin.de\/gruenstaeudl\/wp-json\/wp\/v2\/categories?post=57"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.fu-berlin.de\/gruenstaeudl\/wp-json\/wp\/v2\/tags?post=57"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}