Extracting those that mapped
Recently, I found myself in need of extracting only those reads from a sequence alignment map (SAM) file that actually mapped to the reference genome, while maintaining the separation into the paired-end read design. By using a combination of samtools, bedtools and awk, this can be done very efficiently:
INFL=MySamples.sam STEM=${INF%.sam*} TMP1=${STEM}.extracted.sam TMP2=${STEM}.sorted.sam OTFL=${STEM}.fastq # Extracting only paired mapped reads samtools view -b -F12 $INFL > $TMP1 #samtools view -b -F4 $INFL > $TMP1 # Sorting SAM file by read header # (necessary for the command `bedtools bamtofastq`) samtools sort -n $TMP1 > $TMP2 # Extracting fastq sequences bedtools bamtofastq -i $TMP2 -fq $OTFL # Separating successfully mapped paired reads # into an R1 file and an R2 file cat $OTFL | awk -v n=4 'BEGIN{f=2} { if((NR-1)%n==0){f=1-f}; print > "out_R" f ".fastq" }' mv out_R-1.fastq ${STEM}_R1.fastq mv out_R2.fastq ${STEM}_R2.fastq