Randomisation of the paired-end read order in FastQ

I’ve been playing with Python, trying to randomise the order of paired-end (PE) reads in FastQ. After a very unsuccessful afternoon (my Python implementation took 10 minutes (!) to randomise 1M PE reads), I decided to try BASH.
The BASH-based solution is simple and efficient (12 seconds for 1M PE reads):

paste <(zcat test.1.fq.gz) <(zcat test.2.fq.gz) | paste - - - - | shuf | awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 > "random.1.fq"; print $2,$4,$6,$8 > "random.2.fq"}'
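
To see why the mates stay together, it may help to run the same pipeline on a toy example. The sketch below is only an illustration: the read names, sequences and the scratch directory are made up, and it reads uncompressed files so that no zcat or process substitution is needed. The first paste puts mate 1 and mate 2 lines side by side, the second paste folds the four lines of each record into one tab-separated row, so shuf moves whole read pairs at once.

```shell
# Toy demonstration: three made-up read pairs in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

for i in 1 2 3; do
  printf '@read%s/1\nACGT\n+\nIIII\n' "$i" >> test.1.fq
  printf '@read%s/2\nTGCA\n+\nIIII\n' "$i" >> test.2.fq
done

# Same pipeline as above, reading plain files instead of zcat output.
# After the two paste steps every row has 8 tab-separated fields:
# odd fields belong to mate 1, even fields to mate 2.
paste test.1.fq test.2.fq | paste - - - - | shuf |
  awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 > "random.1.fq"; print $2,$4,$6,$8 > "random.2.fq"}'

# Read names with the /1 or /2 suffix removed; the two lists should be identical.
names1=$(awk 'NR % 4 == 1' random.1.fq | sed 's/\/1$//')
names2=$(awk 'NR % 4 == 1' random.2.fq | sed 's/\/2$//')
[ "$names1" = "$names2" ] && echo "mates still paired"
```

The order of the pairs changes from run to run, but the n-th record of random.1.fq should always be the mate of the n-th record of random.2.fq.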

If you are only interested in a random subset of your FastQ file(s), e.g. 100K read pairs, you can specify its size with shuf -n 100000.
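
As a rough sketch of the subsetting variant, here is the pipeline with shuf -n on toy data. The file names subset.1.fq and subset.2.fq, the variable n and the five made-up read pairs are assumptions for illustration only; with real data you would keep the zcat inputs and use the subset size you need.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Five made-up read pairs (uncompressed, so no zcat is needed here).
for i in 1 2 3 4 5; do
  printf '@read%s/1\nACGT\n+\nIIII\n' "$i" >> test.1.fq
  printf '@read%s/2\nTGCA\n+\nIIII\n' "$i" >> test.2.fq
done

n=2  # number of read pairs to keep; the text above uses 100000

# shuf -n outputs at most n of the shuffled rows, i.e. n whole read pairs.
paste test.1.fq test.2.fq | paste - - - - | shuf -n "$n" |
  awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 > "subset.1.fq"; print $2,$4,$6,$8 > "subset.2.fq"}'

# Each FastQ record is 4 lines, so both files should hold n * 4 lines.
wc -l subset.1.fq subset.2.fq
```

Because -n is applied after the records have been folded into single rows, the number counts read pairs, not lines.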

For large FastQ files, it’s good to follow the progress of randomisation. This can be done by plugging pv into the pipeline. Additionally, the output files can be gzipped on the fly, saving lots of disk I/O operations. Finally, reads can be sampled/randomised from more than one library (reads1_1/2 and reads2_1/2), as follows:

pv -cN zcat reads1_1.fastq.gz reads2_1.fastq.gz | zcat | paste - <(zcat reads1_2.fastq.gz reads2_2.fastq.gz) | paste - - - - | pv -cN shuf | shuf | pv -cN awk | awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 | "gzip > random_1.fq.gz"; print $2,$4,$6,$8 | "gzip > random_2.fq.gz"}'
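
After a long run it is worth checking that both outputs are complete and still in sync. The sketch below replays the multi-library command on made-up data and then counts the lines of the gzipped outputs. It is an illustration under several assumptions: the toy read names are invented, pv is left out so the example does not depend on it being installed, gzip -cd is used in place of zcat, and a named pipe (mate2.fifo, a hypothetical name) stands in for the <(zcat ...) process substitution so that the example also runs in a plain sh.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two toy libraries with two read pairs each, gzipped like the real inputs.
for lib in 1 2; do
  for i in 1 2; do
    printf '@lib%s_read%s/1\nACGT\n+\nIIII\n' "$lib" "$i"
  done | gzip > "reads${lib}_1.fastq.gz"
  for i in 1 2; do
    printf '@lib%s_read%s/2\nTGCA\n+\nIIII\n' "$lib" "$i"
  done | gzip > "reads${lib}_2.fastq.gz"
done

# The named pipe plays the role of <(zcat reads1_2.fastq.gz reads2_2.fastq.gz).
mkfifo mate2.fifo
gzip -cd reads1_2.fastq.gz reads2_2.fastq.gz > mate2.fifo &

# awk pipes each mate into its own gzip process, compressing on the fly.
gzip -cd reads1_1.fastq.gz reads2_1.fastq.gz | paste - mate2.fifo | paste - - - - | shuf |
  awk -F'\t' '{OFS="\n"; print $1,$3,$5,$7 | "gzip > random_1.fq.gz"; print $2,$4,$6,$8 | "gzip > random_2.fq.gz"}'
wait  # let the background gzip -cd finish

# 2 libraries * 2 pairs * 4 lines per record = 16 lines expected in each file.
gzip -cd random_1.fq.gz | wc -l
gzip -cd random_2.fq.gz | wc -l
```

On real data the same two line counts should be equal to each other and divisible by four; if they are not, one of the input files was probably truncated or the mates were not in the same order to begin with.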