Wicked-fast transcript quantification with Salmon

Traditionally, in order to quantify transcript abundance from RNA-Seq, one has to first align the reads onto the reference and analyse these alignments. While widely accepted, this approach has several disadvantages:

  • alignment step is slow, especially in splice-aware mode
  • spliced alignments are error-prone
  • huge intermediate files are produced (.bam)
  • transcript quantification from these intermediate files is also slow

At RECOMB 2015, Rob Patro presented two algorithms enabling rapid transcript quantification from RNA-Seq. Sailfish is an alignment-free method, while its successor, Salmon, performs lightweight alignment, identifying just the super maximal exact matches (SMEMs).

On my data, Salmon is ~3 times faster (6-7min) and uses ~20 times less memory (1.1GB) than STAR (~20min / ~24GB). Note that STAR results (.bam) need to be analysed further (e.g. with cufflinks) in order to quantify transcript abundances. This step takes another 10min, so we get transcript abundances after 6-7min from Salmon and after 30min from STAR + cufflinks. Just for comparison, a similar analysis done with tophat2 takes over 6 hours!

  1. Install dependencies & salmon
    [bash]
    # install dependencies
    ## here it’s important to remove any older boost version before installing boost1.55
    sudo apt-get remove libboost-all-dev
    sudo apt-get install libbz2-dev libtbb-dev libboost1.55-all-dev

    # clone salmon repo (https works without an SSH key set up on GitHub)
    git clone https://github.com/COMBINE-lab/salmon.git

    # and build (cmake needs the path to the source directory, here ".")
    cd salmon
    cmake -DBOOST_ROOT=/usr/include/boost -DTBB_INSTALL_DIR=/usr/include/tbb .
    make && make install && make test

    # add to .bashrc (\$PATH is escaped, so it is expanded at login and not now)
    echo "# salmon" >> ~/.bashrc
    echo "export PATH=\$PATH:$(pwd)/bin" >> ~/.bashrc

    # open new BASH window (Ctrl + Shift + T) or reload environment variables
    source ~/.bashrc
    [/bash]

  2. Index transcriptome
    [bash]salmon index -t transcripts.fa -i transcripts.index[/bash]

  3. Quantify transcript abundances
    You have to specify the library type correctly (-l).
    [bash]salmon quant -p 4 -l SF -i ref/transcripts.index -r <(zcat sample1.fq.gz) -o sample1[/bash]

Note that here the reads are decompressed using process substitution. This is a very handy way of feeding preprocessed data to a program through Unix pipes, without writing a decompressed copy to disk.
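To see what process substitution does without running Salmon, here is a minimal sketch. The toy FASTQ file and the count_reads helper are made up for illustration only; gzip -dc is used in place of zcat simply because it behaves the same on Linux and macOS.

```bash
#!/usr/bin/env bash
# Process substitution demo: <(cmd) exposes the output of cmd as a file-like path.
set -euo pipefail

tmp=$(mktemp -d)

# hypothetical toy FASTQ with two reads, stored gzipped
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "$tmp/sample1.fq.gz"

# count_reads FILE: number of FASTQ records (4 lines each) in a plain-text file
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# count_reads expects a file name; <(...) hands it a path such as /dev/fd/63
# that streams the decompressed reads, so nothing is written to disk
n_reads=$(count_reads <(gzip -dc "$tmp/sample1.fq.gz"))
echo "reads: $n_reads"

rm -r "$tmp"
```

Any program that takes a file name, like salmon quant -r above, can be fed this way, as long as it reads the input sequentially and doesn’t need to seek in it.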

SOLiD RNA-Seq & splice-aware mapping

I’ve lost quite a lot of time trying to align color-space RNA-Seq reads. The SHRiMP paper nicely explains why it’s important to align SOLiD reads in color-space instead of converting them directly into sequence-space. Below you can find the simplest solution I have found: tophat running on top of the bowtie mapper (bowtie2 doesn’t support color-space), with color-space reads in .csfasta.
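As a side note, the problem with naive conversion can be illustrated with a few lines of bash. This is only a sketch of the standard SOLiD di-base code as I understand it (with A=0, C=1, G=2, T=3, a color is the XOR of two adjacent bases); decode_colorspace is a made-up helper, it ignores missing calls ('.'), and it is not what any of the mappers actually run.

```bash
#!/usr/bin/env bash
# Sketch of SOLiD di-base decoding: each color (0-3) encodes the transition
# between two adjacent bases, so every decoded base depends on the previous one.

# decode_colorspace READ: READ is the known primer base followed by colors, e.g. T0120
decode_colorspace() {
    local read=$1
    local prev=${read:0:1}    # primer base, the starting point of the decoding
    local colors=${read:1}
    local bases=ACGT
    local out="" i color idx
    for (( i = 0; i < ${#colors}; i++ )); do
        color=${colors:i:1}
        case $prev in A) idx=0;; C) idx=1;; G) idx=2;; T) idx=3;; esac
        idx=$(( idx ^ color ))    # next base = previous base XOR color
        prev=${bases:idx:1}
        out+=$prev
    done
    echo "$out"
}

decode_colorspace T0120    # original read
decode_colorspace T0220    # same read with a single miscalled color (2nd position)
```

The first call prints TGAA and the second TCTT: one wrong color changes every base downstream of it, while in color-space the two reads differ by a single mismatch. That is why the alignment itself should be done in color-space.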

# generate genome index in color-space
bowtie-build --color GENOME.fa GENOME

# get SOLiD reads from SRA if you don’t have them already in .csfasta
abi-dump SRR062662

# tophat splice-aware mapping in color-space
mkdir tophat
ref=GENOME  # basename of the color-space bowtie index built above
for f in READS_DIR/*.csfasta; do
  s=`echo $f | cut -f2 -d'/' | cut -f1 -d'.'`
  if [ ! -d tophat/$s ]; then
    echo `date` $f $s
    tophat -p 4 --no-coverage-search --color -o tophat/$s --quals $ref $f READS_DIR/${s}_QV.qual