Wicked-fast transcript quantification with Salmon

Traditionally, in order to quantify transcript abundance from RNA-Seq, one has to first align the reads onto the reference and analyse these alignments. While widely accepted, this approach has several disadvantages:

  • alignment step is slow, especially in splice-aware mode
  • spliced alignments are error-prone
  • huge intermediate files are produced (.bam)
  • transcript quantification from these intermediate files is also slow

At the RECOMB2015, Rob Patro presented two algorithms enabling rapid transcript quantifications from RNA-Seq. Sailfish is alignment-free method, while its successor, Salmon, perform light-weight alignment, identifying just the super maximal exact matches (SMEMs).

On my data, Salmon is ~3 times faster (6-7min) and uses ~20 times less memory (1.1GB) than STAR (~20min / ~24GB). Note, STAR results (.bam) need to be further analysed (ie. cufflinks) in order to quantify transcripts abundances. This step takes another 10min, thus we get transcript abundances after 6-7 min from Salmon and 30min from STAR + cufflinks. Just for comparison, similar analysis done with tophat2 takes above 6 hours!

  1. Install dependencies & salmon
  2. [bash]
    # install dependencies
    ## here it’s important to install boost1.55 & remove older version before
    sudo apt-get remove libboost-all-dev
    sudo apt-get install libbz2-dev libtbb-dev libboost1.55-all-dev

    # clone salmon repo
    git clone git@github.com:COMBINE-lab/salmon.git

    # and build
    cd salmon
    cmake -DBOOST_ROOT=/usr/include/boost -DTBB_INSTALL_DIR=/usr/include/tbb
    make && make install && make test

    # add to .bashrc
    echo "# salmon" >> ~/.bashrc
    echo "export PATH=$PATH:"`pwd`"/bin" >> ~/.bashrc

    # open new BASH window (Ctrl + Shift + T) or reload environmental variables
    source ~/.bashrc
    [/bash]

  3. Index transcriptome
  4. [bash]salmon index -t transcripts.fa -i transcripts.index[/bash]

  5. Quantify transcript abundances
  6. You have to specify library correctly (-l).
    [bash]salmon quant -p 4 -l SF -i ref/transcripts.index -r <(zcat sample1.fq.gz) -o sample1[/bash]

Note, here the reads are decompressed using process substitution – this is very handy way of providing the preprocess data as input through Unix pipes.

Detecting isoform switching from RNA-Seq

Alternative splicing produces an array of transcripts from individual gene. Some of these alternative transcripts are expressed in tissue-specific fashion and may play different roles in the cell.

Recently, I got interested in identifying isoform switching from RNA-Seq data, this is detecting changes in the most expressed isoform between conditions. As I couldn’t find any ready solution, I have written my own script for this task (fpkm2major.py). This script simply parse isoforms.fpkm_tracking output from cufflinks and report genes with evidence of isoform switching into tab-delimited output file.

For each gene, number of transcripts, cumulative expression, major isoforms for each condition and each major isoform are reported. Lowly expressed genes in given condition are coded with `0` (--minFPKM < 1), while genes without clear major isoform are marked with `-1` (--frac 0.25).

[bash]
fpkm2major.py -v -i sample1/isoforms.fpkm_tracking -o major_isoforms.txt

# combine data from multiple samples
paste sample1/isoforms.fpkm_tracking <(cut -f10- sample2/isoforms.fpkm_tracking)
| fpkm2major.py -v -o major_isoforms2.txt

# visualise using cummeRbund
R
library(cummeRbund)
cuff <- readCufflinks("sample1")

# expression plot for isoforms of given gene
geneid="ENSDARG00000013615"
gene<-getGene(cuff, geneid)
expressionPlot(isoforms(gene), showErrorbars=T)
dev.copy(svg, paste(geneid,".sample1.svg"), width=12, height=12); dev.off()

[/bash]