Using docker for application development

I found Docker super useful, but going through a manual is quite time consuming. Here, very stripped manual to create your first image and push it online 🙂

[bash]
# install docker
wget -qO- https://get.docker.com/ | sh

# add your user to docker group
sudo usermod -aG docker $USER

# check if it’s working
docker run docker/whalesay cowsay "hello world!"

# create an account on https://hub.docker.com
# and login
docker login -u $USER –email=EMAIL

# run image
docker run -it ubuntu

# make some changes ie. create user, install needed software etc

# finally open new terminal & commit changes (SESSIONID=HOSTNAME)
docker commit SESSIONID $USER/image:version

# mount local directory `pwd`/test as /test in read/write mode
docker run -it -v `pwd`/test:/test:rw $USER/image:version some command with arguments

# push image
docker push $USER/image:version
[/bash]

From now, you can get your image from any other machine connected to Internet by executing:
[bash]
docker run -it $USER/image:version
# ie. redundans image
docker run -it -w /root/src/redundans lpryszcz/redundans:v0.11b ./redundans.py -v -i test/{600,5000}_{1,2}.fq.gz -f test/contigs.fa -o test/run1

# you can create alias latest, then version can be skipped on running
docker tag lpryszcz/redundans:v0.11b lpryszcz/redundans:latest
docker push lpryszcz/redundans:latest

docker run -it lpryszcz/redundans
[/bash]

You can add info about your repository at https://hub.docker.com/r/$USER/image/

Advertisements

pysam installation error on Ubuntu 14.04

If you encounter error during pysam installation:
[bash]
sudo easy_install -U pysam

pysam/csamtools.c:8:22: fatal error: pyconfig.h: No such file or directory
#include "pyconfig.h"
^
compilation terminated.
error: Setup script exited with error: command ‘x86_64-linux-gnu-gcc’ failed with exit status 1
[/bash]

try installing python-dev first.
[bash]
sudo apt-get install python-dev
[/bash]

Solution found on github.

Detection of RNA editing

I’ve been asked to help analysing data from RNA editing experiment. Typically, such experiments consists of DNA and RNA sequencing. The purpose of these is to identify the differences between DNA template and resulting RNA product.
The canonical RNA editing enzyme, ADAR, deaminase adenine to inosine (A-to-I editing). Inosine is subsequently detected as guanine by the sequencer. Thus the identification of RNA editing can be simplified down to genotyping of DNA and RNA samples and finding differences between these two.
Importantly, RNA editing can be promiscuous, meaning multiple As are changed into Is in given region. This makes an alignment challenging. In addition, you may ignore most of edited sites due to quality filtering (many SNP clustered together), if you genotype your libraries with standard tools/pipelines ie. GATK. Finally, RNAseq is more error-prone than DNAseq, due to low fidelity of reverse-transcriptase, so you may want to account for that as well.
Ideally, you want stranded RNAseq, then only A>G editing will be observed. But in reality, you may analyse also unstranded RNAseq libraries, then you expect both A>G and T>C edited sites, from transcripts of genes on Watson (+) and Crick (-) strand, respectively.

DNA/RNAseq alignment

For the alignment part, I’ve been using BWA MEM (DNAseq) and STAR (RNAseq). If you use STAR, you should reassign mapping qualities (mapQ). It’s also recommended to split reads at N CIGAR. These two can be accomplished with GATK:

[bash]
java -jar ~/src/GATK/GenomeAnalysisTK.jar -T SplitNCigarReads -R REF.fa -I SAMPLE.bam -o SAMPLE.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS > SAMPLE.split.log 2>&1 &
[/bash]

Finally, you may want to realign both DNAseq and RNAseq around indels using GATK IndelRealigner.

RNA editing identification from BAM files

I’ve written program, bam2RNAediting.py (currently REDiscover), to detect RNA editing in a set of DNAseq and RNAseq experiments. It’s in early developmental phase, but it’s functional already. It can be run simply by:
[bash]
bam2RNAediting.py -d bwamem/DNAseq*.bam -r star/RNAseq*.bam > bam2RNAediting.txt
[/bash]

Quality control

Simple quality control for RNA editing experiment could be looking at the enrichment of A>G (and T>C) edited sites over other changes. If RNA editing is present in your sample you would expect to find quite strong enrichment of A>G (and T>C) changes between DNAseq and RNAseq. This is indeed what I’ve found in my samples.

Let me know how it works for you. I’m looking for your feedback!

What fraction of the genome is expressed?

Recently, I was interested to find out what fraction of genome is expressed. This can be computed quite easily from RNA-Seq alignments using bedtools and some simple scripting. Note, it’s crucial to use -split parameter in bedtools genomecov to report correct coverage from split alignments. Otherwise, the introns that are spliced-out will be treated as expressed regions.

[bash]
# get genomic regions with at least n reads aligned
n=3
ref=genome
for f in *.bam; do
echo `date` $f;
# get genomic intervals covered by n reads
bedtools genomecov -split -ibam $f -g $ref.fa.fai -dz | awk -F’t’ ‘BEGIN {OFS = FS} {if ($3>=’$n’) {print $1,$2,$2+1}}’ | bedtools merge > $f.${n}reads.bed;
# report bp covered by n reads
awk ‘{sum+=$3-$2} END {print sum}’ $f.${n}reads.bed
done; date
[/bash]

Visualisation of repeats alongside NGS data

In one of the previous projects we were interested to look for association between deletions and repeats. At that time, I’ve written small program, repeats2sam.py, identifying perfect direct and inverted repeats in chromosomes. repeats2sam.py uses repeat-match from great MUMmer package and reports SAM-formatted repeats. This can be then easily converted to BAM. It takes several minutes for small genomes (20Mb) and several hours for large genomes (~8 hours for 1.5Gb).
[bash]
# identify direct & inverted repeats
ref=human
repeats2sam.py -v –inverted -i $ref.fa | samtools view -SbuT $ref.fa – | samtools sort – $ref.repeats
samtools index $ref.repeats.bam
[/bash]

Repeats are stored as paired-end reads, so they can be easily visualised alongside any NGS data in IGV:

  • direct & inverted repeats
  • inverted repeats + RNAseq
  • inverted repeats + RNAseq zommed in

If you are interested in DNA direct repeats, either skip –inverted parameter, or select only direct repeats from your BAM file. In addition, you may want to select only repeats fulfilling certain length and distance criteria.
[bash]
# select only inverted repeats; at least 21bp long and distant by less than 5000bp
i=21; samtools view $ref.repeats.inverted.bam | awk ‘$8<5000 && length($10)>=’$i | samtools view -SbuT $ref.fa – | samtools sort – $ref.repeats.inverted.${i}bp && samtools index $ref.repeats.inverted.${i}bp.bam
# select only direct repeats; at least 21bp long and distant by less than 5000bp
i=21; samtools view $ref.repeats.direct.bam | awk ‘$8<5000 && length($10)>=’$i | samtools view -SbuT $ref.fa – | samtools sort – $ref.repeats.direct.${i}bp && samtools index $ref.repeats.direct.${i}bp.bam
[/bash]

All above mentioned programs can be found in github.

Wicked-fast transcript quantification with Salmon

Traditionally, in order to quantify transcript abundance from RNA-Seq, one has to first align the reads onto the reference and analyse these alignments. While widely accepted, this approach has several disadvantages:

  • alignment step is slow, especially in splice-aware mode
  • spliced alignments are error-prone
  • huge intermediate files are produced (.bam)
  • transcript quantification from these intermediate files is also slow

At the RECOMB2015, Rob Patro presented two algorithms enabling rapid transcript quantifications from RNA-Seq. Sailfish is alignment-free method, while its successor, Salmon, perform light-weight alignment, identifying just the super maximal exact matches (SMEMs).

On my data, Salmon is ~3 times faster (6-7min) and uses ~20 times less memory (1.1GB) than STAR (~20min / ~24GB). Note, STAR results (.bam) need to be further analysed (ie. cufflinks) in order to quantify transcripts abundances. This step takes another 10min, thus we get transcript abundances after 6-7 min from Salmon and 30min from STAR + cufflinks. Just for comparison, similar analysis done with tophat2 takes above 6 hours!

  1. Install dependencies & salmon
  2. [bash]
    # install dependencies
    ## here it’s important to install boost1.55 & remove older version before
    sudo apt-get remove libboost-all-dev
    sudo apt-get install libbz2-dev libtbb-dev libboost1.55-all-dev

    # clone salmon repo
    git clone git@github.com:COMBINE-lab/salmon.git

    # and build
    cd salmon
    cmake -DBOOST_ROOT=/usr/include/boost -DTBB_INSTALL_DIR=/usr/include/tbb
    make && make install && make test

    # add to .bashrc
    echo "# salmon" >> ~/.bashrc
    echo "export PATH=$PATH:"`pwd`"/bin" >> ~/.bashrc

    # open new BASH window (Ctrl + Shift + T) or reload environmental variables
    source ~/.bashrc
    [/bash]

  3. Index transcriptome
  4. [bash]salmon index -t transcripts.fa -i transcripts.index[/bash]

  5. Quantify transcript abundances
  6. You have to specify library correctly (-l).
    [bash]salmon quant -p 4 -l SF -i ref/transcripts.index -r <(zcat sample1.fq.gz) -o sample1[/bash]

Note, here the reads are decompressed using process substitution – this is very handy way of providing the preprocess data as input through Unix pipes.

Detecting isoform switching from RNA-Seq

Alternative splicing produces an array of transcripts from individual gene. Some of these alternative transcripts are expressed in tissue-specific fashion and may play different roles in the cell.

Recently, I got interested in identifying isoform switching from RNA-Seq data, this is detecting changes in the most expressed isoform between conditions. As I couldn’t find any ready solution, I have written my own script for this task (fpkm2major.py). This script simply parse isoforms.fpkm_tracking output from cufflinks and report genes with evidence of isoform switching into tab-delimited output file.

For each gene, number of transcripts, cumulative expression, major isoforms for each condition and each major isoform are reported. Lowly expressed genes in given condition are coded with `0` (--minFPKM < 1), while genes without clear major isoform are marked with `-1` (--frac 0.25).

[bash]
fpkm2major.py -v -i sample1/isoforms.fpkm_tracking -o major_isoforms.txt

# combine data from multiple samples
paste sample1/isoforms.fpkm_tracking <(cut -f10- sample2/isoforms.fpkm_tracking)
| fpkm2major.py -v -o major_isoforms2.txt

# visualise using cummeRbund
R
library(cummeRbund)
cuff <- readCufflinks("sample1")

# expression plot for isoforms of given gene
geneid="ENSDARG00000013615"
gene<-getGene(cuff, geneid)
expressionPlot(isoforms(gene), showErrorbars=T)
dev.copy(svg, paste(geneid,".sample1.svg"), width=12, height=12); dev.off()

[/bash]