What fraction of the genome is expressed?

Recently, I was interested to find out what fraction of genome is expressed. This can be computed quite easily from RNA-Seq alignments using bedtools and some simple scripting. Note, it’s crucial to use -split parameter in bedtools genomecov to report correct coverage from split alignments. Otherwise, the introns that are spliced-out will be treated as expressed regions.

[bash]
# get genomic regions with at least n reads aligned
n=3
ref=genome
for f in *.bam; do
echo `date` $f;
# get genomic intervals covered by n reads
bedtools genomecov -split -ibam $f -g $ref.fa.fai -dz | awk -F’t’ ‘BEGIN {OFS = FS} {if ($3>=’$n’) {print $1,$2,$2+1}}’ | bedtools merge > $f.${n}reads.bed;
# report bp covered by n reads
awk ‘{sum+=$3-$2} END {print sum}’ $f.${n}reads.bed
done; date
[/bash]

Detecting isoform switching from RNA-Seq

Alternative splicing produces an array of transcripts from individual gene. Some of these alternative transcripts are expressed in tissue-specific fashion and may play different roles in the cell.

Recently, I got interested in identifying isoform switching from RNA-Seq data, this is detecting changes in the most expressed isoform between conditions. As I couldn’t find any ready solution, I have written my own script for this task (fpkm2major.py). This script simply parse isoforms.fpkm_tracking output from cufflinks and report genes with evidence of isoform switching into tab-delimited output file.

For each gene, number of transcripts, cumulative expression, major isoforms for each condition and each major isoform are reported. Lowly expressed genes in given condition are coded with `0` (--minFPKM < 1), while genes without clear major isoform are marked with `-1` (--frac 0.25).

[bash]
fpkm2major.py -v -i sample1/isoforms.fpkm_tracking -o major_isoforms.txt

# combine data from multiple samples
paste sample1/isoforms.fpkm_tracking <(cut -f10- sample2/isoforms.fpkm_tracking)
| fpkm2major.py -v -o major_isoforms2.txt

# visualise using cummeRbund
R
library(cummeRbund)
cuff <- readCufflinks("sample1")

# expression plot for isoforms of given gene
geneid="ENSDARG00000013615"
gene<-getGene(cuff, geneid)
expressionPlot(isoforms(gene), showErrorbars=T)
dev.copy(svg, paste(geneid,".sample1.svg"), width=12, height=12); dev.off()

[/bash]