What fraction of the genome is expressed?

Recently, I was interested to find out what fraction of genome is expressed. This can be computed quite easily from RNA-Seq alignments using bedtools and some simple scripting. Note, it’s crucial to use -split parameter in bedtools genomecov to report correct coverage from split alignments. Otherwise, the introns that are spliced-out will be treated as expressed regions.

[bash]
# get genomic regions with at least n reads aligned
n=3
ref=genome
for f in *.bam; do
echo `date` $f;
# get genomic intervals covered by n reads
bedtools genomecov -split -ibam $f -g $ref.fa.fai -dz | awk -F’t’ ‘BEGIN {OFS = FS} {if ($3>=’$n’) {print $1,$2,$2+1}}’ | bedtools merge > $f.${n}reads.bed;
# report bp covered by n reads
awk ‘{sum+=$3-$2} END {print sum}’ $f.${n}reads.bed
done; date
[/bash]

How stranded is stranded RNA-Seq protocol?

After identifying some antisense reads in stranded RNA-Seq experiments, I got interested how stranded is stranded RNA-Seq. Here, I’ll describe my methodology in more details.

  1. Quantify reads for all genes / exons
  2. Initially, I’ve been using bedtools2 for this task, but due to its limitations (bedtools coverage needed to be run twice for sense & antisense, while bedtools multicov was terribly slow for large .gtf files) I have written my own tool for coverage calculation (bam2cov.py & bam2cov_pool.py).
    [bash]
    mkdir stranded
    for f in *.bam; do
    s=`echo $f | cut -f1 -d’.’`
    if [ ! -s stranded/$s.gtf.gz ]; then
    echo `date` $s
    bam2cov.py -i $s.bam -b REF.gtf | gzip > stranded/$s.gtf.gz
    fi
    done; date
    [/bash]

  3. Calculate sense-strand enrichment
  4. Here, I start from selecting interesting features from .gtf.gz and saving these in tmp.bed. Subsequently, I scan tmp.bed for overlapping features on antisense and print only those with no overlap (bedtools intersect -S -v -wa). Finally, sense-strand enrichment is calculated and reported.
    This is done for all features (`.`), genes (`gene`) and exons (`exon`).
    [bash]
    cd stranded
    for cat in . gene exon; do
    echo `date` category $cat;
    for f in *.gtf.gz; do
    # select interesting features and save to bed
    zgrep -P "t${cat}t" $f | awk -F’t’ -v OFS=’t’ ‘{print $1,$4,$5,$10,$11,$7}’> tmp.bed;
    # select features with no overlap on antisense & calculate enrichment
    echo -e "${f}t"`bedtools intersect -S -v -wa -a tmp.bed -b tmp.bed |
    awk -F’t’ -v OFS=’t’ ‘{sum1+=$4; sum2+=$5} END {print sum1,sum2,sum1/sum2}’`;
    done;
    done;
    rm tmp.bed; date
    [/bash]

Summarising, I have found highly variable sense-strand read enrichment (6.39-27.05; median: 9.85) in our RNA-Seq stranded data (6 time points with 3 RNA fractions in colours). I have obtained slightly lower enrichment for genes (5.98-24.07; median: 8.74), probably due to some antisense genes / exons not annotated.
Overall, red RNA fraction tends to have higher sense-strand read enrichment that blue/orange. The question `why there is such high variability in enrichment between time points / RNA fractions` remains to be addressed…

Visit Biostars for discussion.