Fast interval selection in Python

Lately, I’ve been working on a program that computes depth of coverage from BAM file given some genome intervals i.e. genes or exons ( The performance of this task is heavily affected by the time needed for selecting interesting intervals.

At first, I have stored intervals as a list of tuples and I’ve used filter function to select intervals of interest, but the performance of this implementation wasn’t satisfactory (1m29s for test set).

# populate intervals
c2i[chrom] = [(start, end, strand, name), ]

# select intervals
ivals = filter(lambda x: s<x[0]<e or s<x[1]<e, c2i[chrom])

Later, I’ve used numpy.array, which turned out to be much faster (0m07s). Noteworthy, numpy implementation also uses less memory as object are store more efficiently in array (13 bytes per interval; 3x 4b for ‘uint32′ + 1b for ‘bool_’) than in list of tuples (88 bytes per interval; sys.getsizeof((100,1000,1,100))).

import numpy as np
dtype = np.dtype({‘names’: [‘start’, ‘end’, ‘strand’, ‘entry_id’],
‘formats’: [‘uint32’, ‘uint32’, ‘bool_’, ‘uint32’]})
# populate intervals
chr2intervals[chrom] = np.array(data, dtype=dtype)

# select intervals
ivals = c2i[chrom][np.any([np.all([ c2i[chrom][‘start’]>=s, c2i[chrom][‘start’]<=e ], axis=0),
np.all([ c2i[chrom][‘end’] >=s, c2i[chrom][‘end’] <=e ], axis=0),
np.all([ c2i[chrom][‘start’]< s, c2i[chrom][‘end’] > e ], axis=0)], axis=0)]

Benchmarking code:
import numpy as np
from datetime import datetime
from bam2cov import load_intervals

# load data
chr2intervals, entries = load_intervals(‘stranded/DANRE.gtf’, 0)

# define chromosome name and length
chrom, chrlen = ‘1’, 60348388

# test
sample = random.sample(xrange(0, chrlen), 10000)
counts = np.zeros(entries, dtype=’uint’)
t0 =
for ii,s in enumerate(sample, 1):
if not ii%1000:
print ii
e = s+100
# filter from list – uncomment if needed
#selected = filter(lambda x: s<x[0]<e or s<x[1]<e, c2i[chrom])
# select from numpy.array
selected = c2i[chrom][/c][np.any([np.all([s<c2i[chrom][/c][‘start’], c2i[chrom][/c][‘start’]<e], axis=0), np.all([s<c2i[chrom][/c][‘stop’], c2i[chrom][/c][‘stop’]<e], axis=0)], axis=0)]
for s, e, strand, i in selected:
counts[i] += 1
print dt, counts.sum()


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s