This week, on a whim, I helped another lab with some analysis work and learned a bit about algorithms in the process. It turned out that the original Perl program ran quite slowly, so I thought about rewriting it in Python and then optimizing it further.
## Algorithm Principle
Techniques like DamID-seq and ChIP-seq essentially record a form of “affinity,” so the data contains a lot of nonspecific affinity noise. We therefore need an algorithm that finds specific affinity peaks while avoiding mistaking noise generated by chance for genuine peaks.
The implementation idea of the find_peaks algorithm is actually quite simple, roughly divided into the following steps:
- Load data: The damidseq_pipeline aligns raw sequencing results to a reference genome and generates a `score` reflecting affinity. Each record also includes `chr`, `start`, and `end` fields, corresponding to the chromosome and the alignment start and end positions.
- Calculate percentiles: Sort the non-empty `score` values, then calculate the `score` value corresponding to each specified percentile (determined by the starting percentile `min_quant` and step size `step`).
- Calculate occurrence probability: Shuffle the records, then filter them against each percentile threshold from step 2. If `min_count` or more consecutive records meet the threshold, it is counted as a peak. Repeat `N` times, cumulatively recording the size and count of the peaks found for each threshold.
- Linear regression: For such calculations on randomly shuffled data, we can assume that values exceeding the threshold are uniformly distributed across the entire dataset, so the probability of encountering another qualifying value right after one is constant. If the average number of observed peaks of length $n$ is $C_n$, then we should have $C_n \approx C_0 \cdot p^n$, where $p$ is a constant reflecting the density of values above the threshold. The relationship between observed peak count and peak size should therefore follow an exponential decay, and taking the logarithm of the observed counts, $\log C_n \approx \log C_0 + n \log p$, allows us to fit a linear relationship (a minimal sketch of this fit and the FDR arithmetic follows the list).
- Count real occurrence frequency: Similar to step 3, but here we do not shuffle the records. We respect their chromosome assignment and sequential order, counting the relationship between peak size and count for each threshold.
- Calculate False Discovery Rate (FDR): The fit from step 4 gives one line per percentile threshold. Using this result, we can predict the number of peaks of a given size expected from a purely random distribution at each threshold. Comparing this with the real observed number yields the FDR for observing a peak of that length at that threshold. By controlling the FDR, we can determine the minimum length required for a peak to be considered significant under each threshold condition.
- Identify significant peaks: Filter out the significant peaks based on the results from the previous step.
- Merge significant peaks: Merge significant peaks within the same chromosome (often duplicates from being significant under different thresholds), taking the maximum `score` and minimum `FDR` for overlapping peaks, and write the results to a file.
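To make the regression and FDR steps concrete, here is a minimal sketch of the arithmetic. The helper names (`fit_log_linear`, `expected_random_peaks`, `fdr`) and the dict-of-average-counts input are illustrative assumptions of mine, not the functions actually used in the program:

```python
import math

def fit_log_linear(avg_counts: dict[int, float]) -> tuple[float, float]:
    # Least-squares fit of log(count) against peak size:
    # log(C_n) ~ intercept + slope * n
    sizes = sorted(n for n, c in avg_counts.items() if c > 0)
    xs = [float(n) for n in sizes]
    ys = [math.log(avg_counts[n]) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def expected_random_peaks(size: int, slope: float, intercept: float) -> float:
    # Expected number of peaks of this size if scores were randomly shuffled
    return math.exp(intercept + slope * size)

def fdr(observed: int, expected: float) -> float:
    # Fraction of observed peaks of this size that randomness alone would explain
    return min(expected / observed, 1.0) if observed else 1.0
```

With one `(slope, intercept)` pair fitted per threshold, a real peak of a given length is kept once its `fdr` drops below the chosen cutoff.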
## Replicating the Algorithm
With the above logic, writing the program was quick. To allow the program to run with minimal dependencies, we implemented all functionality using only Python’s built-in libraries. After writing this version, I tested the efficiency of the Perl and Python programs with a small N, with results roughly as follows:
|            | Perl | Python |
|------------|------|--------|
| Time (s)   | 240  | ~210   |
## Optimizing the Algorithm
Since the improvement wasn’t significant, we were certainly not satisfied. Running a profile with the amazing scalene revealed that a lot of time was spent checking whether `score` was 0. This logic was copied straight from the Perl implementation: for every record, whether calculating percentile thresholds or counting occurrence frequencies, the code first checked whether its `score` was 0 and ignored the record if so.
Thus, I tried using Python’s built-in `filter` function to filter the records, avoiding the `if` branch check (built-in implementations are generally faster):
```diff
 def call_peaks_unified_redux(...):
     ...
-    total = len(probes)
-    for i in range(total):
-        chrom, start, end, score = probes[i]
-        if not score:
-            continue
+    for chrom, start, end, score in filter(lambda x: x[3], probes):
         ...
```

Testing again showed good results, with a noticeable speed increase.
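As a side note, if you want to check how much the `filter` form actually buys you on your own machine, a throwaway `timeit` comparison is enough. The records below are fabricated in the same `(chrom, start, end, score)` shape purely for illustration:

```python
import timeit

# Fabricated records in the same (chrom, start, end, score) shape;
# sizes and values are made up for illustration only.
probes = [("chr1", i * 75, i * 75 + 75, float(i % 4)) for i in range(100_000)]

def with_branch():
    kept = []
    for chrom, start, end, score in probes:
        if not score:
            continue
        kept.append((chrom, start, end, score))
    return kept

def with_filter():
    return list(filter(lambda x: x[3], probes))

print("if-branch:", timeit.timeit(with_branch, number=50))
print("filter:   ", timeit.timeit(with_filter, number=50))
```

The gap depends on how much work the loop body does, so it is worth measuring rather than assuming.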
|            | Perl | Python |
|------------|------|--------|
| Time (s)   | 240  | ~90    |
But! I was still not satisfied. This still took over a minute to run, and that was with a reduced iteration count. In practice, a larger N ensures that random sampling covers more possibilities, making the linear fit more accurate. So, we could think about further optimization.
Analyzing the code carefully, we can see:
- `load_gff`: When loading the file, records with a `score` of `0` or `NA` are also stored in the result array.
- `find_quant`: When it encounters a record with `score` `0`, it skips it, which doesn’t affect the result.
- `find_randomised_peaks`: When randomly shuffling records, if only a subset is specified, that subset might include records with `score` `0`.
- `call_peaks_unified_redux`: When it encounters a record with `score` `0`, it skips it, which doesn’t affect the result.
Therefore, a large portion of the loaded data consists of records with `score` 0, but we don’t need to consider them in our calculations. Their only impact is in `find_randomised_peaks` when taking a subset. So, if we handle that part of the logic properly, we can filter out these records upon loading, reducing all subsequent computation.
First, rewrite the `load_gff` function to ignore empty records directly. To keep calculated metrics like coverage consistent, we move some logic from later functions here:
```python
def load_gff(fn: str) -> list[PROBE]:
    global args, RAW_READS_NUM
    line_num = 0
    total_coverage = 0
    parsed_result: list[PROBE] = list()
    sys.stderr.write(f"Reading input file: {fn} ...\n")
    with open(fn, "r") as f:
        for line in f:
            line_num += 1
            if line_num % 10000 == 0:
                sys.stderr.write(f"Read {line_num} lines\r")
            ll = line.strip().split("\t")
            # skip empty lines
            if len(ll) < 4:
                continue
            if len(ll) == 4:
                # bedgraph
                chrom, start, end, score = ll
            else:
                # GFF
                chrom = ll[0]
                start, end, score = ll[3:6]
            # increment raw reads number
            RAW_READS_NUM += 1
            # skip empty reads
            if not args.no_discard_zeros and (score == "NA" or not float(score)):
                continue
            # record read
            parsed_result.append(
                (
                    CHROM(chrom),
                    START(POS(int(start))),
                    END(POS(int(end))),
                    SCORE(float(score) if score != "NA" else 0),
                )
            )
            # record total coverage
            total_coverage += int(end) - int(start)
    sys.stderr.write(f"Read {line_num} lines\n")
    sys.stderr.write("Sorting ...\n")
    parsed_result = sorted(parsed_result, key=lambda x: (x[0], x[1]))
    sys.stderr.write(f"Total coverage was {total_coverage} bp\n")
    return parsed_result
```

Then, in the `find_quant` function, no extra checks are needed:
```diff
 def find_quant(probes: list[PROBE]):
     global args
-    total_coverage = 0
-    frags: list[SCORE] = list()
-    for (chrom, start, end, score) in filter(lambda x: x[3], probes):
-        total_coverage += end - start
-        frags.append(score)
+    frags = [x[3] for x in probes]
     frags = sorted(frags)
-    sys.stderr.write(f"Total coverage was {total_coverage} bp\n")
     quants = [
         (q * args.step, int(q * args.step * len(frags)) - 1)
         for q in range(math.ceil(args.min_quant / args.step), math.ceil(1 / args.step))
     ]
     for cut_off, score_idx in quants:
         sys.stdout.write(f"\tQuantile {cut_off:0.2f}: {frags[score_idx]:0.2f}\n")
     peakmins = [THRESH(frags[score_idx]) for (_, score_idx) in quants]
     return peakmins
```

The same applies to `call_peaks_unified_redux`:
```diff
 def call_peaks_unified_redux(...):
     ...
     for pm in peakmins:
         ...
-        for chrom, start, end, score in filter(lambda x: x[3], probes):
+        for chrom, start, end, score in probes:
             ...
```

`find_randomised_peaks` is a bit more complex, but we can still do no worse than the original. In the original program, if `frac` was specified, only the fixed first `frac` records (where `frac` was an integer count) were used to estimate the expected frequency. This meant the proportion of empty records among those first `frac` records could be greater or less than their overall proportion, and that bias would appear in every shuffle. So we adopt a more general approach:
- Let `frac` be a float in (0, 1], representing the proportion of records we need.
- When generating the dataset, simulate drawing `RAW_READS_NUM * frac` records from the total `RAW_READS_NUM` raw records (empty ones included), and count the number `n` of non-empty records that fall inside that sample.
- Use `n` as the sample size when sampling from the non-empty records. The resulting subset is used for the expectation estimation.
```diff
 def find_randomised_peaks(probes: list[PROBE], peakmins: list[THRESH]):
     global args, RAW_READS_NUM
     peak_count = None
     sys.stdout.write("Duplicating ...\n")
     pbs = probes.copy()
     sys.stdout.write("Calling peaks on input file ...\n")
     for iter_num in range(args.n):
         sys.stdout.write(f"Iteration {iter_num+1}: [shuffling ...] \r")
+        # This is a naive approximation to randomly sample a fraction
+        # as the full sequence doesn't contain empty reads
+        # (but no worse than the original approach anyway)
         if args.frac:
-            pbs = pbs[: args.frac]
+            num_to_sample = sum(
+                map(
+                    # This makes sure that it works for both
+                    # data including and excluding empty reads
+                    lambda x: x <= int(len(probes) * args.frac),
+                    random.sample(range(RAW_READS_NUM), int(RAW_READS_NUM * args.frac)),
+                )
+            )
+            pbs = random.sample(probes, num_to_sample)
         # The built-in shuffle uses the same algorithm (Fisher-Yates)
         # as the original Perl program
         random.shuffle(pbs)
         _, peak_count = call_peaks_unified_redux(
             iter_num, pbs, peakmins, None, peak_count
         )
     return peak_count
```

Other minor adjustments were made to align with these changes, but they are straightforward and won’t be listed one by one.
We tested again, and the results were amazing!

Running with the default parameters (`N`, `fdr`, `min_count`, `min_quant`, `step`), the original Perl program took on the order of minutes, while the current Python program needs only about ten seconds. The improvement is very perceptible! The code has been open-sourced under the GPLv3 license (repository link).