Culprit-finding submit queue
Fri Nov 21 2025
I’ve been spending some time modeling submit queues recently. If you’ve worked on a large enough team, you know the pain: you merge a PR, go to grab coffee, and come back to find the build is broken. The “tree is red,” and nobody can merge until the culprit is found and reverted.
As the rate of changes increases, we usually start batching them together to save resources. But that introduces a new problem: when a batch fails, which change caused it?
The “Batch & Pray” Dilemma
The traditional approach is simple: group a bunch of changes, run the tests, and if they pass, merge them all. This works great when everything is green.
But when a test fails, you’re stuck. You know one of these ten changes broke the build, but you don’t know which one. The system usually has to reject the whole batch or stop everything to run a binary search (bisect) to find the bad commit. This kills throughput exactly when you need it most—when the queue is backing up.
Sparse Bernoulli Grouping
I started looking into “Group Testing” algorithms—the kind used in medical testing to screen large populations with few tests. The specific flavor that fits here is Sparse Bernoulli culprit finding.
The idea is to assign each change to multiple, overlapping small batches (minibatches) instead of one big one.
- Distribute: We assign each change to $k$ random minibatches.
- Test: We run all these minibatches in parallel.
- Score: If a change is bad, it poisons every batch it touches, so all of its batches fail. If a change is good, it might share a failing batch with a culprit, but it should also appear in passing batches (unless it was unlucky in every assignment).
In the simulation code I’ve been playing with (written in Go), I use a few heuristics to tune this:
- Dynamic Batch Count ($T$): The number of minibatches scales logarithmically with the queue size ($N$). Specifically, $T \approx 10 \log_{10} N$. This keeps the resource usage reasonable even as the queue grows.
- Replication Factor ($k$): We calculate how many batches each change should be in based on the expected number of culprits. The formula is roughly $k \approx T / (d + 1)$, where $d$ is the expected number of bad changes (usually around 2).
- Scoring: We calculate a “Suspicion Score” for each change: the fraction of that change’s minibatches that failed. If the score is above a threshold (I’m using 0.75), we reject the change.
There’s also a concept of Weighted Scoring. If we know certain tests are flaky, we can trust their failures less. In the Go implementation, if a batch fails, we weight that failure by the reliability of the test that failed. A 99% reliable test failing carries more weight than a 50% reliable test.
Interactive Simulation
Click on any change node to toggle it as a culprit. Hover over the changes in the “Sparse Bernoulli” panel below to see how they map to batches. Notice how culprits cause every batch they touch to fail.
Traditional: "Batch & Pray"
[A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P]
Logic: Combine all pending changes into one big test batch.
The Problem: If any change is bad, the entire batch fails.
Consequence: The system rejects everyone (innocent changes included) or halts to bisect.
New: Sparse Bernoulli Grouping
Logic: Assign each change to multiple random small batches.
The Magic: Culprits fail every batch they touch. Innocent changes might share a failing batch with a culprit, but they also exist in passing batches (unless unlucky).
It’s a probabilistic approach, so it’s not magic, but the math suggests we can identify culprits without halting the entire line.