I’ve been implementing belief propagation (BP) for quantum error correction on a GPU, using CUDA to accelerate the decoding process. My setup involves running BP independently for each shot, where I allocate (cudaMalloc), transfer (cudaMemcpy), execute, and then free (cudaFree) memory separately per shot. Initially, I expected a mostly linear scaling in computation time as the number of shots increased. However, profiling revealed an inverse exponential trend—as shots increased, the per-shot GPU computation time dropped significantly before leveling off. This was surprising because, given my memory management approach, I assumed each shot would execute in isolation. I suspect that kernel execution is still leveraging some level of concurrency across streaming multiprocessors, or that CUDA’s runtime optimizations are overlapping operations in a way I hadn’t anticipated.
For context, 1 SHOT is an entire sequence of error correction including:
1- loading memory to the device
2- performing 10 iterations (ROUNDS) of serial kernel execution
3- loading memory back to the host
The 1st column in the table represents the time for 1 ROUND, i.e.
total_time / (N_rounds * N_shots)
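Roughly, each shot is structured like the sketch below (identifiers are placeholders, and the stand-in kernel body only marks where my actual BP update runs):

```
#include <cuda_runtime.h>

// Stand-in for one BP message-passing iteration; the real update is not shown here.
__global__ void bp_iteration(const float* in, float* out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // placeholder for the actual BP update
}

// One SHOT: allocate, copy in, run the rounds serially, copy out, free.
void run_shot(const float* h_in, float* h_out, size_t n, int n_rounds)
{
    float *d_in = nullptr, *d_out = nullptr;
    unsigned int blocks = (unsigned int)((n + 255) / 256);

    // 1- loading memory to the device
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // 2- performing 10 iterations (ROUNDS) of serial kernel execution
    for (int r = 0; r < n_rounds; ++r)
        bp_iteration<<<blocks, 256>>>(d_in, d_out, n);

    // 3- loading memory back to the host
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
}
```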
Does anyone have any ideas about how and why this is happening?
How quickly did the performance ramp up, e.g. how many iterations? How big a performance difference are we talking about?
“Warm-up” of GPU-internal hardware resources such as caches, TLBs, memory controllers, etc., should reach steady state after 5 to 10 iterations. Depending on kernel duration, dynamic clock boosting might take more iterations to reach steady state (on actively cooled cards I have seen the clock boosting process, working in conjunction with fan speed adjustments, continue for up to a couple of minutes).
That seems quite suboptimal from a performance perspective. Allocating and freeing memory are comparatively slow activities even on fast host systems. Ideally you would re-use allocated memory blocks as many times as you can. If the allocation sizes differ, are the differences small enough that you can allocate a block of the maximum expected size once, and then keep re-using that indefinitely?
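Something along these lines, as a rough sketch (buffer names and the stand-in kernel are placeholders, not your actual code):

```
#include <cuda_runtime.h>

// Placeholder for the BP kernel, just to make the sketch self-contained.
__global__ void bp_iteration(const float* in, float* out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Allocate once at the maximum expected size, re-use the same device buffers
// for every shot, and free only after the last shot.
void run_all_shots(const float* h_in, float* h_out,
                   size_t max_n, size_t n, int n_shots, int n_rounds)
{
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  max_n * sizeof(float));   // one-time allocation
    cudaMalloc(&d_out, max_n * sizeof(float));

    unsigned int blocks = (unsigned int)((n + 255) / 256);
    for (int s = 0; s < n_shots; ++s) {
        cudaMemcpy(d_in, h_in + (size_t)s * n, n * sizeof(float),
                   cudaMemcpyHostToDevice);
        for (int r = 0; r < n_rounds; ++r)
            bp_iteration<<<blocks, 256>>>(d_in, d_out, n);
        cudaMemcpy(h_out + (size_t)s * n, d_out, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }

    cudaFree(d_in);    // freed once, after all shots
    cudaFree(d_out);
}
```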
I added an image to my initial post, which might shed some light on the quantitative data I’m getting.
I completely agree with you that it is not the best implementation when running multiple rounds… but then again, my question still remains: why does computation time decrease when I compute multiple iterations serially?
I am afraid I have no clue what a “shot” is. Is this technical jargon used in a particular field? A translation from another language into English? How is “shot” different from “round”?
In any event, the difference in the numbers above is way too big to be explained by the factors I enumerated earlier.
I could imagine that the observation has something to do with how well the GPU is utilized. I don’t see information as to what GPU this is running on, but assuming it’s a high-end GPU with 10K to 20K “CUDA cores”, and assuming N_shots is strongly correlated with the available parallelism in the workload, we would expect full performance to be reached once all the “CUDA cores” are actively processing data in parallel.
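Purely for illustration (the kernel below is a made-up stand-in, and whether your shots are actually batched into one launch like this is an assumption on my part), this is what I mean by N_shots driving the available parallelism: the total thread count grows with the number of shots, and the GPU only reaches full throughput once there is enough work in flight to occupy all the “CUDA cores”.

```
// x covers the variables of one shot, y indexes the shot.
__global__ void bp_iteration_batched(const float* in, float* out,
                                     size_t n, int n_shots)
{
    size_t i    = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    int    shot = (int)blockIdx.y;
    if (i < n && shot < n_shots) {
        size_t idx = (size_t)shot * n + i;
        out[idx] = in[idx];   // placeholder for the real BP update
    }
}

// Launch sketch: grid.x sized for one shot, grid.y = number of shots per launch.
// dim3 grid((unsigned int)((n + 255) / 256), n_shots);
// bp_iteration_batched<<<grid, 256>>>(d_in, d_out, n, n_shots);
```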
Is the left BP time column measured per shot and round or are you taking an average?
Initially you have a roughly constant minimum time, which gets divided by ever larger factors.
Later the number of shots becomes the deciding factor and the time per shot approaches a fixed value?
When taking an average, the first run or first few runs could carry one-time costs, as Njuffa explained. If you divide by a growing number of runs, that effect becomes smaller.
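Spelling out the arithmetic of that simple model (treating the total as a fixed one-time cost plus a constant cost per round is an assumption):

total_time / (N_rounds * N_shots)
= (T_one_time + N_shots * N_rounds * T_round) / (N_rounds * N_shots)
= T_round + T_one_time / (N_rounds * N_shots)

So the tabulated per-round value falls off roughly like 1/N_shots at first and then flattens out at T_round, which matches the shape you describe.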