I’ve been implementing belief propagation (BP) for quantum error correction on a GPU, using CUDA to accelerate the decoding process. My setup involves running BP independently for each shot, where I allocate (cudaMalloc), transfer (cudaMemcpy), execute, and then free (cudaFree) memory separately per shot. Initially, I expected a mostly linear scaling in computation time as the number of shots increased. However, profiling revealed an inverse exponential trend—as shots increased, the per-shot GPU computation time dropped significantly before leveling off. This was surprising because, given my memory management approach, I assumed each shot would execute in isolation. I suspect that kernel execution is still leveraging some level of concurrency across streaming multiprocessors, or that CUDA’s runtime optimizations are overlapping operations in a way I hadn’t anticipated.
For context, 1 SHOT is an entire sequence of error correction including:
1- loading memory to the device
2- performing 10 iterations (ROUNDS) of serial kernel execution
3- loading memory back to the host
The 1st column in the table represents the time for 1 ROUND, i.e.
total_time / (N_rounds * N_shots)
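Roughly, each shot is structured like the sketch below (identifiers are placeholders, and the stand-in kernel body only marks where my actual BP update runs):

```
#include <cuda_runtime.h>

// Stand-in for one BP message-passing iteration; the real update is not shown here.
__global__ void bp_iteration(const float* in, float* out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // placeholder for the actual BP update
}

// One SHOT: allocate, copy in, run the rounds serially, copy out, free.
void run_shot(const float* h_in, float* h_out, size_t n, int n_rounds)
{
    float *d_in = nullptr, *d_out = nullptr;
    unsigned int blocks = (unsigned int)((n + 255) / 256);

    // 1- loading memory to the device
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // 2- performing 10 iterations (ROUNDS) of serial kernel execution
    for (int r = 0; r < n_rounds; ++r)
        bp_iteration<<<blocks, 256>>>(d_in, d_out, n);

    // 3- loading memory back to the host
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
}
```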
Does anyone have any ideas about how and why this is happening?
How quickly did the performance ramp up, e.g. how many iterations? How big a performance difference are we talking about?
“Warm-up” of GPU-internal hardware resources such as caches, TLBs, memory controllers, etc., should reach steady state after 5 to 10 iterations. Depending on kernel duration, dynamic clock boosting might take more iterations to reach steady state (on actively cooled cards I have seen the clock boosting process, working in conjunction with fan speed adjustments, continue for up to a couple of minutes).
That seems quite suboptimal from a performance perspective. Allocating and freeing memory are comparatively slow activities even on fast host systems. Ideally you would re-use allocated memory blocks as many times as you can. If the allocation sizes differ, are the differences small enough that you can allocate a block of the maximum expected size once, and then keep re-using that indefinitely?
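Something along these lines, as a rough sketch (buffer names and the stand-in kernel are placeholders, not your actual code):

```
#include <cuda_runtime.h>

// Placeholder for the BP kernel, just to make the sketch self-contained.
__global__ void bp_iteration(const float* in, float* out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Allocate once at the maximum expected size, re-use the same device buffers
// for every shot, and free only after the last shot.
void run_all_shots(const float* h_in, float* h_out,
                   size_t max_n, size_t n, int n_shots, int n_rounds)
{
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  max_n * sizeof(float));   // one-time allocation
    cudaMalloc(&d_out, max_n * sizeof(float));

    unsigned int blocks = (unsigned int)((n + 255) / 256);
    for (int s = 0; s < n_shots; ++s) {
        cudaMemcpy(d_in, h_in + (size_t)s * n, n * sizeof(float),
                   cudaMemcpyHostToDevice);
        for (int r = 0; r < n_rounds; ++r)
            bp_iteration<<<blocks, 256>>>(d_in, d_out, n);
        cudaMemcpy(h_out + (size_t)s * n, d_out, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }

    cudaFree(d_in);    // freed once, after all shots
    cudaFree(d_out);
}
```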
I added an image to my initial post, which might shed some light on the quantitative data I’m getting.
I completely agree with you that it is not the best implementation when running multiple rounds… but then again, my question still remains: why does computation time decrease when I compute multiple iterations serially?
I am afraid I have no clue what a “shot” is. Is this technical jargon used in a particular field? A translation from another language into English? How is “shot” different from “round”?
In any event, the difference in the numbers above is way too big to be explained by the factors I enumerated earlier.
I could imagine that the observation has something to do with how well the GPU is utilized. I don’t see information as to what GPU this is running on, but assuming it’s a high-end GPU with 10K to 20K “CUDA cores”, and assuming N_shots is strongly correlated with the available parallelism in the workload, we would expect full performance to be reached once all the “CUDA cores” are actively processing data in parallel.
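Purely for illustration (the kernel below is a made-up stand-in, and whether your shots are actually batched into one launch like this is an assumption on my part), this is what I mean by N_shots driving the available parallelism: the total thread count grows with the number of shots, and the GPU only reaches full throughput once there is enough work in flight to occupy all the “CUDA cores”.

```
// x covers the variables of one shot, y indexes the shot.
__global__ void bp_iteration_batched(const float* in, float* out,
                                     size_t n, int n_shots)
{
    size_t i    = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    int    shot = (int)blockIdx.y;
    if (i < n && shot < n_shots) {
        size_t idx = (size_t)shot * n + i;
        out[idx] = in[idx];   // placeholder for the real BP update
    }
}

// Launch sketch: grid.x sized for one shot, grid.y = number of shots per launch.
// dim3 grid((unsigned int)((n + 255) / 256), n_shots);
// bp_iteration_batched<<<grid, 256>>>(d_in, d_out, n, n_shots);
```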
Is the left BP time column measured per shot and round or are you taking an average?
Initially you have a roughly constant minimum time, which gets divided by ever larger factors.
Later the number of shots becomes the deciding factor and the time per shot approaches a fixed value?
When taking an average, the first run or first few runs could carry one-time costs, as Njuffa explained. If you divide by a growing number of runs, that effect becomes smaller.
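Spelling out the arithmetic of that simple model (treating the total as a fixed one-time cost plus a constant cost per round is an assumption):

total_time / (N_rounds * N_shots)
= (T_one_time + N_shots * N_rounds * T_round) / (N_rounds * N_shots)
= T_round + T_one_time / (N_rounds * N_shots)

So the tabulated per-round value falls off roughly like 1/N_shots at first and then flattens out at T_round, which matches the shape you describe.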