I have run into some unusual behavior while developing a CUDA software prefetcher for CUTLASS GEMV. My implementation runs the prefetcher concurrently, in a separate stream, alongside the CUTLASS GEMV (a rough sketch of the setup is at the end of this post). Here’s what I have observed so far:
Performance Improvement Without Warmup:
When I run just the software prefetcher and the CUTLASS GEMV concurrently (in separate streams), with no warmup, I observe a performance improvement compared to running the CUTLASS GEMV alone.
Performance Drop After Warmup:
Surprisingly, when I run a warmup kernel for both the prefetcher and the CUTLASS GEMV, the performance improvement essentially disappears. This is counterintuitive: a warmup kernel usually has only a minor impact on benchmark timing, yet here it significantly reduces the observed speedup.
I’m trying to understand why the warmup kernel would have such a dramatic effect on the concurrent performance of the prefetcher and CUTLASS GEMV. Any insights or suggestions would be greatly appreciated.
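For reference, here is a minimal sketch of the kind of setup I am describing. The kernels below are only placeholders standing in for the real CUTLASS GEMV and my prefetcher, and all sizes and launch parameters are illustrative, but the stream/event structure matches my experiment:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the CUTLASS GEMV: one thread per row of an n x n matrix.
__global__ void gemv_kernel(const float* A, const float* x, float* y, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float acc = 0.f;
        for (int col = 0; col < n; ++col) acc += A[row * n + col] * x[col];
        y[row] = acc;
    }
}

// Placeholder for the prefetcher: a grid-stride pass that touches A so it is
// (hopefully) cache-resident while the GEMV reads it.
__global__ void prefetch_kernel(const float* A, int n) {
    float v = 0.f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n * n;
         i += gridDim.x * blockDim.x)
        v += A[i];
    if (v == -1.f) printf("%f", v);  // keep the loads from being optimized away
}

int main() {
    const int n = 4096;
    float *A, *x, *y;  // contents left uninitialized; only timing matters here
    cudaMalloc(&A, n * n * sizeof(float));
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaStream_t gemvStream, prefetchStream;
    cudaStreamCreate(&gemvStream);
    cudaStreamCreate(&prefetchStream);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warmup: uncommenting these two lines is the only difference between my
    // "with warmup" and "without warmup" measurements.
    // gemv_kernel<<<(n + 255) / 256, 256, 0, gemvStream>>>(A, x, y, n);
    // cudaDeviceSynchronize();

    cudaEventRecord(start, gemvStream);
    prefetch_kernel<<<128, 256, 0, prefetchStream>>>(A, n);
    gemv_kernel<<<(n + 255) / 256, 256, 0, gemvStream>>>(A, x, y, n);
    cudaEventRecord(stop, gemvStream);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GEMV stream time: %.3f ms\n", ms);
    return 0;
}
```

The commented-out warmup launch is the only thing that changes between the two measurements I described above.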
So with warmup it got faster. What is unexpected about that? Why did it specifically have an effect on the concurrent execution? Or, in other words, why is the serial execution so much worse without warmup?
What’s unexpected isn’t that the warmup makes things faster, it’s how much faster they become. The warmup kernel itself only accounts for about 0.02 ms of overhead.
Yet without the warmup, the GEMV runs much slower, and the difference is far more than that 0.02 ms can explain.
Even more strangely, when I run GEMV together with my software prefetcher, I see a large performance improvement, but this improvement largely disappears once I perform a warmup run before the prefetcher+GEMV experiment.
In other words, the warmup shouldn’t be able to influence performance to this extent, yet it does. That’s the unexpected behavior I’m trying to understand.
cudaEvent-based timing might be giving you information that cannot be interpreted easily (and this is somewhat more likely in a multi-stream environment). If this were my experiment, I would probably double-check my interpretation by using Nsight Systems.
For a similar reason, I would probably also start by just inspecting the timing of the GEMV kernel, without any usage of the prefetch kernel.
If that does indeed show a similar performance progression (~2.8 ms to ~0.6 ms) and there is nothing obvious in the Nsight Systems timeline clouding things, then the next step could be to look at Nsight Compute to find the Pareto of performance limiters. You may start to get some concrete insight that way.
The first time you run a kernel may indeed vary performance-wise from subsequent runs. I don’t have an exhaustive list of all the reasons, but one possible contributor is the state of the caches. Nsight Compute will by default invalidate the caches (you can modify this behavior with profiling switches), so I would also pay attention to the kernel duration reported by Nsight Compute when I got to that step.

Another possible contributor to first-time behavior is lazy loading. I don’t expect that lazy loading could impact things by 1 ms or more, but it could affect things in the microsecond range. A side effect of lazy loading is synchronization, and this is one of the things I would consider. It’s hard to spot that in the Nsight Systems timeline, but you could compare the cudaEvent report against the Nsight Systems timeline to make inferences about where it may be impacting things, if anywhere.
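To make that concrete, the isolation experiment could look something like the fragment below. This reuses the placeholder gemv_kernel and the stream/event variables from the sketch in the first post; in the real experiment the launch would of course be your CUTLASS GEMV call.

```cpp
// Time only the GEMV over several iterations: iteration 0 will expose any
// one-time costs (lazy module loading, cold caches) separately from the
// steady-state duration. All names are placeholders from the earlier sketch.
for (int iter = 0; iter < 5; ++iter) {
    cudaEventRecord(start, gemvStream);
    gemv_kernel<<<(n + 255) / 256, 256, 0, gemvStream>>>(A, x, y, n);
    cudaEventRecord(stop, gemvStream);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("iter %d: %.3f ms\n", iter, ms);
}
// To rule lazy loading in or out, you could also rerun with the environment
// variable CUDA_MODULE_LOADING=EAGER set and compare iteration 0.
```

If iteration 0 lands around the slow number and the later iterations around the fast one even with no prefetcher anywhere in sight, that would point at first-launch effects rather than at the prefetcher itself.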
Thank you for your detailed response. You’re absolutely right: when I inspect the timings in Nsight Systems, I do see a different time than I expected, and it matches the time measured with warmup.
However, I’m not entirely sure why this discrepancy occurs. I had assumed my software prefetcher was functioning as intended, but I want to diagnose the issue and understand the root cause of the timing difference.
Do you have any advice on how I could investigate this further? I’ve already reviewed my code and confirmed that I’m not recording any timing before the kernel launches.
Thanks, I’m aware that this is probably not easy to diagnose. I meant diagnosing the root cause of why cudaEvents show a different time measurement than Nsight Systems does. Do you have any advice, by any chance? If not, I completely understand.
cudaEvent timestamps are captured at the GPU front end. If you push the cudaEventRecord and then a large amount of additional work is pushed by the driver to upload your kernel, the recorded timestamp for the event will be significantly before the start of the kernel.
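As a rough illustration (placeholder kernel and variables, not a complete program), the sequence looks like this from the timing code’s point of view:

```cpp
cudaEventRecord(start, stream);            // timestamp captured when the GPU front
                                           // end processes the event, not when the
                                           // kernel actually starts executing
my_kernel<<<grid, block, 0, stream>>>();   // on a first launch the driver may still
                                           // push module-upload work here, after
                                           // `start` has already been recorded
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);    // includes that extra work, so it can be
                                           // much larger than the kernel duration
```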
Developer tools use two techniques not available through the CUDA API.
For pre-Blackwell (and for some Blackwell environments), the tool can either measure the start timestamp directly before the kernel launch or instrument the kernel code to output the start timestamp. The end timestamp is taken when the work completes, through a mechanism not available through the current CUDA API.
For Blackwell+ the new hardware event system supports tracing when a grid is launched and completed in the hardware.
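As a rough user-level approximation of the "instrument the kernel code" idea (this is only a sketch, not what the tools actually do internally), you can read the GPU’s global nanosecond timer inside the kernel:

```cpp
// Sketch only: t_min must be initialized to ULLONG_MAX and t_max to 0 on the
// host before the launch. The timestamps are taken on the device itself, so
// they do not depend on when the front end processed the launch.
__device__ unsigned long long globaltimer_ns() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

__global__ void timed_kernel(unsigned long long* t_min, unsigned long long* t_max) {
    unsigned long long t0 = globaltimer_ns();
    // ... the kernel's real work goes here ...
    __syncthreads();
    unsigned long long t1 = globaltimer_ns();
    if (threadIdx.x == 0) {
        atomicMin(t_min, t0);   // earliest block start observed
        atomicMax(t_max, t1);   // latest block end observed
    }
}
```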
cudaEventRecord can run into numerous issues, especially when applications are using streams, resulting in timestamp values that are consistent with the hardware front end but not accurate for measuring a kernel.
If you need accurate timing for CUDA kernels, then I highly recommend you file a feature request asking for a more accurate method of collecting timestamps to be added to the CUDA API.