CUDA kernel is 6x slower in a model than in a standalone benchmark

A kernel takes only 0.08 ms in a standalone single-kernel benchmark, but in an actual model that runs many kernels sequentially, the same kernel takes 0.5 ms according to Nsight Systems. I also tried timing it with CUDA events and got the same number. This seems unexpected to me, because there should be no context switching or resource contention. What could be the cause?

FYI, the model/kernel is warmed up in both cases, and the kernel is just a CUTLASS conv kernel.
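
For reference, the event timing I'm referring to is the usual pattern, roughly like this (the kernel below is just an illustrative stand-in, not the actual CUTLASS conv):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative stand-in for the real CUTLASS conv kernel.
__global__ void dummy_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    // Warm-up launch so the timed launch is not the first one.
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Bracket the kernel with events on the same (default) stream.
    cudaEventRecord(start);
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel duration: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```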

A common factor in such cases is other activity on the GPU. A memory-bandwidth-bound kernel, for example, will generally run slower when there are other consumers of GPU memory bandwidth.

If you are talking about activity other than the model, I believe there wasn't any. If you are talking about the other kernels in the model, as I said, they run sequentially, so I assume they are irrelevant?

They might be irrelevant. If there is no data being copied to or from the GPU in any of this, and no other consumers of GPU bandwidth such as encoder/decoder activity, then my suggestion may not apply.

I don't know with certainty what is happening, and if you're convinced the two cases should be identical, then there certainly isn't enough information here for me to suggest anything else. Good luck!

Thanks! Do you have any suggestions for tools/metrics that would help debug this further? For example, I would like to see a breakdown of the kernel latency in both cases and compare them.

You can use Nsight Compute to get an indication of why the kernel behaves differently in the two cases. I don't have any specific metrics to suggest as a starting point; instead, I would follow a general strategy like the one outlined here.
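
One thing that can help when comparing the same kernel across two runs is to bracket the launch of interest with the profiler API so the tool only captures that region. A minimal sketch with a stand-in kernel is below; if I remember correctly, you would then run under ncu with --profile-from-start off (or nsys with --capture-range=cudaProfilerApi), but check the CLI documentation for your version:

```cpp
#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

// Stand-in for the kernel you want to compare across the two runs.
__global__ void dummy_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    // ... other kernels in the model would run here ...

    cudaProfilerStart();   // capture begins here when profiling from start is disabled
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaProfilerStop();    // capture ends here

    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}
```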

I don't know what you mean by "kernel latency" or "breakdown of kernel latency" in this setting. You are indicating that the same kernel has two very different durations, and you've indicated you already used Nsight Systems to determine the kernel duration. If you're looking at a GPU activity trace line in Nsight Systems to get this kernel duration info, then you are already looking at kernel activity, not latency. Latency is the difference in time between when you ask for something and when you get it. Applied to kernels, the only sensible use of the term that I know of is the difference between when you requested the kernel launch and when the kernel actually launched. If you are comparing kernel durations, as I have indicated or guessed, then latency is not the issue.
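
To make the distinction concrete, here is a rough sketch (with a stand-in kernel, not your actual one) that measures the two things separately: the host-side cost of the launch call, which is only a rough proxy for launch overhead, versus the GPU-side kernel duration from an event pair:

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel; the actual kernel does not matter for the illustration.
__global__ void dummy_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    // Warm up so one-time initialization costs are not timed.
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    auto t0 = std::chrono::steady_clock::now();
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);  // returns as soon as the launch is queued
    auto t1 = std::chrono::steady_clock::now();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float dur_ms = 0.0f;
    cudaEventElapsedTime(&dur_ms, start, stop);
    double launch_us = std::chrono::duration<double, std::micro>(t1 - t0).count();

    printf("host-side launch call: %.1f us (rough proxy for launch overhead)\n", launch_us);
    printf("GPU kernel duration:   %.3f ms\n", dur_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```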

If, in fact, your comparative numbers are not durations, then you can disregard everything I’ve said. There isn’t enough information in your posting to make what you are asking about clear.

I generally don’t find it useful or productive to discuss issues where there is a dearth of information, so I’m unlikely to respond to further requests here.

I just realized that I mistakenly set different input sizes; that's why the latency differs.

By kernel latency I just meant kernel duration. I'll still learn more about Nsight Compute, and thanks for the link!