A kernel takes only 0.08 ms in a standalone single-kernel benchmark, but in the actual model, which runs many kernels sequentially, the same kernel takes 0.5 ms according to Nsight Systems. I also tried timing it with CUDA events and got the same number. This is unexpected, because there should be no context switching or resource contention. What could be the cause?
FYI, the model/kernel is warmed up in both cases, and the kernel is just a CUTLASS conv kernel.
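For reference, the CUDA-event measurement I described looks roughly like the sketch below. `run_model`, `my_cutlass_conv`, `grid`, `block`, `args`, and `stream` are placeholders for whatever the model actually launches, not real names from my code:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Warm up so one-time costs (JIT, cache fills, clock ramp-up) are excluded.
for (int i = 0; i < 10; ++i) run_model(stream);
cudaStreamSynchronize(stream);

// Events are recorded into the same stream as the kernel, so the elapsed
// time between them is the kernel's duration on the GPU timeline.
cudaEventRecord(start, stream);
my_cutlass_conv<<<grid, block, 0, stream>>>(args);
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);

float ms = 0.f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel duration: %f ms\n", ms);
```

In the standalone benchmark only this kernel is launched; in the model case the same bracketing is done around this kernel while the surrounding kernels run as usual.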
A common factor in such cases is other activity on the GPU. A kernel that is memory-bandwidth bound, for example, will generally run slower when there are other consumers of GPU memory bandwidth.
If you mean activity other than the model, I believe there wasn't any. If you mean the other kernels in the model, as I said, they run sequentially, so I'd guess they are irrelevant?
You can use Nsight Compute to get an indication of why the kernel's behavior differs between the two cases. I don't have any specific metric suggestions to start with; instead, I would follow a general strategy like the one outlined here.
I'm not sure what you mean by "kernel latency" or "breakdown of kernel latency" in this setting. You are saying that the same kernel has two very different durations, and you've indicated you already used Nsight Systems to determine kernel duration. If you're reading that duration off a GPU activity row in the Nsight Systems timeline, then you are already looking at kernel activity, not latency. Latency is the difference in time between when you ask for something and when you get it. Applied to kernels, the only sensible use of the term that I know of is the time between when the kernel launch was requested and when the kernel actually began executing. If you are comparing kernel durations, as I have indicated or guessed, then latency is not the issue.
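To make the distinction concrete, here is a hypothetical sketch; `my_kernel`, `grid`, `block`, `args`, and `stream` are placeholders:

```cuda
#include <chrono>

auto t0 = std::chrono::steady_clock::now();       // launch requested on the host
cudaEventRecord(start, stream);
my_kernel<<<grid, block, 0, stream>>>(args);
cudaEventRecord(stop, stream);
cudaEventSynchronize(stop);
auto t1 = std::chrono::steady_clock::now();

float duration_ms = 0.f;
cudaEventElapsedTime(&duration_ms, start, stop);  // kernel *duration* on the GPU
double wall_ms =
    std::chrono::duration<double, std::milli>(t1 - t0).count();

// duration_ms is what a Nsight Systems GPU activity row shows.
// wall_ms - duration_ms approximates the launch/synchronization overhead,
// which is closer to what "latency" would mean here.
```

If your 0.08 ms and 0.5 ms numbers are both `duration_ms`-style measurements, then you are comparing durations, not latencies.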
If, in fact, your comparative numbers are not durations, then you can disregard everything I've said. There isn't enough information in your post to make clear what you are asking about.
I generally don’t find it useful or productive to discuss issues where there is a dearth of information, so I’m unlikely to respond to further requests here.