CUDA kernel is 6x slower in a model than in a standalone benchmark

A kernel takes only 0.08 ms in a standalone single-kernel benchmark, but in an actual model that runs many kernels sequentially, the same kernel takes 0.5 ms according to Nsight Systems. I also tried timing it with CUDA events and got the same number. This seems unexpected to me, because there should be no context switching or resource contention. What could be the cause?

FYI, the model/kernel is warmed up in both cases, and the kernel is just a CUTLASS conv kernel.
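
For reference, the event timing I'm referring to is the usual pattern, roughly like this (the kernel below is just an illustrative stand-in, not the actual CUTLASS conv):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative stand-in for the real CUTLASS conv kernel.
__global__ void dummy_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    // Warm-up launch so the timed launch is not the first one.
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Bracket the kernel with events on the same (default) stream.
    cudaEventRecord(start);
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel duration: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```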

A common factor in such cases is other activity on the GPU. A memory-bandwidth-bound kernel, for example, will generally run slower when there are other consumers of GPU memory bandwidth.

If you are talking about activity other than the model, I believe there wasn't any. If you are talking about the other kernels in the model, as I said, they run sequentially, so I assume they are irrelevant?

They might be irrelevant. If there is no data being copied to or from the GPU in any of this, and no other consumers of GPU bandwidth such as encoder/decoder activity, then my suggestion may not apply.

I don't know with certainty what is happening, and if you're convinced the two cases should be identical, then there certainly isn't enough information here for me to suggest anything else. Good luck!

Thanks! Do you have any suggestions for tools/metrics that would help debug this further? For example, I would like to see a breakdown of the kernel latency in both cases and compare them.

You can use Nsight Compute to get an indication of why the kernel behaves differently in the two cases. I don't have any specific metrics to suggest as a starting point; instead, I would follow a general strategy like the one outlined here.
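
One thing that can help when comparing the same kernel across two runs is to bracket the launch of interest with the profiler API so the tool only captures that region. A minimal sketch with a stand-in kernel is below; if I remember correctly, you would then run under ncu with --profile-from-start off (or nsys with --capture-range=cudaProfilerApi), but check the CLI documentation for your version:

```cpp
#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

// Stand-in for the kernel you want to compare across the two runs.
__global__ void dummy_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    // ... other kernels in the model would run here ...

    cudaProfilerStart();   // capture begins here when profiling from start is disabled
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaProfilerStop();    // capture ends here

    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}
```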

I don't know what you mean by "kernel latency" or "breakdown of kernel latency" in this setting. You are indicating that the same kernel has two very different durations, and you've indicated you already used Nsight Systems to determine the kernel duration. If you're looking at a GPU activity trace line in Nsight Systems to get this kernel duration info, then you are already looking at kernel activity, not latency. Latency is the difference in time between when you ask for something and when you get it. Applied to kernels, the only sensible use of the term that I know of is the difference between when you requested the kernel launch and when the kernel actually launched. If you are comparing kernel durations, as I have indicated or guessed, then latency is not the issue.
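
To make the distinction concrete, here is a rough sketch (with a stand-in kernel, not your actual one) that measures the two things separately: the host-side cost of the launch call, which is only a rough proxy for launch overhead, versus the GPU-side kernel duration from an event pair:

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel; the actual kernel does not matter for the illustration.
__global__ void dummy_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    // Warm up so one-time initialization costs are not timed.
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    auto t0 = std::chrono::steady_clock::now();
    dummy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);  // returns as soon as the launch is queued
    auto t1 = std::chrono::steady_clock::now();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float dur_ms = 0.0f;
    cudaEventElapsedTime(&dur_ms, start, stop);
    double launch_us = std::chrono::duration<double, std::micro>(t1 - t0).count();

    printf("host-side launch call: %.1f us (rough proxy for launch overhead)\n", launch_us);
    printf("GPU kernel duration:   %.3f ms\n", dur_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```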

If, in fact, your comparative numbers are not durations, then you can disregard everything I’ve said. There isn’t enough information in your posting to make what you are asking about clear.

I generally don’t find it useful or productive to discuss issues where there is a dearth of information, so I’m unlikely to respond to further requests here.

I just realized that I mistakenly set different input sizes; that's why the latency differs.

By kernel latency I just meant kernel duration. I'll still learn more about Nsight Compute, and thanks for the link!