Slowdown in kernel calls in Cuda 11.4 compared to Cuda 10.2

Hello, after upgrading my project from Cuda 10.2 to Cuda 11.4 I noticed a massive slowdown (around 50%) in kernel calls. Even an asynchronous call to an empty kernel function seems to take a massive amount of time. This is true for multiple Ubuntu versions and GPU models.
I’ve seen other people also report slowdowns but haven’t found what solution they arrived at in the end.

Since even empty function calls seem to take a long time, I’ll share my cmake flags that are related to Cuda, since those are the most likely suspects.

set(CMAKE_CUDA_COMPILER "/usr/local/cuda/bin/nvcc")
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Xcompiler -fPIC -rdc=true")
set(CMAKE_CUDA_STANDARD 11)
set(CMAKE_CUDA_STANDARD_REQUIRED TRUE)
set_target_properties(${PROJECT_NAME} PROPERTIES CUDA_RESOLVE_DEVICE_SYMBOLS ON)
set_target_properties(${PROJECT_NAME} PROPERTIES CUDA_ARCHITECTURES "72")

Thank you very much.

How do you measure kernel launch overhead? What is a “massive amount of time”?

I measure the overhead by calling an empty function several times on a stream and then calling cudaStreamSynchronize(). Like I said, the time in ms increases by around 50%. I’ll also mention that I use OpenCV in my project (it might be the cause of the slowdown).

I.e., the pseudo-code:

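<test.cu>
// Empty kernel; the body is intentionally blank.
static __global__ void Func(Data* data) {
}

<main.cpp>
int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);  // assumed: the stream is created before use
    Data* d = UploadDataToGpuAndReturnPointerToStruct();
    auto t = time();  // pseudo-code: any host timer, here in ms
    int N = 1000;
    for (int i = 0; i < N; ++i) {
        Func<<<1, 1024, 0, stream>>>(d);
    }
    cudaStreamSynchronize(stream);  // wait for all N kernels to finish
    printf("%f", (time() - t) / N);  // average time per kernel, in ms
    return 0;
}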

I suggest filing a bug with a compilable example program that shows the performance degradation. This has been the recommended action in other posts regarding performance issues.
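A compilable example for such a bug report could look roughly like the sketch below. It is only a starting point under some assumptions: the file name repro.cu, the kernel name EmptyKernel, and the CUDA_CHECK macro are made up for illustration, the int buffer stands in for the Data struct from the pseudo-code, and sm_72 is taken from the CUDA_ARCHITECTURES setting posted above. It repeats the same measurement as the pseudo-code (launch loop plus cudaStreamSynchronize) and prints the runtime and driver versions so the results from both toolkits can be compared.

```cuda
// repro.cu (hypothetical file name)
// Build, assuming an sm_72 device: nvcc -O2 -arch=sm_72 repro.cu -o repro
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel, standing in for Func from the pseudo-code.
__global__ void EmptyKernel(int* data) {}

// Illustrative helper: print the error and leave main() if a runtime call fails.
#define CUDA_CHECK(call)                                     \
  do {                                                       \
    cudaError_t err_ = (call);                               \
    if (err_ != cudaSuccess) {                               \
      std::fprintf(stderr, "%s failed: %s\n", #call,         \
                   cudaGetErrorString(err_));                \
      return 1;                                              \
    }                                                        \
  } while (0)

int main() {
  // Report the versions so runs under different toolkits can be told apart.
  int runtime_version = 0, driver_version = 0;
  CUDA_CHECK(cudaRuntimeGetVersion(&runtime_version));
  CUDA_CHECK(cudaDriverGetVersion(&driver_version));
  std::printf("runtime %d, driver %d\n", runtime_version, driver_version);

  cudaStream_t stream;
  CUDA_CHECK(cudaStreamCreate(&stream));
  int* d_data = nullptr;  // placeholder for the Data struct in the original
  CUDA_CHECK(cudaMalloc(&d_data, sizeof(int)));

  // Warm-up launch so one-time initialization is not part of the timing.
  EmptyKernel<<<1, 1024, 0, stream>>>(d_data);
  CUDA_CHECK(cudaGetLastError());
  CUDA_CHECK(cudaStreamSynchronize(stream));

  const int N = 1000;
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < N; ++i) {
    EmptyKernel<<<1, 1024, 0, stream>>>(d_data);
  }
  CUDA_CHECK(cudaStreamSynchronize(stream));  // wait for all N kernels
  auto t1 = std::chrono::steady_clock::now();
  CUDA_CHECK(cudaGetLastError());

  double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
  std::printf("average per kernel (launch + execution): %.3f us\n", us / N);

  CUDA_CHECK(cudaFree(d_data));
  CUDA_CHECK(cudaStreamDestroy(stream));
  return 0;
}
```

Building and running the same file with both toolkits on the same machine gives two numbers that can be attached to the bug report directly.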

(On a side note, from your description I thought you were measuring kernel launch overhead, but this does not seem to be the case looking at your code example, since you stop the timer only after waiting for the kernels to complete.)
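To make that distinction concrete, here is a minimal sketch that takes two host timestamps: one right after the launch loop returns and one after cudaStreamSynchronize(). The names EmptyKernel, Clock, and Micros are illustrative and not from the original post. The first interval approximates the host-side cost of issuing the launches, the second is the launch-plus-execution time that the pseudo-code measures. One caveat: if the driver's launch queue fills up, the launch calls themselves may block, so the first number can include some waiting on the GPU.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel used only to exercise the launch path.
__global__ void EmptyKernel() {}

int main() {
  using Clock = std::chrono::steady_clock;
  using Micros = std::chrono::duration<double, std::micro>;

  cudaStream_t stream;
  if (cudaStreamCreate(&stream) != cudaSuccess) return 1;

  // Warm-up launch so one-time initialization is excluded from both timings.
  EmptyKernel<<<1, 1024, 0, stream>>>();
  if (cudaStreamSynchronize(stream) != cudaSuccess) return 1;

  const int N = 1000;
  auto start = Clock::now();
  for (int i = 0; i < N; ++i) {
    EmptyKernel<<<1, 1024, 0, stream>>>();
  }
  auto enqueued = Clock::now();  // launch calls returned; kernels may still run
  cudaError_t err = cudaStreamSynchronize(stream);
  auto finished = Clock::now();  // all kernels have completed
  if (err != cudaSuccess) {
    std::fprintf(stderr, "sync failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  std::printf("host-side launch time per kernel: %.3f us\n",
              Micros(enqueued - start).count() / N);
  std::printf("launch + execution per kernel:    %.3f us\n",
              Micros(finished - start).count() / N);

  cudaStreamDestroy(stream);
  return 0;
}
```

If only the second number changes between toolkit versions, the regression is more likely in execution or scheduling than in the launch call itself.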
