I modified the 0_Introduction/clock sample to investigate the performance loss I see on a complex problem when using Dynamic Parallelism.
I observed huge time delays when launching kernels from within kernels, and I am trying to understand why this is happening.
I added the -rdc=true flag to the nvcc command line to enable relocatable device code, which is required for Dynamic Parallelism.
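A minimal command line for such a build (assuming a one-step compile/link of the sample's clock.cu; my actual -arch and include flags are omitted) would look roughly like:

nvcc -rdc=true -lcudadevrt clock.cu -o clock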
I then measured the elapsed time for different kernel executions:
cudaDeviceSynchronize();
std::chrono::high_resolution_clock::time_point t1, t2, t3;
t1 = std::chrono::high_resolution_clock::now();
timedReduction<<<NUM_BLOCKS, NUM_THREADS, sizeof(float) * 2 * NUM_THREADS>>>(
dinput, doutput, dtimer);
cudaDeviceSynchronize();
t2 = std::chrono::high_resolution_clock::now();
int gpu_duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
std::cout << "timedReduction duration " << gpu_duration << std::endl;
As you can see, I synchronized the device before starting the timer, so the measured time covers the kernel execution including the final synchronization.
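The numbers below come from this std::chrono measurement; an equivalent cross-check with CUDA events would look roughly like this (just a sketch, error checking omitted):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
timedReduction<<<NUM_BLOCKS, NUM_THREADS, sizeof(float) * 2 * NUM_THREADS>>>(
dinput, doutput, dtimer);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds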
The first execution of this code gives me an output of 2529us, and a second call to the same code gives an output of 6us. So I guess some kind of one-time kernel compilation/preparation is done on the first call.
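A simple way to keep that one-time cost out of the measurement would be an untimed warm-up launch before the timed one (a sketch using the same variables as above):

// untimed warm-up launch to absorb any one-time initialization
timedReduction<<<NUM_BLOCKS, NUM_THREADS, sizeof(float) * 2 * NUM_THREADS>>>(
dinput, doutput, dtimer);
cudaDeviceSynchronize();

// timed launch
t1 = std::chrono::high_resolution_clock::now();
timedReduction<<<NUM_BLOCKS, NUM_THREADS, sizeof(float) * 2 * NUM_THREADS>>>(
dinput, doutput, dtimer);
cudaDeviceSynchronize();
t2 = std::chrono::high_resolution_clock::now();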
The next test measures calling an empty kernel:
__global__ void empty_test2() {}
cudaDeviceSynchronize();
t1 = std::chrono::high_resolution_clock::now();
empty_test2<<<NUM_BLOCKS, NUM_THREADS, 0>>>();
cudaDeviceSynchronize();
t2 = std::chrono::high_resolution_clock::now();
gpu_duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
std::cout << "empty2 duration " << gpu_duration << std::endl;
The output for this code is 11us on the first call and 4us on the second. Pretty normal so far.
When I now introduce dynamic parallelism by calling a kernel from within a kernel, the times go through the roof:
__global__ void empty_test2() {}
__global__ void empty_test() { empty_test2<<<1, 1, 0, cudaStreamFireAndForget>>>(); }
cudaDeviceSynchronize();
t1 = std::chrono::high_resolution_clock::now();
empty_test<<<NUM_BLOCKS, NUM_THREADS, 0>>>();
cudaDeviceSynchronize();
t2 = std::chrono::high_resolution_clock::now();
gpu_duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
std::cout << "empty duration " << gpu_duration << std::endl;
Now the time for the first call is 1405us, but it does not go down for the second call and stays at 1248us. So doing nothing but a fire-and-forget launch is much slower than doing serious work. If I use cudaStreamTailLaunch instead, these times go up to 8000us.
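(The tail-launch variant is the same parent kernel with only the stream argument swapped; the name empty_test_tail is just for illustration:)

__global__ void empty_test_tail() { empty_test2<<<1, 1, 0, cudaStreamTailLaunch>>>(); }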
nsight-compute shows the time spent in empty_test as very short, followed by a huge cudaDeviceSynchronize block. I also cannot see the dynamically launched empty_test2 kernel in the timeline.
Do you have any insight into why this is happening and how to avoid the performance loss?
I have a complex problem that requires several kernels to run concurrently, and based on their output a recursion step is performed.
I want to avoid copying the kernels' output back to the host just to check whether my algorithm should proceed or not.
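To make the intended pattern concrete, here is a rough sketch of what I have in mind; kernelA, kernelB, step, decide and continue_flag are placeholder names, not my real code:

__device__ int continue_flag; // placeholder result written by the workers

__global__ void kernelA() { /* ... */ } // placeholder worker kernels
__global__ void kernelB() { /* ... */ }

__global__ void step(); // forward declaration for the recursion

// decide() is tail-launched, so it only starts after step() and its
// fire-and-forget children have finished and their results are visible.
__global__ void decide() {
    if (continue_flag)
        step<<<1, 1, 0, cudaStreamTailLaunch>>>();
}

__global__ void step() {
    kernelA<<<32, 256, 0, cudaStreamFireAndForget>>>(); // placeholder launch configs
    kernelB<<<32, 256, 0, cudaStreamFireAndForget>>>();
    decide<<<1, 1, 0, cudaStreamTailLaunch>>>();
}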
Here are my stats:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
on Ubuntu 23.10
Thank you for any advice on how to make use of dynamic kernel launches in a performant way.