Performance drops with dynamic parallelism

I modified the 0_Introduction/clock example to investigate the performance losses I am seeing on a complex problem when using Dynamic Parallelism.
I observe huge time delays when launching kernels from within kernels, and I am trying to understand why this is happening.

I added the -rdc=true flag to the nvcc command line to enable relocatable device code, which is required for Dynamic Parallelism.
And then I measured the elapsed time for different kernel executions.

  cudaDeviceSynchronize();
  std::chrono::high_resolution_clock::time_point t1, t2, t3;
  t1 = std::chrono::high_resolution_clock::now();
  timedReduction<<<NUM_BLOCKS, NUM_THREADS, sizeof(float) * 2 * NUM_THREADS>>>(
      dinput, doutput, dtimer);
  cudaDeviceSynchronize();
  t2 = std::chrono::high_resolution_clock::now();
  int gpu_duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
  std::cout << "timedReduction duration " << gpu_duration << std::endl;

As you can see, I synchronized the device before starting the timer, and then I measured the time after the kernel execution including the synchronization.

The first execution of this code prints
2529us, and a second call to the same code prints 6us.
So I guess some kind of kernel compilation/preparation is done on the first call.

The next test measures calling an empty kernel:

__global__ void empty_test2() {}

  cudaDeviceSynchronize();
  t1 = std::chrono::high_resolution_clock::now();
  empty_test2<<<NUM_BLOCKS, NUM_THREADS, 0>>>();
  cudaDeviceSynchronize();
  t2 = std::chrono::high_resolution_clock::now();
  gpu_duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
  std::cout << "empty2 duration " << gpu_duration << std::endl;

The output for this code is 11us and 4us on the second call. Pretty normal so far.

When I now introduce dynamic parallelism by calling a kernel from within a kernel, the times go through the roof:

__global__ void empty_test2() {}
__global__ void empty_test() { empty_test2<<<1, 1, 0, cudaStreamFireAndForget>>>(); }

  cudaDeviceSynchronize();
  t1 = std::chrono::high_resolution_clock::now();
  empty_test<<<NUM_BLOCKS, NUM_THREADS, 0>>>();
  cudaDeviceSynchronize();
  t2 = std::chrono::high_resolution_clock::now();
  gpu_duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
  std::cout << "empty duration " << gpu_duration << std::endl;

Now the time for the first call is 1405us, but it does not go down for the second call and stays at 1248us.
So doing nothing plus a fire-and-forget launch is much slower than doing serious work. If I use cudaStreamTailLaunch,
these times go up to 8000us.
Nsight Compute shows the time spent in empty_test as very short, followed by a huge cudaDeviceSynchronize block.
I also cannot see the dynamically launched empty_test2 kernel in the timeline.

Do you have any insights on why this is happening and how to avoid this performance loss?
I have a complex problem that requires several kernels to run concurrently, and based on their output a recursion is performed.
I want to avoid copying the output of the kernels back to the host just to check whether my algorithm should proceed or not.
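
For context, the host-driven alternative I want to avoid would look roughly like this (just a sketch; propagation_pass, d_state and d_continue are placeholder names), where every iteration stalls on a device-to-host copy of a single flag:

  int h_continue = 1;
  while (h_continue) {
      // one propagation pass writes a "should we continue?" flag to d_continue
      propagation_pass<<<grid, block>>>(d_state, d_continue);
      // synchronizes and adds a PCIe round trip on every iteration
      cudaMemcpy(&h_continue, d_continue, sizeof(int), cudaMemcpyDeviceToHost);
  }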

Here are my stats:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
on Ubuntu 23.10

Thank you for any advice on how to make use of dynamic kernel launches in a performant way.


To my knowledge, the cost of launching a kernel from device code is about the same as launching a kernel from the host, that is, they both utilize the same hardware mechanism.

Separate compilation can have a significant negative performance impact when compared to whole-program compilation. This is due to the overhead of following ABI requirements for all function calls, and the inability of the compiler to apply certain optimizations. You would want to turn on link-time optimization (--dlink-time-opt) when using -rdc=true.

I don’t have an explanation for this observation; maybe someone else does. Double-check the compilation settings to ensure that you are doing an optimized release build rather than a debug build (no debug flags), and that you are compiling for the correct target architecture.

So, in the dynamic parallelism case, you are launching 64x256 = 16,384 child kernels (one per parent kernel thread). It’s not clear what your expectations are exactly, but if I had to guess, you seem to have the latency of a single kernel launch in mind.

The device certainly does not have the capability to run 16,384 kernels at the same time. The maximum number of resident grids for most discrete GPUs since Volta is 128. If we imagined a perfect set of waves, that would require 128 waves (of 128 grids each) to tackle 16,384 kernel launches. That is 1248us/128 or less than 10us per wave. That 10us is on the order of the minimum kernel launch duration (launch latency plus execution time) which is probably in the 2-5us range depending on how it is measured.

From my perspective, the GPU is processing your waves of empty kernels pretty fast.

FWIW I ran your fire and forget test case in two conditions:

  1. as-is, so just measuring the single launch as you have shown it

  2. with an extra launch before the first cudaDeviceSynchronize() - so measuring the time of the second invocation of empty_test()

on my L4 GPU (you don’t seem to mention what GPU you are running on) and an H100 GPU, and I get:

         duration:
L4 case 1:   ~10000us
L4 case 2:      717us
H100 case 1: ~50000us
H100 case 2:    728us

(CUDA 12.2)

So whereas your first measurement of 1405us makes me quite envious, the second measurement of 1248us seems on the high side, but it may well depend on the GPU you are running on, and perhaps the CUDA version. Regarding the first measurement: CUDA lazy loading became the default around the CUDA 12.2 timeframe, and your compiler reports CUDA 12.0; in my CUDA 12.2 setup, the case 1 measurement above appears to be a lazy-loading artifact.

Here is an example test case on L4, CUDA 12.2:

# cat t202.cu
#include <chrono>
#include <iostream>

__global__ void empty_test2() {}
__global__ void empty_test() { empty_test2<<<1, 1, 0, cudaStreamFireAndForget>>>(); }
const int NUM_BLOCKS = 64;
const int NUM_THREADS = 256;

int main(int argc, char *argv[]){
  if (argc > 1) empty_test<<<NUM_BLOCKS, NUM_THREADS, 0>>>();
  cudaDeviceSynchronize();
  auto t1 = std::chrono::high_resolution_clock::now();
  empty_test<<<NUM_BLOCKS, NUM_THREADS, 0>>>();
  cudaDeviceSynchronize();
  auto t2 = std::chrono::high_resolution_clock::now();
  auto gpu_duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
  std::cout << "empty duration " << gpu_duration << std::endl;
}
# nvcc -o t202 t202.cu -arch=sm_89 -rdc=true
# compute-sanitizer ./t202
========= COMPUTE-SANITIZER
empty duration 19982
========= ERROR SUMMARY: 0 errors
# ./t202
empty duration 9789
# ./t202 1
empty duration 750
# CUDA_MODULE_LOADING=EAGER ./t202
empty duration 828
#

So that works out to about 6us per wave.

I think both nsys and ncu have limitations when it comes to CDP profiling, so that may be expected behavior.

I would hesitate to proceed very far into that before the above information is considered. However I will say that a general CUDA principle is that a kernel launch of <<<1,1>>> is never efficient and will never unlock the performance potential of the GPU. In any case where you care about performance, launching a kernel of <<<1,1>>> is immediately a red flag. Yes, I get it, I understand this is just a demonstrator. But the point is that you should attempt to use the GPU efficiently. A large number of very small kernel launches (e.g. even <<<32,32>>> would be questionably small, nowadays) is not the way to get best performance out of a GPU. Does your algorithm really require each thread to launch its own child kernel? From a parent kernel of 16,384 or more threads? Remember, a kernel likes to have (generally) a lot of data to work on using a lot of threads (for the purpose of latency hiding, as well as for launch overhead amortization). CDP in no way obviates that. Does each thread really prepare enough data to make a kernel launch an appropriate way to handle the work?
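
For illustration, a sketch of what restricting the launch to a single thread would look like for the empty test above (rather than one child launch per parent thread):

__global__ void empty_test() {
  // only one thread in the entire grid performs the child launch,
  // instead of 16,384 launches (one per parent thread)
  if (blockIdx.x == 0 && threadIdx.x == 0)
    empty_test2<<<1, 1, 0, cudaStreamFireAndForget>>>();
}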

I wouldn’t want to gloss over that. It might be your most important and relevant observation. A CDP launch per thread is nowhere near as efficient as just doing the processing in the thread itself, when the type and amount of per-thread processing is what it is in the timedReduction kernel.

Launching kernels with dynamic parallelism has two useful properties: it gets the task done with potentially more resources (e.g. threads) than the calling context has, and it provides a means of synchronization when the work is finished.

So there are very few cases where it would make sense to run a small kernel with Dynamic Parallelism instead of just doing the work directly.

Which of the two useful properties was your main reason to choose Dynamic Parallelism?

One of the standard examples for the use of dynamic parallelism is (or at least was when this feature was new) adaptive mesh refinement. In other words, dynamic parallelism adds flexibility. How does that use case fall into either of the two enumerated categories of useful properties? I am a bit puzzled at the moment.

@Robert_Crovella Thank you very much for your useful input. You are completely right, and I made a mistake: in this example I was supposed to call empty_test and empty_test2 with only 1 block and 1 thread.
The updated times are:

timedReduction duration 2514
timedReduction duration 7
empty2 duration 12
empty2 duration 6
empty duration 59
empty duration 22

While these timings are not as extreme, you can still see quite some overhead in the dynamic case.
empty duration takes 22us to start two kernels (the parent and then the child).
If I do this without dynamic parallelism like so

  empty_test2<<<1, 1, 0>>>();
  empty_test2<<<1, 1, 0>>>();
  empty_test2<<<1, 1, 0>>>();
  empty_test2<<<1, 1, 0>>>();
  cudaDeviceSynchronize();

it only takes 1us-2us per additional call to an empty kernel, while in the dynamic parallelism case every empty_test2 call I add costs another 7us-10us of overhead, like here:

__global__ void empty_test() {
  empty_test2<<<1, 1, 0, cudaStreamFireAndForget>>>();
  empty_test2<<<1, 1, 0, cudaStreamFireAndForget>>>();
  empty_test2<<<1, 1, 0, cudaStreamFireAndForget>>>();
}

And they could even run in parallel and should be basically free.
My GPU is a NVIDIA RTX A500 Laptop GPU.

I am using dynamic parallelism to implement some algorithm logic on the GPU, so I use it as a means of synchronization.
I have a fixpoint algorithm that I want to implement; here is some pseudocode for it:

__global__ void analyze(MySolver solver);            // forward declarations
__global__ void propagate_control(MySolver solver);

// decides, based on the consolidated state, whether to stop,
// run analyze, or start the next propagation iteration
__global__ void propagate_end(MySolver solver) {

  if (foo)
  {
    if (bar) return;
    analyze<<<1, 1,  something, cudaStreamTailLaunch>>>(solver);
    return;
  }
  else if (foobar)
  {
    propagate_control<<<1, 1, 0, cudaStreamTailLaunch>>>(solver);
  }
}

__global__ void propagate_control(MySolver solver) {
  // the two workhorses may run concurrently
  binary_propagation<<<8, 32, 0, cudaStreamFireAndForget>>>(solver);
  nary_propagation<<<8, 32, 0, cudaStreamFireAndForget>>>(solver);
  // consolidate once the fire-and-forget children have finished
  valid_propagation<<<1, 32, 0, cudaStreamTailLaunch>>>(solver);
  // control step: decide whether to recurse
  propagate_end<<<1, 1, 0, cudaStreamTailLaunch>>>(solver);
}

I start the kernel propagate_control with 1 block and 1 thread.
The workhorses are binary_propagation and nary_propagation, which can both run in parallel (fire and forget). Afterwards I need to consolidate the data with valid_propagation, which is why I use a tail launch.
After this validation step I want to decide whether to continue the propagation or not.
I need an assistance kernel propagate_end as a tail launch to ensure that valid_propagation has actually finished. Then I can check whether to start over with the next iteration of propagate_control. All kernels modify some global data in solver.

I experience a huge overhead in this construct, as every kernel call in this dynamic parallelism construct costs me maybe 10us of overhead. The runtime seems to be dominated by this overhead and not by the actual work in the kernels. For example, 10 recursive iterations each calling 5 kernels amounts to 500us of overhead in this case, just for algorithmic control.
Is there a way to avoid these overheads? I read about CUDA graphs with conditional nodes; maybe I can create a recursive graph for my needs that does not have this huge overhead and also does not need the -rdc option?
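
Something like the following is what I have in mind (an untested sketch based on my reading of the conditional-node material; if I understand correctly it needs CUDA 12.3 or newer, so a newer toolkit/driver than listed above; check_fixpoint and solver_converged are hypothetical stand-ins for the foo/bar checks in propagate_end, and the analyze branch is left out). Graph edges replace the tail launches, and the WHILE node replaces the recursion, so the whole loop runs without host round trips:

__global__ void check_fixpoint(MySolver solver, cudaGraphConditionalHandle handle)
{
  // solver_converged: hypothetical device helper standing in for my real checks
  // 1 = run the loop body again, 0 = leave the WHILE node
  cudaGraphSetConditional(handle, solver_converged(solver) ? 0 : 1);
}

// helper to cut down on cudaKernelNodeParams boilerplate
cudaGraphNode_t addKernel(cudaGraph_t g, void *func, dim3 grid, dim3 block,
                          void **args, const cudaGraphNode_t *deps, size_t ndeps)
{
  cudaKernelNodeParams p = {};
  p.func = func; p.gridDim = grid; p.blockDim = block; p.kernelParams = args;
  cudaGraphNode_t n;
  cudaGraphAddKernelNode(&n, g, deps, ndeps, &p);
  return n;
}

void run_fixpoint(MySolver solver)
{
  cudaGraph_t graph;
  cudaGraphCreate(&graph, 0);

  // default value 1 so the loop body runs at least once
  cudaGraphConditionalHandle handle;
  cudaGraphConditionalHandleCreate(&handle, graph, 1, cudaGraphCondAssignDefault);

  cudaGraphNodeParams cParams = {};
  cParams.type = cudaGraphNodeTypeConditional;
  cParams.conditional.handle = handle;
  cParams.conditional.type = cudaGraphCondTypeWhile;
  cParams.conditional.size = 1;
  cudaGraphNode_t whileNode;
  cudaGraphAddNode(&whileNode, graph, nullptr, 0, &cParams);
  cudaGraph_t body = cParams.conditional.phGraph_out[0];

  // one propagation iteration as the loop body; dependencies replace the tail launches
  void *solverArgs[] = { &solver };
  void *checkArgs[]  = { &solver, &handle };
  cudaGraphNode_t b = addKernel(body, (void*)binary_propagation, dim3(8), dim3(32), solverArgs, nullptr, 0);
  cudaGraphNode_t n = addKernel(body, (void*)nary_propagation,   dim3(8), dim3(32), solverArgs, nullptr, 0);
  cudaGraphNode_t bn[] = { b, n };
  cudaGraphNode_t v = addKernel(body, (void*)valid_propagation,  dim3(1), dim3(32), solverArgs, bn, 2);
  addKernel(body, (void*)check_fixpoint, dim3(1), dim3(1), checkArgs, &v, 1);

  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, 0);
  cudaGraphLaunch(exec, 0);
  cudaDeviceSynchronize();
}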

@njuffa
Thanks for looking into this. I tried your option, here is my compile call:

/usr/bin/nvcc -ccbin g++ -I../../../Common -rdc=true --dlink-time-opt -m64 --threads 0 --std=c++11 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -o clock.o -c clock.cu

But I get this error:

nvcc fatal   : '-dlto' conflicts with '-gencode' to control what is generated; use 'code=lto_<arch>' with '-gencode' instead of '-dlto' to request lto intermediate

and I’m not sure what to do about it.

Thank you everybody for taking the time to look into my problem.

I would say invoking the sub-kernel makes it possible to do more work than the calling thread could do by itself?

You are welcome to formulate the reasons for dynamic parallelism differently.

My guess is just that, if the invoked grid*block size is small, dynamic parallelism is perhaps not the perfect tool.

Even if, for better memory accesses and parallel computation, you want to invoke a warp of 32 threads on behalf of some single threads, it is probably easier to reshuffle and serialize that work within the current warp, distributing it from those threads to the whole warp, instead of launching a kernel (see the sketch below).
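
A minimal sketch of what I mean (the names task, data, out and N are made up for illustration, and I assume the full warp is active):

__device__ void process_task_with_warp(int task, const float *data, float *out, int N)
{
  const unsigned full_warp = 0xffffffffu;   // assumption: all 32 lanes are active
  int lane = threadIdx.x & 31;

  // broadcast lane 0's task to the whole warp (the other lanes' values are ignored)
  task = __shfl_sync(full_warp, task, 0);

  // all 32 lanes now cooperate on the work a <<<1, 32>>> child kernel would have done
  float partial = 0.f;
  for (int i = lane; i < N; i += 32)
    partial += data[task * N + i];

  // warp-level reduction of the partial sums
  for (int offset = 16; offset > 0; offset >>= 1)
    partial += __shfl_down_sync(full_warp, partial, offset);

  if (lane == 0)
    out[task] = partial;
}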

The error message includes instructions on how to proceed. The error message refers to -dlto, which is the short form of the --dlink-time-opt switch. So you would do something like this (example from Windows, thus .obj and .exe):

nvcc -c -rdc=true -gencode arch=compute_61,code=lto_61 -gencode arch=compute_75,code=lto_75 -o foo.obj foo.cu
nvcc -dlto -gencode arch=compute_61,code=sm_61 -gencode arch=compute_75,code=sm_75 -o foo.exe foo.obj

When compiling, one specifies that LTO-enabled object files are to be generated for the selected architectures, and at the link stage one then specifies the target architecture(s) for the binary executable in the conventional way.

I don’t know why nvcc operates in this way, possibly to allow one to specify targets with and without LTO support in the same compiler invocation, instead of enabling LTO for all specified targets.
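
Applied to the compile line shown earlier in the thread, that would mean something along these lines (a sketch trimmed to two of the architectures for brevity; the remaining -gencode pairs follow the same pattern):

nvcc -ccbin g++ -I../../../Common -rdc=true -m64 --std=c++11 -gencode arch=compute_86,code=lto_86 -gencode arch=compute_89,code=lto_89 -o clock.o -c clock.cu
nvcc -dlto -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -o clock clock.o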

To avoid misunderstandings: I was not questioning the validity of the classification mechanism offered. But this scheme was new to me, and I was simply wondering how it applies to the (common?) case of adaptive mesh refinement.

You’re stacking up latencies of empty kernels. I’m sure that seems like a sensible test measurement to you, but you will never get around latency that way. The GPU is a latency-hiding machine. That means you must give it lots of other work to do, besides the latency-intensive work, so that it can hide latency. The general idea is covered in unit 3 of this online training series, and is the general motivation behind giving a kernel launch lots of threads with lots of work to do.

How would we apply that to your fire and forget test case? If you don’t ever attempt to give the machine anything else to do, then you are just benchmarking latencies. However, if we give the machine something else to work on, and size our work appropriately, a fire and forget launch could be “almost” free:

# cat t203.cu
#include <chrono>
#include <iostream>

using mt = float;
const int nTPBx = 32;
const int nTPBy = 16;
const int nBLKx = 10;
const int nBLKy = 2*nBLKx;
const int dN = nTPBx*nTPBy*nBLKx*nBLKy;
__device__ mt dA[dN];
__device__ mt dB[dN];
__device__ mt dC[dN];

template <typename T>
__device__ void mm(const int idx, const int idy, const T *A, const T *B, float *C, const int N){
  if ((idx < N) && (idy < N)){
    mt sum = 0;
    for (int i = 0; i < N; i++) sum += A[idy*N+i] * B[i*N+idx];
    C[idy*N+idx] = sum;}
}

__global__ void non_empty_test2(int myN) {

  int idx = blockIdx.x*blockDim.x+threadIdx.x;
  int idy = blockIdx.y*blockDim.y+threadIdx.y;
  mm(idx, idy, dA, dB, dC, myN);}

__global__ void non_empty_test(bool spawn) {
  int idx = blockIdx.x*blockDim.x+threadIdx.x;
  int idy = blockIdx.y*blockDim.y+threadIdx.y;
  if ((spawn) && (idx == 0) && (idy == 0)) non_empty_test2<<<dim3(1,2), dim3(nBLKx,nBLKy), 0, cudaStreamFireAndForget>>>(nBLKx);
  mm(idx, idy, dA, dB, dC, nTPBx*nBLKx);
}

int main(int argc, char *argv[]){
  bool myspawn = false;
  if (argc > 1) myspawn = true;
  non_empty_test<<<dim3(nBLKx, nBLKy), dim3(nTPBx, nTPBy)>>>(myspawn);
  cudaDeviceSynchronize();
  auto t1 = std::chrono::high_resolution_clock::now();
  non_empty_test<<<dim3(nBLKx, nBLKy), dim3(nTPBx, nTPBy)>>>(myspawn);
  cudaDeviceSynchronize();
  auto t2 = std::chrono::high_resolution_clock::now();
  auto gpu_duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
  std::cout << "duration " << gpu_duration << std::endl;
}

# nvcc -o t203 t203.cu -rdc=true -arch=sm_89
# compute-sanitizer ./t203 1
========= COMPUTE-SANITIZER
duration 11066
========= ERROR SUMMARY: 0 errors
# ./t203
duration 53
# ./t203 1
duration 55
# ./t203
duration 53
# ./t203
duration 55
# ./t203 1
duration 54
#

CUDA 12.2, L4 GPU

Your A500 GPU is considerably smaller than my L4 (16 SMs vs. 58) so I’m not suggesting you will see exactly the same numbers. You might need to fiddle with nBLKx to see the best case on your GPU. But if you make nBLKx really small, you will once again be in the same domain where you are just looking at latency, and the launch with the spawn will be noticeably longer (almost 2x) than the launch without the child kernel spawn.

A tail launch behaves differently and serves a different need. Whereas the fire-and-forget launch allows that work to run concurrently with the parent kernel, so that the parent kernel can in some cases help hide the effect of the child kernel (both its launch latency and, in some cases, its execution duration), the tail launch doesn’t afford exactly the same opportunities. You may now have a framework to explore that and make discoveries.
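
As a starting point for that exploration, the only change needed in t203.cu above is the stream argument of the child launch; something like:

  // same child launch as in non_empty_test(), but as a tail launch:
  // it is scheduled to run only after the parent grid has completed
  if ((spawn) && (idx == 0) && (idy == 0))
    non_empty_test2<<<dim3(1,2), dim3(nBLKx,nBLKy), 0, cudaStreamTailLaunch>>>(nBLKx);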

It was something I came up with spontaneously when answering the question: I was thinking about what use cases dynamic parallelism could have when a direct function call would be an alternative, and why one would use dynamic parallelism with such a small task size.

There is certainly an intersection between adaptive mesh refinement and the idea of synchronization.

@Robert_Crovella Thank you very much. I see that in your example you can do more work in the same time. I just have to figure out how to adapt this to my needs. I guess my uncontrolled workload is too short, so my intermediate kernel calls are dominated by overhead.
I will try to use a CUDA graph to see if it reduces the overhead, but I have my doubts, as I do not think it works much differently from dynamic parallelism.

So I guess from there it's maybe just a matter of building a bigger kernel that does more different things that hopefully take longer, and having block 0 / thread 0 control the execution. I fear that this will lead to quite some divergence, but I have no experience with this yet. I will even try to have different blocks in the same kernel do completely different things, and synchronize afterwards for control, roughly as sketched below.
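
Roughly what I have in mind (an untested sketch; binary_propagation_work, nary_propagation_work and validate_and_decide are hypothetical device functions; it requires a cooperative launch, so all blocks must be able to be resident on the device at once):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__device__ int d_continue;   // written by the control block each iteration

__global__ void big_kernel(MySolver solver)
{
  cg::grid_group grid = cg::this_grid();
  do {
    // different blocks do completely different things in the same kernel
    if (blockIdx.x < 8)  binary_propagation_work(solver);
    else                 nary_propagation_work(solver);
    grid.sync();                       // replaces the tail launch of valid_propagation
    if (blockIdx.x == 0) validate_and_decide(solver, &d_continue);
    grid.sync();                       // assumes the write to d_continue is visible after the barrier
  } while (d_continue);
}

// host side: cooperative kernels must be launched with cudaLaunchCooperativeKernel
// void *args[] = { &solver };
// cudaLaunchCooperativeKernel((void*)big_kernel, dim3(16), dim3(32), args, 0, 0);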

Thanks a lot for these insights, I really appreciate all your help.
It would also be cool if dynamic parallelism could be shown correctly in Nsight, as right now only the calling kernel is shown, with no way of seeing what happens inside.