Reusing cudaGraphInstance with dynamic parallelism causes error 4(launchFailure)


I am using CUDA 10.0 and its graph features and want to add a kernel using dynamic parallelism to the graph:

#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>
#include <cuda_device_runtime_api.h>
__global__ void childKernel(){
    if(threadIdx.x == 0 && blockIdx.x == 0 && blockIdx.y == 0){
        printf("hello from childKernel\n");

__global__ void parentKernel() { 
    childKernel<<<dim3(60000,10), 64>>>();

    auto ret = cudaDeviceSynchronize();
    if(ret != cudaSuccess){
        printf("CudaStreamSynchronize failed with %i",ret);

int main() {
  cudaGraph_t graph;
  cudaGraphNode_t node;
  cudaKernelNodeParams params;
  params.func = (void*) parentKernel;
  params.extra = nullptr;
  params.gridDim = dim3(1);
  params.blockDim = dim3(1);
  params.sharedMemBytes = 0;
  params.kernelParams = nullptr;

  cudaGraphExec_t instance;
  cudaStream_t myStream;

  for(int i = 0; i < 100000; ++i){
    auto err = cudaStreamSynchronize(myStream);
    if (err != cudaSuccess) {
      printf("CUDA Error %d occured\n", err);



  return 0;

The output of this program is:

hello from childKernel
hello from childKernel
CUDA Error 4 occured

The first graph-launch completes successfully, but the second one fails.

For the example+cmake script, please see

I tried this with a GeForce RTX2060 on both Linux and Windows and get the same output.

If i do one of the following steps, the error vanishes:

  • remove cudaDeviceSynchronize() from the parent kernel
  • create a new cudaGraphInstance for every graph-launch,
  • make the launch-size of the child-kernel smaller (e.g. remove the second dimension)

Are limits for child-kernels stricter than the general limit? But if so, how does this only happen if I reuse a graph instance?

Any advice, tips or experience would be greatly appreciated as neither of the fixes I found are viable in my application.



my suggestion would be to file a bug. Normally I would be skeptical of very large numbers of child-kernel launches due to dynamic parallelism limits such as the launch pending limit, however your test case appears to work for me on CUDA 11.1 if I don’t use graphs. Both cuda graphs and dynamic parallelism use up GPU resources to support them, so there may be some limit that is exceeded when you combine them with the specific dimensions you are using.

Thanks, i will file a bug. Without using cudaGraph, I also do not get the error. I don’t really understand how such a small example can exceed limits as its really only one child kernel, but a big one. Increasing cudaLimitDevRuntimePendingLaunchCount also does not solve the issue.

Sorry, disregard those statements. Yes, it is only one child kernel being launched. I got confused.