[CUDA Dynamic Parallelism] How to avoid unnecessary child kernel launch overhead?

Hi, I am a beginner in CUDA.
Recently I was trying to implement BFS with CUDA Dynamic Parallelism (CDP).

if (id < TEST_NODE && Fa[id] == true && Xa[id] == true)
{
	Fa[id] = false;                 // remove node from the current frontier
	int start = Va[id].start;       // CSR offset of this node's edge list
	int end = start + Va[id].length;
	if (Va[id].length >= THRESHOLD) // CDP condition
	{
		int num_blks = (Va[id].length / 32) + (Va[id].length % 32 != 0); // ceil(length / 32)
		int threads = 32;
		// launch child kernel
		CUDA_BFS_NEIGHBOR_KERNEL <<<num_blks, threads>>> (Va, Ea, Fa, Xa, Ca, done, d, start, end);
	}
	else
	{
		for (int i = start; i < end; i++)
		{
			int nid = Ea[i];          // neighbor id
			if (Xa[nid] == false)     // not visited yet
			{
				Ca[nid] = Ca[id] + 1; // level = parent level + 1
				Fa[nid] = true;       // add to next frontier
				*done = false;        // new node discovered; keep iterating
			}
		}
	}
}

The above code is part of the BFS kernel. When the number of neighbors of a node is at least the given threshold, a child kernel is launched to visit them in parallel.

	if (Va[id].length >= THRESHOLD) // THRESHOLD = 1048576
	{
		int num_blks = (Va[id].length / 32) + (Va[id].length % 32 != 0);
		int threads = 32;
		CUDA_BFS_NEIGHBOR_KERNEL <<<num_blks, threads>>> (Va, Ea, Fa, Xa, Ca, done, d, start, end);
	}
	else
	{
		...; // visit neighbors w/o CDP
	}
}

However, after I turned CDP off (by setting the threshold to a very large number), I found that even though the CDP condition never held, the child kernel launch overhead was still there (elapsed time ≈ 125 ms for BFS).

	if (Va[id].length >= THRESHOLD) // THRESHOLD = 1048576
	{
		int num_blks = (Va[id].length / 32) + (Va[id].length % 32 != 0);
		int threads = 32;
		// CUDA_BFS_NEIGHBOR_KERNEL <<<num_blks, threads>>> (Va, Ea, Fa, Xa, Ca, done, d, start, end);
	}
	else
	{
		...; // visit neighbors w/o CDP
	}
}
	if (Va[id].length >= THRESHOLD) // THRESHOLD = 1048576
	{
		int num_blks = (Va[id].length / 32) + (Va[id].length % 32 != 0);
		int threads = 32;
		while(true) printf("Oh NO...");
		CUDA_BFS_NEIGHBOR_KERNEL <<<num_blks, threads>>> (Va, Ea, Fa, Xa, Ca, done, d, start, end);
	}
	else
	{
		...; // visit neighbors w/o CDP
	}
}

But when I delete the line of code that launches the child kernel, or add a while(true); above it, the performance improves (elapsed time ≈ 25 ms for BFS).

What I want to ask is:

  • Where does this launch overhead come from (when the CDP condition is never met)?
  • Can I avoid unnecessary launch overhead when I don’t want to start the kernel (the number of neighbors is less than the threshold)?

I have the same problem; I hope someone can help us. Thanks!

I also checked the .ptx files, and there are differences between the compiled results of the two versions. The slower version contains these extra declarations:

.extern .func  (.param .b64 func_retval0) cudaGetParameterBufferV2
(
    .param .b64 cudaGetParameterBufferV2_param_0,
    .param .align 4 .b8 cudaGetParameterBufferV2_param_1[12],
    .param .align 4 .b8 cudaGetParameterBufferV2_param_2[12],
    .param .b32 cudaGetParameterBufferV2_param_3
)
;
.extern .func  (.param .b32 func_retval0) cudaLaunchDeviceV2
(
    .param .b64 cudaLaunchDeviceV2_param_0,
    .param .b64 cudaLaunchDeviceV2_param_1
)
;

Use of CDP (whether the launch is actually executed or not) requires generation of relocatable device code with device linking. Depending on your code structure and whether this is actually needed elsewhere, it may affect the performance of “ordinary” CUDA code. There are a variety of questions about this on a variety of forums, including this one.
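For reference, building any code that uses CDP means compiling with relocatable device code and linking against the device runtime. A typical nvcc invocation looks like the following (the file name and architecture are placeholders for your own):

```shell
# -rdc=true generates relocatable device code (required for CDP);
# -lcudadevrt links the CUDA device runtime that provides device-side launch.
nvcc -arch=sm_35 -rdc=true bfs.cu -o bfs -lcudadevrt
```

Compiling without `-rdc=true` (only possible once no device-side launch remains in the code) lets the compiler do whole-program optimization within the compilation unit, which is one reason removing the launch can change performance.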

Whether this idea is applicable or relevant in your case I cannot say based on code snippets alone. It is certainly plausible for the case where you delete the launch line; I’m not sure about the while(true) case. So I think it’s quite possible this is not relevant to whatever you are observing.


Thanks for your reply!

I found some discussions about the possibility of dynamic linking. It seems that relocatable code is indeed required.

This may be of interest.
