Kernel With Grid Size in GPU Memory

For an algorithm I am currently implementing, it would be very convenient to have a way to launch kernels whose grid size depends on a parameter held in GPU memory (computed by previous kernels), without paying the cost of copying the parameter back to the host, synchronizing, and only then launching the kernel with the desired grid size, since we are aiming for maximum performance.
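For concreteness, here is a minimal sketch of the pattern we are trying to avoid (d_n, d_data, process, and stream are placeholders, not our actual code):

```cpp
// Baseline: copy the size back to the host, synchronize, then launch
// with a grid that exactly covers the actual problem size.
int h_n = 0;
cudaMemcpyAsync(&h_n, d_n, sizeof(int), cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);   // host stalls here until the copy lands

const int block = 256;
const int grid  = (h_n + block - 1) / block;
process<<<grid, block, 0, stream>>>(d_data, h_n);
```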

As far as I understand, cudaLaunchHostFunc is entirely useless for this purpose, since calling any CUDA API function from within the enqueued host function is forbidden, which makes perfect sense for maintaining a deterministic order of operations.

The only way I can see to accomplish something close to this would be to resort to dynamic parallelism: launching a kernel from the host that does nothing but launch the desired kernel with the appropriate grid size. But this seems rather inelegant, not to mention that we would pay the kernel launch overhead twice (plus any potential performance losses from using dynamic parallelism in general).
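A hypothetical sketch of that workaround, reusing the placeholder names from above (as far as I can tell, this requires compiling with -rdc=true and linking against cudadevrt, and the exact child-launch stream semantics differ between the legacy CDP1 and the CUDA 12 CDP2 models):

```cpp
// Launcher kernel, run with <<<1, 1>>>: a single device thread reads the
// grid size from GPU memory and launches the real kernel, so the host
// never has to copy the parameter back or synchronize.
__global__ void launcher(const int* d_n, float* d_data)
{
    const int n     = *d_n;
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    process<<<grid, block>>>(d_data, n);
}
```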

Given that we are dealing with a problem whose size can go up to 2^22 = 4194304, but whose typical values are probably 20 to 40 times smaller than that, launching enough threads to cover the worst-case scenario seems like a rather wasteful alternative.

I am posting here in the hope that I have overlooked some more obscure or arcane part of the API that would provide the necessary functionality.

Thank you for taking the time to read this and, hopefully, for answering.

Not necessarily. Depending on the amount of work performed by the kernel and where the bottlenecks in the kernel are, the performance impact could even be minimal. This assumes that the kernel checks the “parameter held in GPU memory” early on, so that threads with nothing to do terminate immediately.
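A minimal sketch of that pattern, with placeholder names (d_n holds the size computed by an earlier kernel; note that the kernel now reads the size itself instead of receiving it by value):

```cpp
// Kernel sized for the worst case: threads beyond the actual problem
// size read the parameter and exit immediately.
__global__ void process(const int* d_n, float* d_data)
{
    const int n = *d_n;   // actual size, computed by a previous kernel
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;   // nothing to do for this thread
    d_data[i] *= 2.0f;    // stand-in for the real per-element work
}

// Host side: grid covers the 2^22 worst case; no copy-back, no sync.
const int N_MAX = 1 << 22;
const int block = 256;
const int grid  = (N_MAX + block - 1) / block;
process<<<grid, block, 0, stream>>>(d_n, d_data);
```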

One of the major incentives for adding dynamic parallelism to CUDA was precisely this kind of flexible run-time addition of threads. But the main use case envisioned for it is when such expansion needs to happen at multiple points during execution, e.g. processing on a gridded or triangulated surface where higher resolution is required near edges to accurately model the underlying physical boundary processes.

You seem to describe a single parameter whose value is known to all threads the moment a kernel starts running. If so, dynamic parallelism seems like overkill here, and a simple oversized grid seems worth trying.

Indeed, contrary to what some (probably flawed) initial benchmarking had led me to believe, this reduced the execution time of some steps that used kernels like this by almost a factor of 2! Thank you so much for the suggestion.