For an algorithm I am currently implementing, it would be very convenient to have a way to launch kernels whose grid size depends on a parameter held in GPU memory (computed by previous kernels), without paying the cost of copying that parameter back to the host, synchronizing, and only then launching the kernel with the desired grid size — we are aiming for maximum performance.
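For reference, this is a sketch of the round trip I am trying to avoid (names like `myKernel` and `d_count` are illustrative):

```cuda
// Host-side round trip: read back the size, stall, then launch.
int h_count;
cudaMemcpyAsync(&h_count, d_count, sizeof(int),
                cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);              // pipeline stall we want to avoid
int grid = (h_count + 255) / 256;           // grid size known only after sync
myKernel<<<grid, 256, 0, stream>>>(d_data, h_count);
```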
As far as I understand, `cudaLaunchHostFunc` is useless for this purpose, since it is forbidden to call any CUDA API function from within the enqueued host function — which makes sense, as it preserves a deterministic order of operations.
The only way I can see to accomplish something close to this would be to resort to dynamic parallelism: launching a tiny kernel from the host that reads the parameter and then launches the desired kernel from the device with the appropriate grid size. But this seems rather inelegant, not to mention that we would pay the kernel launch overhead twice (plus any potential performance losses from using dynamic parallelism overall).
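In case it helps to be concrete, this is roughly the workaround I have in mind (compiled with `-rdc=true`; `myKernel` and the pointer names are placeholders):

```cuda
// Device-side "trampoline": a single thread reads the size computed by
// earlier kernels and launches the real kernel with the right grid.
__global__ void launcher(const int* d_count, float* d_data) {
    int n = *d_count;                         // parameter already in GPU memory
    int grid = (n + 255) / 256;
    myKernel<<<grid, 256>>>(d_data, n);       // device-side launch
}

// Host side: only this fixed-size launch is needed, no sync or copy-back.
// launcher<<<1, 1, 0, stream>>>(d_count, d_data);
```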
Given that we are dealing with a problem whose size can go up to 2^22 = 4194304, but whose typical values are probably 20 to 40 times smaller than that, launching enough threads to cover the worst-case scenario seems to be a rather wasteful alternative.
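To be explicit, the wasteful alternative would look like this (again with illustrative names):

```cuda
// Always launch for the worst case; excess threads exit early.
__global__ void myKernel(float* d_data, const int* d_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= *d_count) return;   // at typical sizes, ~95% of threads do nothing
    // ... actual work on d_data[i] ...
}

// Host side: grid sized for the 2^22 maximum regardless of the actual size.
// int grid = ((1 << 22) + 255) / 256;
// myKernel<<<grid, 256, 0, stream>>>(d_data, d_count);
```

Most of the oversized blocks would retire almost immediately, but they still have to be scheduled, which is the overhead I would like to avoid.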
I am posting here in the hope that I have overlooked some more obscure or arcane part of the API that would provide the necessary functionality.
Thank you for taking the time to read this, and, if that is the case, for answering.