CDP2 with dynamic grid size of the next child kernel

I’m trying to refactor my older code from CDP1 to CDP2 with streams (cudaStreamTailLaunch). In my old code, the first child kernel’s result was used to set the grid size for the second child kernel. Can you suggest a strategy I can use with CDP2 to achieve similar results?

Andrey

This may be of interest.

One possible approach would be to use stream ordering of the kernels (just as you have) and formulate the second kernel with a grid-stride loop that takes its limit from a value in global memory deposited by the first kernel.
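Roughly like this (untested sketch; firstChild, secondChild, and d_workCount are made-up names, and it assumes compilation with -rdc=true):

```
__global__ void firstChild(int *d_workCount)
{
    // ... real work ...
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *d_workCount = 123456;           // result that determines how much work kernel 2 has
}

__global__ void secondChild(const int *d_workCount, float *data)
{
    int n = *d_workCount;                // read at run time, after firstChild has completed
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)    // grid-stride loop covers n items regardless of grid size
        data[i] *= 2.0f;                 // placeholder work
}

__global__ void parent(int *d_workCount, float *data)
{
    // Both children go into the same (tail launch) stream, so they execute in launch order.
    firstChild<<<32, 256, 0, cudaStreamTailLaunch>>>(d_workCount);
    // The grid size here is just a fixed upper bound; the actual work size is the loop limit.
    secondChild<<<64, 256, 0, cudaStreamTailLaunch>>>(d_workCount, data);
}
```

The point is that the second kernel’s launch configuration no longer depends on the first kernel’s result; only the loop limit does, and that value is read after the first kernel has finished.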

I would call it similar. Of course it does not result in identical behavior.

Since the parent kernel is launched with a <<<1,1>>> execution configuration (something I would basically never recommend unless performance is irrelevant), I would probably also at least consider more significant refactoring:

  • Call both child kernels from a host thread. Then you can still do cudaDeviceSynchronize() if you want to. Yes, I understand that still leaves the second kernel’s grid-sizing data in device memory; you can either copy it back to the host or use the grid-stride loop method already suggested.
  • Merge the kernels, perhaps using a grid-stride loop in the host-launched merged kernel along with a cooperative groups grid sync (see the sketch after this list).
  • Launch the first kernel from host code, then have it do a tail launch (or even an ordinary or fire-and-forget launch) of the second kernel once the grid-sizing data has been arrived at, perhaps via a threadblock-draining approach if needed.
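For the second option, a rough sketch might look like this (untested; mergedKernel, d_workCount, and data are made-up names, and the kernel must be launched with cudaLaunchCooperativeKernel using a grid that can be co-resident on the GPU):

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void mergedKernel(int *d_workCount, float *data)
{
    cg::grid_group grid = cg::this_grid();

    // phase 1: former first child
    if (grid.thread_rank() == 0)
        *d_workCount = 123456;                       // result that sizes phase 2

    grid.sync();                                     // grid-wide barrier: phase-1 result visible to all blocks

    // phase 2: former second child, grid-stride loop sized by the phase-1 result
    unsigned long long n = (unsigned long long)(*d_workCount);
    for (unsigned long long i = grid.thread_rank(); i < n; i += grid.size())
        data[i] *= 2.0f;                             // placeholder work
}
```

The grid-wide sync removes the need for two separate launches entirely.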

Robert,

Thank you for your quick response and for the links you provided. When I use <<<1,1>>>, I’m just trying to keep the host code to a bare minimum. After I debug kernels with Nsight, I use them from a large Java program through JCuda, and from Java I do not have the luxury of debugging the kernel code. With an earlier version of CUDA I had a problem ( Dynamic SM with Dynamic Parallelism ), so I had to write longer host code duplicating functionality from the C++ in the Java. So my additional reason for using DP is to encapsulate the sequence of parallel jobs (with each job’s parameters depending on the previous results) in a single host-side kernel call, making it universal for both C++ and Java host code.

You mentioned the poor performance of <<<1,1>>> - does that concern apply in my case, where I’m trying to remove the dependence on host code? The <<<1,1>>> launch just replaces the host thread; the parent kernel does not do any calculations itself, it only organizes the sequence of dependent kernel launches (dependent not only in data but also in grid dimensions, as in this example) that perform the actual work.

I do not understand what a “grid-stride loop” is. During refactoring I realized that the grid dimensions of the second kernel, even if they reference global memory variables, are evaluated immediately in the parent grid, not after the first child has calculated those global memory variables (unlike kernel function arguments, which can be pointers that the child dereferences only once it runs).
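For example (made-up names, just to show what I mean), the following would not do what I want, because the launch configuration is read by the parent thread at the launch statement, before the first child has run:

```
__global__ void firstChild(int *d_gridSize);   // computes and stores the grid size
__global__ void secondChild(float *data);

__global__ void parent(int *d_gridSize, float *data)
{
    firstChild<<<32, 256, 0, cudaStreamTailLaunch>>>(d_gridSize);

    // *d_gridSize is dereferenced HERE, in the parent, before firstChild has executed,
    // so the second launch uses a stale value.
    secondChild<<<*d_gridSize, 256, 0, cudaStreamTailLaunch>>>(data);
}
```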

I am going to add one extra level of kernels: the second child will be <<<1,1>>> and will receive a pointer to the global memory variable calculated by the first child. This second child (launched into cudaStreamTailLaunch) will see valid data (read from global memory after the first child is complete), use it to determine the grid dimensions of the actual kernel (the former second child), and launch it as a grandchild (with fire-and-forget, as it will be the only one). Does it make sense?
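In code, what I have in mind is roughly this (untested sketch with illustrative names; firstChild is as before):

```
__global__ void firstChild(int *d_workCount);        // stores the count that sizes the real work

__global__ void workerKernel(float *data, int n)     // the former second child
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                             // placeholder work
}

__global__ void launcherChild(const int *d_workCount, float *data)
{
    int n = *d_workCount;                            // valid: the first child has already completed
    int block = 256;
    int grid  = (n + block - 1) / block;             // grid size computed on the device
    if (grid > 0)
        // Only one grandchild, so fire-and-forget ordering is sufficient.
        workerKernel<<<grid, block, 0, cudaStreamFireAndForget>>>(data, n);
}

__global__ void parent(int *d_workCount, float *data)
{
    firstChild<<<32, 256, 0, cudaStreamTailLaunch>>>(d_workCount);
    launcherChild<<<1, 1, 0, cudaStreamTailLaunch>>>(d_workCount, data);
}
```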

Andrey

In the very first link I provided, the same question came up and is answered.

Yes, correct - even with the grid-stride loop. That is why I offered it as possibly something similar (a word you used first), while pointing out that it is not identical behavior.

I have no experience with JCuda. A kernel launch has overhead/cost, whether you do it from host or device code. Whether that cost is significant depends on the rest of the code, both host and device. If the cost is insignificant, we need not worry about it too much. If the cost is significant, then adding extra kernel launches (say, two where only one is really needed) is not a good idea.

Robert, thank you. Yes, I missed “grid-stride loop” in the first link; now I realize I tried this approach (unrelated to DP) without knowing its name.

I would recommend JCuda ( GitHub - jcuda/jcuda: JCuda - Java bindings for CUDA ); it has worked nicely for me and allowed me to incorporate CUDA into the Java host code without any extra overhead compared to C++ (the only inconvenience is the need to debug with separate C++ host code). JCuda now supports CUDA 12.6, so I have started refactoring my old (tested) code and plan to use this approach with CDP2 for new development. I got excellent support from the JCuda maintainer, Marco Hutter.

The overall execution times of most of my kernels that iterate over large images (such as the complex lapped transform, aberration correction, and frequency-domain phase correlations) are in the range of tens of milliseconds, so the overhead of non-repetitive kernel launches is insignificant in this case.

Andrey