Orchestrate blocks on CPU during GPU kernel run

Question
In the case of a cooperative kernel that is running (many iteration cycles with grid sync), will sending new data to the GPU cause any problems? I need to orchestrate some work on the CPU during kernel execution.
I suppose that to make this work I would need some kind of asynchronous memory access, but I do not intend to allocate memory, just modify some values in it.

To clarify: the CPU will be solely responsible for writing the data and the GPU for reading it, or vice versa, so there should be no data race between the two.

Context
- I have a couple hundred data blocks.
- Each data block is sized so that one block of threads can process it.
- Only some data blocks need processing in a given iteration.
- After all active data blocks are processed, I need to synchronize the grid (grid sync from cooperative groups).
- Then I work on the new set of currently active data blocks.
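
For illustration, the kernel structure I have in mind is roughly this (a sketch, not my actual code; `iterate` and its parameter are made up):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Rough shape of the iteration loop described above; the kernel must be
// launched with cudaLaunchCooperativeKernel for grid.sync() to be valid.
__global__ void iterate(int iterations)
{
    cg::grid_group grid = cg::this_grid();
    for (int it = 0; it < iterations; ++it) {
        // ... each thread block processes one currently active data block ...
        grid.sync(); // all active data blocks done before picking the next set
    }
}
```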

Now, which data blocks will be active for processing is semi-random, so a thread block would need to check whether a data block is active before proceeding: some block of threads would have to scan through the data looking for an active data block, then use atomics to mark it as currently under processing so that other blocks do not duplicate work…
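
For illustration, the device-side claiming would be something like this (a rough sketch; `claim_active_block`, `active_flags`, and the flag encoding are all made up, and one thread per block would call it and broadcast the result via shared memory):

```cuda
// Each thread block scans the activity flags and uses atomicCAS to claim
// one active data block, so no two thread blocks duplicate the same work.
// Flag encoding (illustrative): 1 = active and unclaimed, 2 = claimed.
__device__ int claim_active_block(int *active_flags, int num_data_blocks)
{
    for (int i = 0; i < num_data_blocks; ++i) {
        // atomicCAS guarantees exactly one thread block wins each data block
        if (active_flags[i] == 1 &&
            atomicCAS(&active_flags[i], 1, 2) == 1)
            return i;          // this thread block now owns data block i
    }
    return -1;                 // nothing left to claim this iteration
}
```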

This seems wasteful, and the task of distributing work across blocks seems better suited to the CPU: I would just have a CPU-managed list of data to work on associated with each block of threads.

Yet I wonder whether, in the case of a running cooperative kernel, messing with its data and sending new data to the GPU while it is working will cause any problems.

Answer

It should be possible: sending new data to a kernel while the kernel is running can work. Since everything is asynchronous, you’ll need an appropriate handshake mechanism, perhaps with double-buffering, to enable reliable communication between host and device.

This probably has performance implications, and I personally wouldn’t bother trying to make it work on a Windows WDDM GPU. A possible approach might look like this:

  1. Create two buffers in pinned memory. Have a handshake/mailbox for each buffer (also in pinned memory) that indicates the state of the buffer.
  2. In your kernel, have the buffer processing code guarded by grid sync. Within the guarded region, have one thread block act as the master that consumes the buffer and updates the mailbox. Ping-pong between the two buffers so that the CPU can be updating one buffer while the other is being accessed by kernel code.
  3. Outside of the guarded region in the kernel, put your data processing code, which uses the buffer “acquired” during the guarded region.
  4. Depending on how you consume the data, some of it may need to be marked “volatile”, particularly the mailbox.
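
Putting those pieces together, here is a minimal sketch of the handshake, assuming mapped pinned memory on 64-bit Linux or Windows TCC (so host pointers are directly usable in the kernel); everything here (`Mailbox`, `worker`, the flag encoding, the grid dimensions) is illustrative, not a complete or tuned program:

```cuda
#include <cooperative_groups.h>
#include <atomic>
namespace cg = cooperative_groups;

enum { EMPTY = 0, FILLED = 1 };            // mailbox states (illustrative)

struct Mailbox { volatile int state[2]; }; // one flag per buffer, in pinned memory

__global__ void worker(Mailbox *mb, int *buf0, int *buf1,
                       int buf_len, int iterations)
{
    cg::grid_group grid = cg::this_grid();
    for (int it = 0; it < iterations; ++it) {
        int slot = it & 1;                        // ping-pong between the buffers
        int *buf = slot ? buf1 : buf0;
        // ---- guarded region: grid-wide thread 0 acquires the buffer ----
        if (grid.thread_rank() == 0) {
            while (mb->state[slot] != FILLED) { } // spin on host-written flag
            __threadfence_system();               // order flag read before data use
        }
        grid.sync();                   // whole grid now sees the acquired buffer
        // ---- data processing using buf[0..buf_len) goes here ----
        grid.sync();                   // whole grid is done with this slot
        if (grid.thread_rank() == 0) {
            __threadfence_system();
            mb->state[slot] = EMPTY;   // hand the slot back to the host
        }
    }
}

int main()
{
    const int BUF_LEN = 1 << 16, ITERS = 100;
    Mailbox *mb; int *buf[2];
    cudaHostAlloc(&mb, sizeof(Mailbox), cudaHostAllocMapped);
    mb->state[0] = mb->state[1] = EMPTY;
    for (int s = 0; s < 2; ++s)
        cudaHostAlloc(&buf[s], BUF_LEN * sizeof(int), cudaHostAllocMapped);

    // grid.sync() requires a cooperative launch, and the grid must fit
    // co-resident on the device (check occupancy before choosing sizes)
    dim3 gridDim(64), blockDim(256);
    int buf_len = BUF_LEN, iters = ITERS;
    void *args[] = { &mb, &buf[0], &buf[1], &buf_len, &iters };
    cudaLaunchCooperativeKernel((void *)worker, gridDim, blockDim, args);

    for (int it = 0; it < ITERS; ++it) {
        int slot = it & 1;
        while (mb->state[slot] != EMPTY) { }   // wait for the device to drain it
        // ... write the next work list into buf[slot] here ...
        std::atomic_thread_fence(std::memory_order_seq_cst); // publish data first,
        mb->state[slot] = FILLED;                            // then flip the flag
    }
    cudaDeviceSynchronize();
    return 0;
}
```

Note that the host only ever writes a buffer while its mailbox flag says EMPTY, and the device only reads it while the flag says FILLED, which is exactly the “CPU writes, GPU reads” split described in the question.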

You can possibly get some ideas from this, although that example asynchronously communicates status in the other direction, from device to host; many of the ideas are applicable to either direction of communication.
