flushing streams to forward issue conditional kernel launches

hello,

perhaps i am the only one, but i certainly think that an api function that would permit flushing a stream would be useful

sometimes, some of the kernels to be launched are conditional on the results of preceding kernels
currently, the best method i can think of to forward issue such kernels is to launch them regardless, and have each kernel block evaluate an atomic to know whether it should indeed execute, or terminate prematurely
but this is wasteful in a sense - it would be a lot better if the device simply stopped scheduling the blocks of these kernels
if the device can record stream events, then surely it can evaluate a stream flush flag

i am unaware of such an api call
hence, i see the following options before me:
i) persuade txbob to persuade nvidia to introduce such an api call,
ii) persuade njuffa to write such an api call,
iii) persuade txbob to persuade nvidia to persuade njuffa to write such an api call

looking forward to flushing my streams more resourcefully

Couldn’t you just use dynamic parallelism as your control?

Like, launch a single-threaded kernel that acts as a “main” loop on the GPU. Would that work? Something like:

// a device-side "main" loop; a __global__ function can't be named main or
// take argc/argv, so call it gpu_main (bpg/tpb = blocks per grid / threads per block)
__global__
void gpu_main()
{
    /* ... */

    first_kernel<<<bpg, tpb>>>();
    cudaDeviceSynchronize();   // wait for first_kernel's result before branching

    if (/* condition derived from first_kernel's output */)
    {
        kernel_a<<<bpg, tpb>>>();
    }
    else
    {
        kernel_b<<<bpg, tpb>>>();
    }
}
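
(Worth noting: device-side launches like this require a GPU of compute capability 3.5 or higher, and compiling with relocatable device code, e.g. nvcc -rdc=true, linked against cudadevrt.)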

i could; i could also use an atomic, and i doubt whether dp would outperform the atomic variant
also, consider the case where the kernels form an iterative loop that only breaks when a condition has been met; one could further argue that the kernels’ input parameters change with each iteration, such that intervention is required by the observer - the host, or a device-side main loop
dp seems clumsy when an iterative loop is present and both the ‘processor’ and ‘co-processor’ must participate
in fact, dp seems clumsy overall; i am beginning to think dp was primarily devised to bring recursion to gpus, and that recursion leans towards being a cpu-centric concept

i still would like to flush streams
it is like when you were small: you do not want to play with the blue toy truck, you want to play with the red toy truck

Do you have any code examples for what you’re trying to do?

@little_jimmy: I think your plan of attack needs some adjustments :-)

(a) I am not currently an NVIDIA employee, and haven’t been for a while.
(b) The one condition I had before joining the CUDA project in 2005 was “no driver work”.

That said, you should feel free to file an enhancement request with NVIDIA, through the bug reporting form linked from the registered developer website. Prefix the synopsis with “RFE:” to mark it as an enhancement request. There are absolutely no guarantees that a particular RFE will result in a new CUDA feature, but with near certainty a desired feature will not materialize in the absence of an RFE.

perhaps

but i would have to spend a few minutes to distill it into something more of an elementary example; otherwise you would have to peruse ‘vast’ lines of code
perhaps i should rather pin something down with pseudo code

the principle is simple, though
without conditionality present, the host can minimize delays and latency by forward issuing work, such that the device always has something in its ‘pipeline’ to draw from
(the host may further aid the device by pre-processing some of the data as part of forward issuing work, and this should be kept in mind as well)
with conditionality present, the host can still forward issue work, for the same reasons, provided that it can inform the device of redundant work once the uncertainty is resolved; and i would think that flushing streams is a better way to inform the device than atomics or dp
the cpu uses much the same principle with speculative execution past predicted branches - it issues work before the outcome is known, and simply flushes what is redundant once the uncertainty is resolved
if a stream can be flushed, one can start to issue conditional work - at the point of issue, it may not be clear whether the issued kernel is necessary or redundant; that is only resolved at a later point
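
to pin the idea down, a rough sketch of the atomic workaround for a single branch: speculatively issue both sides, and let a tiny ‘decider’ kernel squash the side not taken once the condition is known (all names here - decide, kernel_a, kernel_b, d_result, d_abort_a, d_abort_b - are mere placeholders):

// one thread inspects a prior kernel's result and squashes the branch not taken
__global__ void decide(const float *d_result, int *d_abort_a, int *d_abort_b)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        int take_a = (d_result[0] > 0.0f);   // stand-in for the real condition
        *d_abort_a = !take_a;
        *d_abort_b = take_a;
    }
}

// host side: both branches are forward issued before the condition resolves;
// kernel_a and kernel_b each begin with an early-exit check of their own flag
decide<<<1, 1, 0, stream>>>(d_result, d_abort_a, d_abort_b);
kernel_a<<<bpg, tpb, 0, stream>>>(d_abort_a /*, ... */);
kernel_b<<<bpg, tpb, 0, stream>>>(d_abort_b /*, ... */);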

good heavens, these posts should have an incoming flag; i now missed njuffa’s all-important post

(a) I am not currently an NVIDIA employee, and haven’t been for a while.

and…? you are still the cuda oracle

(b) The one condition I had before joining the CUDA project in 2005 was “no driver work”.

“no driver work” does not imply you cannot do driver work

and i am not much one for paperwork
my eyes dim and my hands shake when presented with red tape and forms to fill in

"Do you have any code examples for what you’re trying to do? "

optimization problems that generally tend to drift, gravitate - or even crawl - towards an optimal solution are an excellent example of a use case
consider the elementary case of 3 kernels - A, B, C - that must be iteratively executed 10 times
assume kernel B accepts an input array with unique data for each of its kernel blocks - each block would access (x entries of) the said array via its block id, and the data is used to guide the kernel’s execution; the host is responsible for preparing this array
the host can forward issue the above, such that the kernels of at least 1 iteration are always in the device’s ‘pipeline’

the oversimplified pseudo code:

// toggle flips between 0 and 1 each iteration; toggle_inv is its complement
// the event pair keeps the host exactly one iteration ahead of the device

toggle = 0;
cnt = 0;
total = 10;

while (cnt < total)
{
   issue kernel_A();

   prepare kernel_B();              // host readies B's per-block input array

   issue kernel_B();

   issue kernel_C();

   record event[toggle];            // marks the end of this iteration's issues

   if (cnt > 0)
   {
      synchronize event[toggle_inv];   // wait for the previous iteration's work
   }

   cnt++;
   flip(toggle);
}
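
in actual cuda host code, ‘record event’ and ‘synchronize event’ map onto the event api - a minimal sketch, with the kernels, launch configs and B’s input preparation left as placeholders:

cudaStream_t stream;
cudaEvent_t  event[2];
cudaStreamCreate(&stream);
cudaEventCreateWithFlags(&event[0], cudaEventDisableTiming);
cudaEventCreateWithFlags(&event[1], cudaEventDisableTiming);

int toggle = 0;
const int total = 10;
for (int cnt = 0; cnt < total; cnt++)
{
    kernel_A<<<bpg, tpb, 0, stream>>>(/* ... */);

    prepare_B_input();                       // host-side preparation

    kernel_B<<<bpg, tpb, 0, stream>>>(/* ... */);
    kernel_C<<<bpg, tpb, 0, stream>>>(/* ... */);

    cudaEventRecord(event[toggle], stream);

    if (cnt > 0)
        cudaEventSynchronize(event[1 - toggle]);   // stay one iteration ahead

    toggle = 1 - toggle;
}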

now assume the number of iterations is conditional
assume all 3 kernels share a common global memory array, such that all 3 can communicate with each other
kernel C generally knows when the loop is done, and can inform the other kernels via an atomic and the shared global memory
the host can still forward issue this, by ‘eavesdropping’ on the atomic:

// kernel C sets d_done in device memory once the stopping condition is met

toggle = 0;
cnt = 0;
h_done = 0;
clear d_done;

while (h_done == 0)
{
   issue kernel_A();

   prepare kernel_B();

   issue kernel_B();

   issue kernel_C();

   async_copy(h_done <- d_done);    // 'eavesdrop': snapshot the device-side flag

   record event[toggle];

   if (cnt > 0)
   {
      synchronize event[toggle_inv];   // h_done now reflects the previous iteration
   }

   cnt++;
   flip(toggle);
}
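
one caveat with the ‘eavesdropping’ copy: it is only truly asynchronous if h_done sits in pinned host memory - a sketch of just the flag plumbing (variable names as above):

int *h_done, *d_done;
cudaHostAlloc(&h_done, sizeof(int), cudaHostAllocDefault);   // pinned allocation
cudaMalloc(&d_done, sizeof(int));
cudaMemset(d_done, 0, sizeof(int));
*h_done = 0;

// inside the loop, after issuing kernel_C:
cudaMemcpyAsync(h_done, d_done, sizeof(int), cudaMemcpyDeviceToHost, stream);
// h_done is safe to read once the corresponding event has been synchronized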

each kernel would evaluate the atomic, to know whether it should terminate prematurely or actually execute
this works; it is only that each and every kernel block needs to evaluate the atomic
if the number of kernel blocks is large, this becomes a burden
it would be much cheaper if the host could simply inform the device to flush the stream(s)
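
for completeness, this is what ‘each kernel block evaluates the atomic’ amounts to - every block of a possibly large grid still gets scheduled, and each pays a global memory read before (possibly) retiring; d_done is the flag kernel C sets:

__global__ void kernel_A(const int *d_done /*, actual parameters */)
{
    // every block reads the flag from global memory; a redundant block
    // retires immediately, but the device still had to schedule it
    if (*d_done)
        return;

    /* the real work of this iteration */
}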