Concurrency

I have two kernels, A and B, and I want them to execute concurrently. In addition, the execution of kernel A is controlled by the return value of B: if B returns 0, A should be terminated immediately.
I was wondering whether this could be realized with the streams introduced on p. 34 of NVIDIA_CUDA_Programming_Guide_2.3. If so, how could I realize the communication between kernels A and B, so that A is terminated when B returns 0?
Thanks a lot!

You can probably only do it with a global GPU variable. A would still have to execute, but it would read the variable set by B and do nothing if the termination condition is met.
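
A minimal sketch of that pattern, assuming B runs before A (for example, both launched in the same stream); the names stop_flag, kernelA, and kernelB are made up for illustration:

    __device__ int stop_flag = 0;            // 0 = keep running, 1 = terminate A

    __global__ void kernelB(const int *input)
    {
        // ... B's real work would go here; pretend its "return value" is input[0] ...
        if (input[0] == 0)
            stop_flag = 1;                   // signal A to stop
    }

    __global__ void kernelA(float *data, int n)
    {
        if (stop_flag)                       // read the variable set by B
            return;                          // do nothing if termination was requested
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;                 // ... A's real work ...
    }

Launched back to back in the same stream, B completes before A starts, so A sees the flag without any extra synchronization on the host side.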

I have a median/SelectNth code that needs an unknown number of kernel launches. For now, I just do a simple read back of the result after each iteration and stop if the termination condition has been met. The synchronization isn't good for performance because it can drain the GPU work queue, but this ~10^-5 s overhead hasn't been a problem for me, since my median/SelectNth problem only needs a few iterations - once the input gets reduced to some cut-off size, I just sort the numbers and finish.
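
A sketch of that host loop, using a hypothetical selectStep kernel and d_done device flag (not the real code):

    // Hypothetical kernel: does one selection step and sets *done = 1 when finished.
    __global__ void selectStep(float *data, int n, int *done);

    void runSelect(float *d_data, int n)
    {
        int *d_done, h_done = 0;
        cudaMalloc(&d_done, sizeof(int));
        cudaMemset(d_done, 0, sizeof(int));

        while (!h_done) {
            selectStep<<<64, 256>>>(d_data, n, d_done);
            // read the flag back after each iteration; this synchronizes with the GPU
            cudaMemcpy(&h_done, d_done, sizeof(int), cudaMemcpyDeviceToHost);
        }
        cudaFree(d_done);
    }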

I had some ideas of how to reduce synchronization overhead:

  1. Use the speculative execution method you describe - issue multiple kernel launches between each synchronization and have the surplus launches exit early once the termination condition is met (see the sketch after this list).

  2. Use the Tesla Compute Cluster (TCC) driver, which has a lower kernel launch overhead - I haven't tried it.

  3. Wait for an x86 or ARM core on the GPU, allowing faster synchronization and cooperation. No more silly kernels just to initialize a GPU variable or do a final conversion - just do it on the low-latency CPU, which can directly access GPU memory.
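
A rough sketch of idea 1, reusing the hypothetical selectStep kernel and d_done flag from above; it assumes the kernel checks the flag itself and returns immediately once the work is done, so the surplus speculative launches cost only their launch overhead:

    void runSelectSpeculative(float *d_data, int n)
    {
        const int BATCH = 4;                 // speculative launches per readback
        int *d_done, h_done = 0;
        cudaMalloc(&d_done, sizeof(int));
        cudaMemset(d_done, 0, sizeof(int));

        while (!h_done) {
            // issue several launches back to back ...
            for (int i = 0; i < BATCH; ++i)
                selectStep<<<64, 256>>>(d_data, n, d_done);
            // ... and do only one synchronizing read back per batch
            cudaMemcpy(&h_done, d_done, sizeof(int), cudaMemcpyDeviceToHost);
        }
        cudaFree(d_done);
    }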

Thanks a lot. I will try!
