Kernel communication and concurrency

Hi,

I’m working with two kernels that cooperatively feed each other data to process until the computation is done.

Current state: one kernel busy-waits for the other, which never gets onto the processors = deadlock.

Question: is there any way to guarantee fair access of both kernels to the processors (a blocking call or something similar)?

Thanks in advance!

Code sample:

//-- cpu side

kernelA<<<grid, threads, 0, streamA>>>();
kernelB<<<grid, threads, 0, streamB>>>();

//-- gpu side

__device__ int global_IsFinished = 0; //-- shared termination flag

__global__ void kernelA(){
   int tid = computeLinearIndex();
   while( atomicAdd( &global_IsFinished, 0 ) == 0 ){ //-- while computation not finished (atomic read)
      if( global_inputA[tid].isReady() ){ //-- check if B produced something for A to process
         // ... //-- do stuff with inputA
         global_inputB[tid].updateInput(foo); //-- send something for B to process
         if( /*check if computation ends*/ )
            atomicAdd( &global_IsFinished, 1 );
      } //-- end if
   } //-- end while
} //-- end kernelA

__global__ void kernelB(){
   int tid = computeLinearIndex();
   while( atomicAdd( &global_IsFinished, 0 ) == 0 ){ //-- while computation not finished (atomic read)
      if( global_inputB[tid].isReady() ){ //-- check if A produced something for B to process
         // ... //-- do stuff with inputB
         global_inputA[tid].updateInput(foo); //-- send something for A to process
         if( /*check if computation ends*/ )
            atomicAdd( &global_IsFinished, 1 );
      } //-- end if
   } //-- end while
} //-- end kernelB

Note: if I move the while loop to the CPU and re-launch the kernels until the work is done, it works, but it is too costly (almost 2 orders of magnitude slower).

The processing flow looks pretty strange. Both kernels A and B send data to each other to process? Why not combine them into a single kernel?

Anyway, the way to avoid deadlock here is to ensure that blocks of both kernels can make forward progress, which amounts to reducing the grid variable to a small enough number of blocks for each kernel that both fit on the GPU at once, depending on the GPU and occupancy.
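For example, something like this (just a sketch; it assumes threads is a plain integer block size, and even then co-residency is a heuristic, not a hard guarantee from the scheduler):

int device = 0, numSMs = 0, blocksA = 0, blocksB = 0;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

//-- max resident blocks per SM for each kernel at this block size
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksA, kernelA, threads, 0);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksB, kernelB, threads, 0);

//-- give each kernel at most half the resident slots, so neither can
//-- occupy the whole machine and starve the other
kernelA<<<(blocksA * numSMs) / 2, threads, 0, streamA>>>();
kernelB<<<(blocksB * numSMs) / 2, threads, 0, streamB>>>();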

Alternatively, if your GPU supports dynamic parallelism, then move the work to a child kernel, and let the parent kernel be just a single block with the while loop looking for work from the other kernel. When it receives it, launch a child kernel to process it. That may not be better than the while loop on the host, however.
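A rough sketch of that layout (untested; it requires compute capability 3.5+ and compiling with -rdc=true, and workReadyForA / processBatch are made-up names standing in for your own work descriptor):

__device__ volatile int workReadyForA; //-- set by the B side when it has output for A
__device__ int global_IsFinished;

__global__ void processBatch(){
   //-- the actual per-element work formerly done inside kernelA's loop
}

__global__ void parentA(){ //-- launch as parentA<<<1, 1>>>()
   while( atomicAdd( &global_IsFinished, 0 ) == 0 ){
      if( workReadyForA ){
         workReadyForA = 0;
         processBatch<<<64, 256>>>(); //-- hand the work to a child grid
         cudaDeviceSynchronize();     //-- wait for the child before polling again
      }
   }
}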

It’s interesting that you can somehow compute a 2 orders of magnitude slowdown, when the previous state was deadlock.

Thanks, txbob:

  • kernels A and B do different stuff, so I’m using two to avoid divergence and improve parallelism within each warp…

  • agreed, my problem is to find out how to ensure that both make progress…

  • even kernels small enough to fit (less than 1 warp each) eventually deadlock (not sure why; the GPU may be scheduling both kernels to the same SM).

  • Yes, sorry, I’m running on a GTX 980 and already tried moving the loop to a “main” kernel instead of all the way back to the CPU, but it didn’t work :(

  • If my conclusion is right and the problem is starvation, then by going back to the CPU instead of busy-waiting, kernelA ends and kernelB can make progress; so, the next time the CPU launches them, A can make progress, and so on…

And look at the trouble that has gotten you into. My point is that unless you know this to be a critical issue (i.e. dominating performance), it may be a misguided attempt at optimization.

Probably best if you can show a complete code. I’m not asking for your whole code, I’m asking for a simplified example that reproduces the deadlock, assuming that is the issue you are trying to address.

without subtracting from what txbob noted:

“if I move the while to the cpu, and re-launch the kernels until work is done, it works, but it is too costly”

why exactly is it costly?
and can’t you circumvent this?

“kernels small enough (less than 1 warp each) cause deadlock eventually (not sure why GPU may be scheduling both kernels to the same SM)”

i do not follow how scheduling both kernels (on the same SM) can result in deadlock…?
i am inclined to think that your definition, and thus implementation, of “global_IsFinished” is incorrect or incomplete
if A presumably starts the loop - getting the ball rolling - and B waits for A and passes to A, and A waits for B and passes to B, then A should be able to know/tell whether B can potentially still pass work to it; likewise for B
atomic work counters may be the answer: if A did not pass new work to B, and the work counter of B is on 0, meaning that B has no work left, then B cannot pass work to A anymore, and A (as well as B) is done and can terminate
the same applies in the case of B: if B did not pass work to A, and the work counter of A is on 0, meaning A has no work left and can thus no longer pass work to B, then B is finished and can terminate
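something like this, perhaps (a sketch only; names are made up, and it assumes one consumer thread per slot, as in your tid-indexed arrays):

__device__ int workForA; //-- items outstanding for A
__device__ int workForB; //-- items outstanding for B

__global__ void kernelA_counted(){
   while( true ){
      if( atomicAdd( &workForA, 0 ) > 0 ){ //-- atomic read: work waiting for A?
         //-- ... consume one item destined for A ...
         atomicAdd( &workForB, 1 ); //-- first hand any follow-on work to B,
         atomicSub( &workForA, 1 ); //-- then retire our own item
      }
      else if( atomicAdd( &workForB, 0 ) == 0 ){
         break; //-- no work for A, and B holds nothing that could produce more
      }
   }
}

note the ordering: B’s counter is incremented before A’s own is decremented, so the sum of the two counters only drops to zero when no work is in flight on either side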

and you can probably use atomics and dp too, to avoid busy-waiting - B can launch A, if it detects that A has finished and has terminated; and vice versa

Thanks for the comments, guys, but I know my problem is not well suited for GPU, and a single kernel would not help (deadlock could still occur due to divergence).
The whole point is to find the best way to implement it on the GPU and, for that, I need the right synchronization tools, namely: blocking calls. Something that is guaranteed to replace the warp that called it with another warp.
Do you guys know of something that can do that?

Yes, not being able to use most of the potential parallelism is critical :/

There are a lot of details but, the important point: I have a number of different tasks that should execute in parallel and need to communicate to make progress.

It is costly because communication between CPU and GPU is slow and launching new kernels is an overhead I would prefer to avoid.
I can circumvent this with a call that blocks the kernel, instead of using the busy-waiting approach. Do you know of such a call in CUDA?

Because of the SIMT part of the GPU architecture, only threads on the same instruction can execute at the same time under a given scheduler.
Example:

while( atomicCAS( &semaphore, 0, 1 ) != 0 ){ //-- try to acquire the resource
   //-- do nothing: busy wait
}
//-- got the resource
// ... //-- do something
atomicExch( &semaphore, 0 ); //-- release the resource

The test in the while divides the threads of a warp into two paths: the single thread that got the resource (FALSE path) and the rest (TRUE path).
If the scheduler chooses to execute the TRUE path first, those threads spin forever, waiting for the resource to be released by the one thread whose path was never scheduled to execute.
Yes, deadlock can happen with a single kernel :/

“I can circumvent this with a call that blocks the kernel”

or a device memory flag - a form of host/device atomic
the host can always forward-issue a batch of kernels, and copy the latest flag back (via an event), to know when to cease forward-issuing; kernels simply terminate when the flag is set
in essence, a form of kernel or stream loop
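roughly like this (a sketch; BATCH is an arbitrary depth, grid/threads/streams as in your original snippet, and the cross-stream ordering of the flag copy is glossed over):

const int BATCH = 8; //-- arbitrary forward-issue depth
int *h_done;
cudaHostAlloc(&h_done, sizeof(int), cudaHostAllocDefault); //-- pinned host flag
cudaEvent_t copied;
cudaEventCreateWithFlags(&copied, cudaEventDisableTiming);

*h_done = 0;
while( !*h_done ){
   for( int i = 0; i < BATCH; i++ ){ //-- forward-issue a batch of kernels
      kernelA<<<grid, threads, 0, streamA>>>();
      kernelB<<<grid, threads, 0, streamB>>>();
   }
   //-- queue a copy of the device flag behind the batch, and wait only for that
   cudaMemcpyFromSymbolAsync(h_done, global_IsFinished, sizeof(int), 0,
                             cudaMemcpyDeviceToHost, streamA);
   cudaEventRecord(copied, streamA);
   cudaEventSynchronize(copied);
}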

“The test in the while divides the threads of a warp into two paths: the single thread that got the resource (FALSE path) and the rest (TRUE path).
If the scheduler chooses to execute the TRUE path first, those threads spin forever, waiting for the resource to be released by the one thread whose path was never scheduled to execute.
Yes, deadlock can happen with a single kernel”

if deadlock occurs, and if it does because of the above reasons, it is through bad programming that violates core parallel programming principles
hence:
Yes, deadlock can happen with a single kernel, through bad programming; otherwise, not
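for instance, the spin lock above can be restructured so that the thread that acquires the lock releases it before reconvergence, and thus never waits on its divergent siblings (a sketch, untested):

__device__ int semaphore = 0;

__global__ void withLock(){
   bool done = false;
   while( !done ){ //-- every thread loops until it has had its turn
      if( atomicCAS( &semaphore, 0, 1 ) == 0 ){ //-- try to acquire
         // ... //-- critical section
         atomicExch( &semaphore, 0 ); //-- release inside the branch
         done = true;
      }
   }
}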

little_jimmy, your arrogance baffles me!

you are not here to help… nor learn… why are you here?

Let’s be polite. I don’t see any reason to question little_jimmy’s motives. Everyone is different and you may need to become familiar with the personal style of others.

If you don’t feel the responses of someone else are helpful, the simplest course of action may be to simply ignore them.

My apologies. :(

jimmy, you are not being helpful.
txbob, thanks for the attention.

tkx.

“jimmy, you are not being helpful.”

why, and how come…? because i disagreed with you…?

“txbob, thanks for the attention.”

“you are not here to help… nor learn… why are you here?”

on the contrary, i have just learned that, supposedly, deadlock is a natural phenomenon, and not an effect or outcome - when deadlock occurs, it is not because one placed it there or caused it (and thus can remove it just the same), it just happened
i have by now also learned that i am arrogant
hence, my purpose you inquire into, is to further observe, as i already feel (mildly) less illiterate

You probably want to think in different concepts than in the case of CPU-based parallelism.

As you have experienced, thread blocks cannot easily block (in the sense of waiting, no pun intended). So instead of a block waiting for its particular assigned data to become available, it should rather pick from whatever tasks are ready to execute.

This can even be extended to let a block dynamically choose whether it wants to execute kernelA or kernelB, unless they have (very) different resource requirements.
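A sketch of the idea (all names hypothetical; it ignores queue wrap-around and assumes MAX_TASKS is large enough to hold every task ever produced):

#define MAX_TASKS 65536

__device__ int taskData[MAX_TASKS];  //-- task descriptors
__device__ int slotReady[MAX_TASKS]; //-- 0 = empty, 1 = filled
__device__ int qTail;                //-- producers reserve slots here
__device__ int qHead;                //-- consumers claim slots here

__device__ void pushReady(int task){
   int slot = atomicAdd( &qTail, 1 ); //-- reserve a unique slot
   taskData[slot] = task;
   __threadfence();                   //-- publish the data before the flag
   atomicExch( &slotReady[slot], 1 );
}

__global__ void worker(){
   while( atomicAdd( &global_IsFinished, 0 ) == 0 ){
      int slot = atomicAdd( &qHead, 1 ); //-- claim the next slot, filled or not
      while( atomicAdd( &slotReady[slot], 0 ) == 0 ){
         if( atomicAdd( &global_IsFinished, 0 ) ) return; //-- bail on shutdown
      }
      //-- ... execute taskData[slot]; pushReady() any follow-on work ...
   }
}

A claimed slot is guaranteed to be filled eventually (or the finished flag breaks the wait), so the spin here is bounded - unlike spinning on one fixed mailbox.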

Still, from the code I see so far, it seems to me it would be easier to perform both A and B within the same thread block - they can still be assigned to different warps if you wish. Synchronisation between warps of the same block is much easier, as they are guaranteed to exist and start running at the same time, and there are dedicated directives for synchronising them.
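For illustration, a minimal sketch of that layout (hypothetical names; launch with at least 64 threads per block, and NUM_STEPS stands in for your real termination condition):

#define NUM_STEPS 1000 //-- stand-in for the real termination condition

__global__ void kernelAB(){ //-- e.g. kernelAB<<<grid, 64>>>()
   __shared__ float boxAtoB[32]; //-- A's output, B's input
   __shared__ float boxBtoA[32]; //-- B's output, A's input
   int warp = threadIdx.x / 32;
   int lane = threadIdx.x % 32;

   for( int step = 0; step < NUM_STEPS; step++ ){
      if( warp == 0 ){
         //-- A: read boxBtoA[lane], write boxAtoB[lane]
      }
      else if( warp == 1 ){
         //-- B: read boxAtoB[lane], write boxBtoA[lane]
      }
      __syncthreads(); //-- both warps now see each other's output
   }
}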