Global sync barrier problem: Xiao, Feng global barrier isn't working as expected

Kia_Morot · March 7, 2012, 11:02am

Hello everyone,

I am having trouble using these global barriers (link). I have tried every other way I could find to synchronize without global barriers (e.g. implicit and explicit CPU synchronization), but the GPU time is worse than the CPU time. My input array currently has around 100,000 elements and is very likely to grow in the future (it is a research algorithm). Each element depends on its neighbours, as the following expression shows:

B[i] = A[i-1] + A[i] + A[i+1]

Here is a small code snippet to give an idea of what exactly I am trying to do on the GPU.

//CPU code snippet
//Boundary conditions
A[0] = A[1];
A = A;
//"times" is on the order of tens of millions
for (i=0;i<times;i++)
{
 for (j=0;j<size;j++)
 {
  B[j] = A[j-1] + A[j] + A[j+1];
 }
 //Efficient memory storage: swap the pointers instead of copying the data
 aux = A;
 A=B;
 B=aux;
}

Between iterations I need a global synchronization on the GPU in order to get the same result as the CPU code.

So my question is: has anybody used those GPU barriers, and did they work well?

Any other tricks to achieve the same purpose are welcome. I should say that the dependency must not be violated.

All the restrictions the global barrier introduces are satisfied (e.g. 1536 concurrent threads per SM).

P.S.: I am new to CUDA, so please excuse any conceptual mistakes I might make.
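For readers who have not seen this kind of barrier, the idea in the question can be sketched roughly as follows. This is not the code from the linked barrier, only a minimal atomic-counter barrier in the same spirit. The names `gridBarrier`, `stencilPersistent` and `runPersistent`, the clamped boundary handling, and the `volatile` buffers are all assumptions made for illustration. It can only work if every block of the grid is resident on the GPU at the same time, which is the restriction mentioned above, so the host helper refuses to launch more blocks than the occupancy query reports.

```cuda
#include <cuda_runtime.h>

// Hypothetical barrier: every block adds 1 to a counter that is never reset,
// then spins until the counter reaches `goal` (arrivals expected so far).
__device__ void gridBarrier(unsigned int* counter, unsigned int goal)
{
    __threadfence();                 // make this thread's writes visible device-wide
    __syncthreads();                 // the whole block has arrived
    if (threadIdx.x == 0) {
        atomicAdd(counter, 1u);
        while (atomicAdd(counter, 0u) < goal) { }   // spin until all blocks arrived
        __threadfence();
    }
    __syncthreads();                 // release the rest of the block
}

// One persistent kernel loops over all time steps; buffers are volatile so
// that values written by other blocks are re-read from memory on every step.
__global__ void stencilPersistent(volatile float* a, volatile float* b,
                                  int n, int steps, unsigned int* counter)
{
    volatile float* in = a;
    volatile float* out = b;
    for (int t = 0; t < steps; ++t) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x) {
            int l = (i > 0) ? i - 1 : 0;          // clamped ends: an assumption
            int r = (i < n - 1) ? i + 1 : n - 1;
            out[i] = in[l] + in[i] + in[r];
        }
        // The counter grows by gridDim.x per step; it would overflow for very
        // long runs, which this sketch does not handle.
        gridBarrier(counter, (unsigned int)(t + 1) * gridDim.x);
        volatile float* tmp = in; in = out; out = tmp;
    }
}

// Returns false if the kernel cannot be launched with all blocks resident.
// The result ends up in dA when `steps` is even, otherwise in dB.
bool runPersistent(float* dA, float* dB, int n, int steps, int threads = 256)
{
    int dev = 0, numSMs = 0, perSM = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&perSM, stencilPersistent, threads, 0);
    int maxResident = numSMs * perSM;
    int blocks = (n + threads - 1) / threads;
    if (blocks > maxResident) blocks = maxResident;   // grid-stride loop covers the rest
    if (blocks < 1) return false;

    unsigned int* dCounter = nullptr;
    cudaMalloc(&dCounter, sizeof(unsigned int));
    cudaMemset(dCounter, 0, sizeof(unsigned int));
    stencilPersistent<<<blocks, threads>>>(dA, dB, n, steps, dCounter);
    cudaError_t err = cudaDeviceSynchronize();
    cudaFree(dCounter);
    return err == cudaSuccess;
}
```

Forgetting either the fence before the arrival or the fresh (uncached) reads after the barrier is a common source of race conditions in barriers of this kind.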

tera · March 7, 2012, 12:14pm

Launch a new kernel for each iteration.

Measure how much time you lose in the kernel launch (compare the version with a kernel launch per iteration to one that just loops over iterations inside the kernel, possibly producing wrong results). Is it really worth trying the global synchronization without kernel launch? The relative overhead gets even smaller as the problem size increases.
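A minimal sketch of the launch-per-iteration baseline and how it could be timed is given here. The names `stencilStep` and `runPerIterationLaunch` are made up for illustration, and the clamped indices at the two ends are an assumption about the boundary handling. Kernels launched into the same stream run one after another, so each launch acts as the global barrier between time steps.

```cuda
#include <cuda_runtime.h>
#include <utility>

// One time step: out[i] = in[i-1] + in[i] + in[i+1].
// Indices are clamped at both ends (an assumption about the boundaries).
__global__ void stencilStep(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int l = (i > 0) ? i - 1 : 0;
    int r = (i < n - 1) ? i + 1 : n - 1;
    out[i] = in[l] + in[i] + in[r];
}

// Runs `steps` iterations with one kernel launch per iteration and returns
// the elapsed GPU time in milliseconds. The device pointers are swapped after
// every step, like A and B in the CPU code, so *dA holds the latest result.
float runPerIterationLaunch(float** dA, float** dB, int n, int steps,
                            int threads = 256)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int blocks = (n + threads - 1) / threads;
    cudaEventRecord(start);
    for (int t = 0; t < steps; ++t) {
        stencilStep<<<blocks, threads>>>(*dA, *dB, n);
        std::swap(*dA, *dB);          // no data is copied, only the pointers
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);       // wait until all launches have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Dividing the returned time by `steps` gives a rough per-iteration cost that can be compared against any in-kernel synchronization scheme.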

Having said that, your problem is almost ideal: Barring major hiccups, with simple round-robin scheduling of blocks the dependencies should almost always be fulfilled. You could build an array of “ready” flags in memory that indicate whether each block is finished yet (or a simplified version where just a counter is kept for finished blocks, forcing blocks to potentially spin if previous blocks haven’t finished yet). Each block could then check at the beginning whether its dependencies are fulfilled, and spin (or grab a different block to work on) if not. Instrument your kernel to see how many times blocks spin, and for how long. If the GPU block scheduler turns out not to work round-robin, build your own one using atomic operations. Once this works, go back to benchmarking: With all the extra bookkeeping, is your code actually faster than the original one that just launches a new kernel for each iteration?
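One possible reading of the counter idea above is sketched here; it is a variation, not necessarily what was meant. Each block keeps its own chunk of the array for all time steps and publishes how many steps it has completed, and a block only waits for its two neighbouring blocks instead of the whole grid. The names `stencilNeighbourSync`, `runNeighbourSync` and `progress` are hypothetical, the clamped boundaries are an assumption, and the scheme deadlocks unless all blocks are resident at once, so the host helper checks that first.

```cuda
#include <cuda_runtime.h>

// progress[b] = number of time steps block b has completed (starts at 0).
__global__ void stencilNeighbourSync(volatile float* a, volatile float* b,
                                     int n, int steps, volatile int* progress)
{
    const int blk = blockIdx.x;
    const int chunk = (n + gridDim.x - 1) / gridDim.x;
    const int begin = blk * chunk;
    const int end = (begin + chunk < n) ? begin + chunk : n;

    volatile float* in = a;
    volatile float* out = b;
    for (int t = 0; t < steps; ++t) {
        // Wait until both neighbours have finished step t-1. This covers the
        // values we are about to read and the values we are about to overwrite.
        if (threadIdx.x == 0) {
            if (blk > 0)             while (progress[blk - 1] < t) { }
            if (blk + 1 < gridDim.x) while (progress[blk + 1] < t) { }
            __threadfence();
        }
        __syncthreads();

        for (int i = begin + threadIdx.x; i < end; i += blockDim.x) {
            int l = (i > 0) ? i - 1 : 0;          // clamped ends: an assumption
            int r = (i < n - 1) ? i + 1 : n - 1;
            out[i] = in[l] + in[i] + in[r];
        }

        __threadfence();             // publish this thread's results
        __syncthreads();             // the whole chunk is written
        if (threadIdx.x == 0) progress[blk] = t + 1;

        volatile float* tmp = in; in = out; out = tmp;
    }
}

// Returns false if numBlocks blocks cannot all be resident at the same time.
// The result ends up in dA when `steps` is even, otherwise in dB.
bool runNeighbourSync(float* dA, float* dB, int n, int steps,
                      int numBlocks, int threads)
{
    int dev = 0, numSMs = 0, perSM = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&perSM, stencilNeighbourSync, threads, 0);
    if (numBlocks < 1 || numBlocks > numSMs * perSM) return false;

    int* dProgress = nullptr;
    cudaMalloc(&dProgress, numBlocks * sizeof(int));
    cudaMemset(dProgress, 0, numBlocks * sizeof(int));
    stencilNeighbourSync<<<numBlocks, threads>>>(dA, dB, n, steps, dProgress);
    cudaError_t err = cudaDeviceSynchronize();
    cudaFree(dProgress);
    return err == cudaSuccess;
}
```

Counting the iterations of the two `while` loops would be one way to do the instrumentation suggested above, before benchmarking against the plain launch-per-iteration version.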

EDIT: fix typo

pasoleatis · March 7, 2012, 2:02pm

Try the threadfence functions

tera · March 7, 2012, 3:13pm

How are threadfence functions going to help here?

pasoleatis · March 7, 2012, 3:57pm

Good point. Maybe the thread fence can help in telling which block is finished.

I got a little confused. I usually use i and j as indices of a 2D matrix. This is a 'time-dependent' problem in which each step depends on the previous one. The only way is to have a kernel call for each i. The A=B copy is avoided by passing two pointers and swapping them between launches, e.g. <<<>>>(A,B) and then <<<>>>(B,A). I also suggest using shared memory.
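A rough sketch of that suggestion: one launch per time step, with the two buffers alternating as input and output, and each block staging its tile plus one halo element on each side in shared memory. `TILE`, `stencilStepShared` and `runSharedSteps` are hypothetical names, and the clamped boundaries are an assumption. For a 3-point stencil the gain from shared memory may be small, so it is worth measuring.

```cuda
#include <cuda_runtime.h>

#define TILE 256   // threads per block; must match the launch configuration

// One time step using a shared-memory tile with a one-element halo per side.
__global__ void stencilStepShared(const float* in, float* out, int n)
{
    __shared__ float tile[TILE + 2];
    int i = blockIdx.x * TILE + threadIdx.x;   // global index
    int s = threadIdx.x + 1;                   // index inside the tile

    // Threads past the end load the last element, which also provides the
    // clamped right neighbour for element n-1 (clamping is an assumption).
    tile[s] = in[min(i, n - 1)];
    if (threadIdx.x == 0)
        tile[0] = in[min(max(i - 1, 0), n - 1)];
    if (threadIdx.x == TILE - 1)
        tile[TILE + 1] = in[min(i + 1, n - 1)];
    __syncthreads();

    if (i < n)
        out[i] = tile[s - 1] + tile[s] + tile[s + 1];
}

// Launches one kernel per step, alternating (A,B) and (B,A).
// Returns the device pointer that holds the final result.
float* runSharedSteps(float* dA, float* dB, int n, int steps)
{
    int blocks = (n + TILE - 1) / TILE;
    for (int t = 0; t < steps; ++t) {
        if (t % 2 == 0)
            stencilStepShared<<<blocks, TILE>>>(dA, dB, n);
        else
            stencilStepShared<<<blocks, TILE>>>(dB, dA, n);
    }
    cudaDeviceSynchronize();
    return (steps % 2 == 0) ? dA : dB;
}
```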

Kia_Morot · March 8, 2012, 10:05am

So far, in the best case I got a 2x speedup with the global barrier vs. launching a new kernel each iteration (with no memory transfers between iterations). By best case I mean having 2000 elements in the input array. However, I might have race conditions with the global barrier referenced in my first post.

Regarding your suggestion to keep an array of "ready" flags (or a counter of finished blocks) and let blocks spin until their dependencies are fulfilled: I do not fully understand that part. Nonetheless, I think this is exactly the global barrier referenced in my first post. On the other hand, as I already said, I got race conditions with that barrier.

The idea of this topic is to clear things up about the barrier and see why it is not working properly.
