I am having trouble using these global barriers (link). I have tried every other way I could find to synchronize without a global barrier (e.g. implicit and explicit synchronization from the CPU side), but the GPU time is worse than the CPU time. My input data array currently has around 100,000 elements, and it is likely to grow in the future (it is a research algorithm). There is a dependency between elements, as the following expression shows:
B[i] = A[i-1] + A[i] + A[i+1]
Here is a small code snippet to give an idea of what exactly I am trying to do on the GPU.
//CPU code snippet
//"times" is on the order of tens of millions
for (t = 0; t < times; t++) {
    for (j = 1; j < N - 1; j++)
        B[j] = A[j-1] + A[j] + A[j+1];
    //Swap the pointers instead of copying B back into A (efficient memory reuse)
    aux = A; A = B; B = aux;
}
Between iterations I need global synchronization on the GPU in order to achieve the same effect there.
My question is: has anybody used those GPU barriers, and did it go well?
Any other tricks to achieve the same purpose are welcome. I should stress that the dependency must not be violated.
All restrictions the global barrier imposes are satisfied (e.g. at most 1536 concurrent threads per SM).
P.S.: I am new to CUDA, so please excuse any conceptual mistakes I might make.
Measure how much time you lose on kernel launches (compare the version with one kernel launch per iteration against one that just loops over the iterations inside the kernel, even if that produces wrong results). Is it really worth trying global synchronization without kernel launches? The relative overhead only gets smaller as the problem size increases.
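One simple way to do this measurement is with CUDA events. A minimal sketch, assuming a kernel named `step` and placeholder launch parameters `blocks`, `threads`, `n`, and `iters` (none of these names come from the original post):

```cuda
// Time 'iters' back-to-back kernel launches with CUDA events.
// The per-iteration figure includes the launch overhead, so comparing
// it against a single-launch looping kernel isolates that overhead.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
for (int t = 0; t < iters; t++)
    step<<<blocks, threads>>>(A, B, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);   // wait until all launches have finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("avg time per launch+iteration: %f ms\n", ms / iters);
```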
Having said that, your problem is almost ideal: barring major hiccups, with simple round-robin scheduling of blocks the dependencies should almost always be fulfilled.

You could build an array of "ready" flags in memory that indicate whether each block has finished yet (or a simplified version where just a counter of finished blocks is kept, which may force blocks to spin if previous blocks haven't finished). Each block could then check at the beginning whether its dependencies are fulfilled, and spin (or grab a different block's work) if not. Instrument your kernel to see how many times blocks spin, and for how long. If the GPU block scheduler turns out not to work round-robin, build your own using atomic operations.

Once this works, go back to benchmarking: with all the extra bookkeeping, is your code actually faster than the original version that just launches a new kernel per iteration?
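To make the "ready" flag idea concrete, here is a hypothetical sketch for a single iteration, assuming one block per contiguous tile of A and a zero-initialized `ready` array with one flag per block (the names `step` and `ready` are illustrative, not from the post). Note that spin-waiting is only safe if every block a spinning block waits on is already resident on the device, otherwise the kernel can deadlock:

```cuda
// Sketch: per-block "ready" flags with spin-waiting on neighbour blocks.
// Each block computes one tile of B; before publishing its own result it
// issues a __threadfence() so the writes are visible device-wide.
__global__ void step(const float *A, float *B, int n,
                     volatile int *ready)   // one flag per block, zeroed
{
    int b = blockIdx.x;
    int i = b * blockDim.x + threadIdx.x;

    if (threadIdx.x == 0) {
        // Spin until the left and right neighbour blocks have
        // published their data (instrument this loop to count spins).
        while ((b > 0 && !ready[b - 1]) ||
               (b < gridDim.x - 1 && !ready[b + 1]))
            ;
    }
    __syncthreads();

    if (i > 0 && i < n - 1)
        B[i] = A[i-1] + A[i] + A[i+1];

    // Fence first, so the writes to B are visible to other blocks
    // before the flag flips; then one thread announces completion.
    __threadfence();
    __syncthreads();
    if (threadIdx.x == 0)
        ready[b] = 1;
}
```

For the simplified counter variant, the flag store would be replaced by an atomicAdd on a single counter, and the wait condition by a comparison against the number of blocks that must have finished.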
Good point. Maybe a thread fence (__threadfence()) can help in signalling which blocks have finished.
I got a little confused at first; I usually use i and j as indices of a 2D matrix. This is a 'time dependent' problem in which each step depends on the previous one. The only safe way is to have one kernel call per time step. The A = B copy is avoided by using two pointers and swapping the arguments between launches: kernel<<<...>>>(A, B) for one step and kernel<<<...>>>(B, A) for the next. I would also suggest using shared memory.
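The argument-swapping pattern can be sketched on the host side like this (the kernel name `step` and the launch configuration are placeholders):

```cuda
// Double buffering by swapping pointers between launches:
// no device-to-device copy of B back into A is ever needed.
float *A, *B;                 // device buffers, already allocated and filled
for (long t = 0; t < times; t++) {
    step<<<blocks, threadsPerBlock>>>(A, B, n);
    // Launches on the same stream execute in order, so each launch
    // boundary acts as the global barrier between iterations.
    float *tmp = A; A = B; B = tmp;
}
cudaDeviceSynchronize();      // wait for the last step before reading back
```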
So far, in the best case, I got a 2x speedup with the global barrier versus launching a new kernel each iteration (without memory transfers between iterations). By best case I mean an input data array of 2000 elements. However, I might have race conditions with the global barrier referenced in my first post.
I do not fully understand what exactly you are saying in this part; nonetheless, I think this is exactly the global barrier referenced in the first post. On the other hand, as I already said, I got race conditions with that barrier.
The idea of this topic is to clear up things related to that barrier and see why it is not working properly.