Synchronization across all threads

I am writing a time marching finite difference program. I have two arrays, AA and BB. The data in BB can be expressed in (roughly) the following form:

BB[ii] = AA[ii-1] + AA[ii] + AA[ii+1]
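
In kernel form it is roughly something like this (a simplified sketch; the boundary elements are just skipped here):

// rough sketch of the update kernel; one thread per interior element
__global__ void step(const float *AA, float *BB, int n)
{
    int ii = blockIdx.x * blockDim.x + threadIdx.x;
    if (ii > 0 && ii < n - 1)
        BB[ii] = AA[ii - 1] + AA[ii] + AA[ii + 1];
}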

Let's say I do this once, by launching a grid of blocks of threads and using the thread id as an index into the arrays. Now, I want to swap the arrays AA and BB and then repeat this process. To do this, I want to be assured that all threads in the grid (not just the block) have finished operating. Is there some sort of grid-level synchronization, or a hack to do it? I read somewhere that the CUDA model does not allow grid-level synchronization. Is this correct?

Currently, I execute the above process for one step on the GPU, come out to the CPU, swap the pointers of AA and BB, and then restart the kernel on the GPU. I think this stopping and restarting amounts to some loss of efficiency.
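
In code, the host side currently looks roughly like this (a sketch; step stands for the kernel above, and nsteps, gridSize, and blockSize are placeholders):

// current approach: one kernel launch per time step,
// with the pointer swap done on the CPU in between
for (int t = 0; t < nsteps; ++t) {
    step<<<gridSize, blockSize>>>(AA, BB, n);
    cudaThreadSynchronize();               // wait for the whole grid
    float *tmp = AA; AA = BB; BB = tmp;    // swap the roles of AA and BB
}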

Is there a better way to do this?

Only by allowing the kernel to complete execution.

The loss is something like ~10-20 microseconds for each kernel launch. For millisecond+ kernels, the overhead is essentially nothing.

Are you iterating for a fixed number of steps or until some convergence threshold is met? I think queuing kernel launches should amortize most of the overhead.
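
For example, since kernel launches are asynchronous, you could enqueue a batch of steps back to back and only synchronize (and test convergence) once per batch. A sketch, with batch as a placeholder:

// launches on the same stream execute in order, so the host-side swap
// is safe without waiting; synchronize once per batch
for (int t = 0; t < batch; ++t) {
    step<<<gridSize, blockSize>>>(AA, BB, n);
    float *tmp = AA; AA = BB; BB = tmp;
}
cudaThreadSynchronize();   // then compute the convergence criterion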

The iterative solver I'm writing goes as follows:

i = 1
While value > convergence criterion
    if i is odd
        Launch Kernel<<<>>>(A, B)
    else
        Launch Kernel<<<>>>(B, A)
    end if
    cudaThreadSynchronize()
    Calculate convergence criterion
    i = i + 1
Loop

Should do what you want it to, I think…

I am doing exactly that, but I want to offload the entire time loop to the GPU. For this I need a grid-level synchronization function, which, apparently, does not exist. As I had suspected, and as the second poster pointed out, the only way we can do grid-level synchronization is by letting the kernel complete execution.

It is a relevant overhead if the arrays are small (< 100,000 elements, I would say), but then why can't you just use, e.g., 10 buffers? That would at least allow you to queue things.

But no matter the size, your bottleneck is most likely global memory speed, so why not just make a kernel that calculates two iterations without using global memory that often? You can calculate the results in shared memory, calculating some of the border sums in each block.

Actually, as long as you make sure that all the AA elements you need are in shared memory, you should be limited enough by global memory bandwidth that you can just make the kernel calculate 2 steps directly, like:

BB[ii] = AA[ii-2] + 2*AA[ii-1] + 3*AA[ii] + 2*AA[ii+1] + AA[ii+2]
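
Something along these lines (just a sketch; it assumes blockDim.x == BLOCK and zero-pads the borders, which you would replace with your real boundary conditions):

#define BLOCK 256

// two fused steps per launch: load a tile of AA plus a 2-element halo
// on each side into shared memory, then apply the widened 5-point stencil
__global__ void step2(const float *AA, float *BB, int n)
{
    __shared__ float s[BLOCK + 4];
    int ii  = blockIdx.x * BLOCK + threadIdx.x;
    int loc = threadIdx.x + 2;                 // offset past the left halo

    if (ii < n) s[loc] = AA[ii];
    if (threadIdx.x < 2) {                     // first two threads load the halos
        s[threadIdx.x] = (ii - 2 >= 0)    ? AA[ii - 2]     : 0.0f;
        s[loc + BLOCK] = (ii + BLOCK < n) ? AA[ii + BLOCK] : 0.0f;
    }
    __syncthreads();

    if (ii >= 2 && ii < n - 2)
        BB[ii] = s[loc-2] + 2.0f*s[loc-1] + 3.0f*s[loc]
               + 2.0f*s[loc+1] + s[loc+2];
}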

I can't do two steps because of stability considerations (I am limited by the CFL condition). The example I gave is conceptually very similar to the PDE I am solving, but not exactly the same. I am not sure what you mean when you say use "10 buffers".

CUDA doesn't support grid-level synchronization; however, it can be achieved by using atomic instructions and implementing some sort of semaphore synchronization. But you have to be careful, because the number of blocks in the grid must be small enough that all the blocks are being timesliced by the multiprocessors (you can't miss if you make the number of blocks equal to the number of multiprocessors).
If the number of blocks is too large, they will be executed serially and your application will fall into an infinite loop.
I implemented this kind of synchronization in my application (I was also solving a PDE) and it works, well, at least the synchronization does (I have yet to find some nasty memory leaks).

And yes, this method is faster than kernel-level synchronization, but it can be a pain in the neck to implement…
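
The basic pattern looks something like this (just a sketch of the idea, not my exact code; count is a zero-initialised counter in global memory, and the goal argument grows by gridDim.x per step so the counter never needs resetting):

// per-block arrival counter used as a grid-wide barrier; deadlocks
// unless every block of the grid is resident on the GPU at once
__device__ void grid_barrier(volatile int *count, int goal)
{
    __syncthreads();                     // whole block has finished its work
    __threadfence();                     // make this block's writes visible
    if (threadIdx.x == 0) {
        atomicAdd((int *)count, 1);      // announce this block's arrival
        while (*count < goal) { }        // spin until every block has arrived
    }
    __syncthreads();                     // release the rest of the block
}

__global__ void solver(float *AA, float *BB, int n, int nsteps, int *count)
{
    for (int t = 0; t < nsteps; ++t) {
        // ... one time step: read AA, write BB ...
        grid_barrier(count, (t + 1) * gridDim.x);
        float *tmp = AA; AA = BB; BB = tmp;   // each thread swaps its own copies
    }
}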

This is a terrible idea: future cards will see almost no performance benefit compared to your existing code, because you've targeted the algorithm specifically for that one card. Really, don't do that.

Don't worry, I am completely aware of the drawbacks that arise from writing this kind of code. I am just evaluating all the possibilities. I wrote many different versions of my application just to see what can be done and what works best.