Synchronization across all threads

I am writing a time marching finite difference program. I have two arrays, AA and BB. The data in BB can be expressed in (roughly) the following form:

BB[ii] = AA[ii-1] + AA[ii] + AA[ii+1]
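
In kernel form it is roughly something like this (a simplified sketch; the boundary elements are just skipped here):

// rough sketch of the update kernel; one thread per interior element
__global__ void step(const float *AA, float *BB, int n)
{
    int ii = blockIdx.x * blockDim.x + threadIdx.x;
    if (ii > 0 && ii < n - 1)
        BB[ii] = AA[ii - 1] + AA[ii] + AA[ii + 1];
}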

Let's say I do this once, by launching a grid of blocks of threads and using the thread id as an index into the arrays. Now, I want to swap the arrays AA and BB and then repeat this process. To do this, I want to be assured that all threads in the grid (not just the block) have finished operating. Is there some sort of grid-level synchronization, or a hack to do it? I read somewhere that the CUDA model does not allow grid-level synchronization. Is this correct?

Currently, I execute the above process for one step on the GPU, come out to the CPU, swap the pointers of AA and BB, and then restart the kernel on the GPU. I think this stopping and restarting amounts to some loss of efficiency.
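
In code, the host side currently looks roughly like this (a sketch; step stands for the kernel above, and nsteps, gridSize, and blockSize are placeholders):

// current approach: one kernel launch per time step,
// with the pointer swap done on the CPU in between
for (int t = 0; t < nsteps; ++t) {
    step<<<gridSize, blockSize>>>(AA, BB, n);
    cudaThreadSynchronize();               // wait for the whole grid
    float *tmp = AA; AA = BB; BB = tmp;    // swap the roles of AA and BB
}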

Is there a better way to do this?

Only by allowing the kernel to complete execution.

The loss is something like ~10-20 microseconds for each kernel launch. For millisecond+ kernels, the overhead is essentially nothing.

Are you iterating for a fixed number of steps or until some convergence threshold is met? I think queuing kernel launches should amortize most of the overhead.
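
For example, since kernel launches are asynchronous, you could enqueue a batch of steps back to back and only synchronize (and test convergence) once per batch. A sketch, with batch as a placeholder:

// launches on the same stream execute in order, so the host-side swap
// is safe without waiting; synchronize once per batch
for (int t = 0; t < batch; ++t) {
    step<<<gridSize, blockSize>>>(AA, BB, n);
    float *tmp = AA; AA = BB; BB = tmp;
}
cudaThreadSynchronize();   // then compute the convergence criterion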

The iterative solver I'm writing goes as follows:

i = 1
While value > convergence criterion
    if i is odd
        Launch Kernel<<<>>>(A, B)
    else
        Launch Kernel<<<>>>(B, A)
    end if
    cudaThreadSynchronize()
    Calculate convergence criterion
    i = i + 1
Loop

Should do what you want it to, I think…

I am doing exactly that, but I want to offload the entire time loop to the GPU. For this I need a grid-level synchronization function, which, apparently, does not exist. As I had suspected, and as the second poster pointed out, the only way we can do grid-level synchronization is by letting the kernel complete execution.

It is a relevant overhead if the arrays are small (< 100,000 elements, I would say), but then why can't you just use, e.g., 10 buffers? That would at least allow you to queue things.

But no matter the size, your bottleneck is most likely global memory speed, so why not just make a kernel that calculates two iterations without using global memory that often? You can calculate the results in shared memory, calculating some of the border sums in each block.

Actually, as long as you make sure that all the AA elements you need are in shared memory, you should be limited enough by global memory bandwidth that you can just make the kernel calculate 2 steps directly, like:

BB[ii] = AA[ii-2] + 2*AA[ii-1] + 3*AA[ii] + 2*AA[ii+1] + AA[ii+2]
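
Something along these lines (just a sketch; it assumes blockDim.x == BLOCK and zero-pads the borders, which you would replace with your real boundary conditions):

#define BLOCK 256

// two fused steps per launch: load a tile of AA plus a 2-element halo
// on each side into shared memory, then apply the widened 5-point stencil
__global__ void step2(const float *AA, float *BB, int n)
{
    __shared__ float s[BLOCK + 4];
    int ii  = blockIdx.x * BLOCK + threadIdx.x;
    int loc = threadIdx.x + 2;                 // offset past the left halo

    if (ii < n) s[loc] = AA[ii];
    if (threadIdx.x < 2) {                     // first two threads load the halos
        s[threadIdx.x] = (ii - 2 >= 0)    ? AA[ii - 2]     : 0.0f;
        s[loc + BLOCK] = (ii + BLOCK < n) ? AA[ii + BLOCK] : 0.0f;
    }
    __syncthreads();

    if (ii >= 2 && ii < n - 2)
        BB[ii] = s[loc-2] + 2.0f*s[loc-1] + 3.0f*s[loc]
               + 2.0f*s[loc+1] + s[loc+2];
}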

I can't do two steps because of stability considerations (I am limited by the CFL condition). The example I gave is conceptually very similar to the PDE I am solving, but not exactly the same. I am not sure what you mean when you say use "10 buffers".

CUDA doesn't support grid-level synchronization; however, it can be achieved by using atomic instructions and implementing some sort of semaphore synchronization. But you have to be careful, because the number of blocks in the grid must be small enough that all the blocks are being timesliced by the multiprocessors (you can't miss if you make the number of blocks equal to the number of multiprocessors).
If the number of blocks is too large, they will be executed serially and your application will fall into an infinite loop.
I implemented this kind of synchronization in my application (I was also solving a PDE) and it works, well, at least the synchronization does (I have yet to find some nasty memory leaks).

And yes, this method is faster than kernel-level synchronization, but it can be a pain in the neck to implement…
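
The basic pattern looks something like this (just a sketch of the idea, not my exact code; count is a zero-initialised counter in global memory, and the goal argument grows by gridDim.x per step so the counter never needs resetting):

// per-block arrival counter used as a grid-wide barrier; deadlocks
// unless every block of the grid is resident on the GPU at once
__device__ void grid_barrier(volatile int *count, int goal)
{
    __syncthreads();                     // whole block has finished its work
    __threadfence();                     // make this block's writes visible
    if (threadIdx.x == 0) {
        atomicAdd((int *)count, 1);      // announce this block's arrival
        while (*count < goal) { }        // spin until every block has arrived
    }
    __syncthreads();                     // release the rest of the block
}

__global__ void solver(float *AA, float *BB, int n, int nsteps, int *count)
{
    for (int t = 0; t < nsteps; ++t) {
        // ... one time step: read AA, write BB ...
        grid_barrier(count, (t + 1) * gridDim.x);
        float *tmp = AA; AA = BB; BB = tmp;   // each thread swaps its own copies
    }
}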

This is a terrible idea: future cards will see almost no performance benefit compared to your existing code, because you've targeted the algorithm specifically for that one card. Really, don't do that.

Don't worry, I am completely aware of the drawbacks that arise from writing this kind of code. I am just evaluating all the possibilities. I wrote many different versions of my application just to see what can be done and what works best.