Good way to do sync between/across cores and threadblocks?

I need some way to sync between cores, as I’m going to implement a successive relaxtion algorithm (linear solver).
I’m planning to divide the input into smaller sections and need a way to communicate data between these sections.

Even though the overhead of starting kernels is small, I’m thinking that the best performance would be gained by using a single kernel and handle the sync in that kernel.
(As opposed to launching a new kernel for each step in the iterative algorithm).

What would be a good way to achieve this?

  1. Streams? (Haven’t worked with them yet, but are they easy to use for this? I.e. is there a “stream_in_read()” kind of function that blocks until something is available to read on a specific stream?)
  2. Atomic read/writes of flags in global memory? (In this way I could use these flags to do communication between cores when data are ready)
  3. Several kernel launches? (If this is generally the way to go…)
  4. Other?

Several kernel launches, potentially using streams and cudaStreamWaitEvent.

By doing manual inter-block synchronization you will gain maximum 2microseconds comparing to multiple kernel launch. Unless your logic is really that fast and 2mcs overhead is significant, you shouldn’t bother with manual block synchronization, and simply do multiple kernel launches.

Inter-block synchronization has plenty of other caveats. Don’t do that.

Thank you very much for your replies!

I now have problems with an implementation, because I thought data was persistent in shared memory (per core), but in retrospect I guess it makes good sense that shared memory is wiped when a new kernel is launched (is this was actually happens? And can it be avoided?).

I have an application with the same number of blocks as there are physical cores. Is there a way I can keep data in shared memory across kernel invocations, or do I have to store and reread the content in shared memory everytime I call i new kernel??

I’m calling the same kernel (6 blocks, 32 threads) many many times in a loop, and if I have to swap memcontent between shared memory and device/global memory for every call, I’m afraid my design was doomed from the beginning…

Is it possible in some way to keep shared memory intact across kernelcalls?

No it is not possible. Maybe you can combine your kernels into smaller number of kernels ?

It sounds like you are really underutilizing the GPU with only 6 blocks and 32 threads per block. What GPU do you have?

I’m using a GeForce GT325M.

I’m trying to implement an SOR red-black algorithm (my second application ever with CUDA, so I’m still new) and I guess my design is way off… :)

Anyways - It seems that what I should do is to split my input matrix up into many blocks and transfer whole blocks between global and shared memory for every kernel invocation.

My fear is that, since I need to synchronize my red-black iterations (by using kernel invocations as it is), there will be a great number of memory transfers between global and shared memory. I’m comparing to a seriel version I made running on the CPU and it seems that many memory transfers like this will quickly make the CUDA version undesireable with this sort of synhronization mechanism.

Do any of you know if this sounds like a problem with the way I’m synchronizing between iterations?

The code is basicly:





stuff…//Calculate the new largest varience


For something like a red-black GS iterator, using separate kernels for the red and black passes is really the best idea. The code probably winds up being memory bandwidth limited, but the overhead of buffering the grid in shared memory can be amortized by having each block solve more than one “subdomain” per call. For example, if you are solving a 3D problem, lauch a 2D grid and have each block “walk” through the third dimension. That increases the FLOP count per memory transaction considerably.


I’ll do another implementation of 5-point stencil red-black using a 2d dimgrid and 2d blockgrid (I did’nt use 2d grids before, so I had an overhead in computing indexes I’m afraid…).

Then I’ll try to transfer blocks back and forth between global and shared memory to see the performance difference.

Does it sound like a good approach?

Thanks for the answers!

Do you think it would be necesaary to use a kind of doublebuffering/asynchrous memory transfers between global and shared memory?

If so, is that a relatively easy thing to accomplish?

This reminds me of the N-body GPU Gems paper, I found it well worth understanding the approach used.

I’ll try to look at that. Thanks.

Is the following correctly understood:

If I launch a kernel with a 2d dimgrid and 2d blockgrid, then every block will do all it’s work before being rescheduled? (I’m not doing anything explicit to halt execution - i.e. waiting for something to happen. I’m simply doing a 5-point stencil for each block).
It’s important because I swap an amount of data between global and shared memory for each block.

For each block, this is what I plan to do (let’s assume each threadIdx.x spans 4 floats/16 bytes and each threadIdx.y spans 1 float and that the kernel is launched with a blockDim and gridDim that covers all data in the input matrix).

//Move data from global til shared memeory - each thread copies it’s own four bytes - this should total all data for the current block
for-loop that copies 4 bytes from global to shared memory indexed by blockIdx and threadIdx information.

//Do operation on data in shared memeory
for-loop doing a red or black iteration accross the data in shared memory - reading values from global memory for the border cases

//Move data back to global memory - again each thread moves four bytes, which should total the movement of the whole block of data when the code is done executing for the current block.

I guess my questions can be broken down to these:

  1. Can I be sure that all threads (in 2d) belonging to the same block are always synchronized regarding intructions (so that I know that all data is present in the shared memory when I’m doing 5-point stencil operations on it in the middle part of the pseudocode)? Just making sure… :)

  2. Can I be sure that blocks are not swapped out in the middle of execution, which would potentially cause incorret data to be present in the shared memory?

  3. Is it a good idea to let each thread of the block copy it’s own four bytes of data, or is it better to let fewer threads copy more data? I guess this would be the case if each memory transfer has some constant timefactor attached to it.

I now have a SOR red-black implementation that seriously beats my host system :-)

Looks like I used way to few threads (and blocks) before, thereby underutilizing the hardware.

Thanks to all who replied :-)