GPU synchronization __threadfence()

I tried to implement the GPU synchronization method introduced by "On the Robust Mapping of Dynamic Programming onto a Graphics Processing Unit" (http://synergy.cs.vt.edu/pubs/papers/xiao-icpads2009-gpu.pdf). The method is very similar to the code sample on p. 111 of the CUDA Programming Guide Version 2.3.1.

I employed the synchronization function given in Figure 7 of the paper in my kernel, and it worked correctly when the dimension of the matrix was smaller than 256×256. However, when the dimension of the matrix is 256×256 or greater, the program seems to never terminate. So I wrote a very simple function to test the synchronization function, in which 1 is added to each element of the matrix in each iteration. It fails in the same way once the matrix dimension is increased.

I have attached my code and hope someone could help. Thanks a lot.

BTW: GPU1.cu contains the main function. Device_MatrixUtilities.cu includes the synchronization function __device__ void __GPU_sync(int goalVal), the kernel __global__ void Test(float *U), and other related functions. The file header.h contains the definitions of the matrix, block, and grid dimensions.
header.h (413 Bytes)
Device_MatrixUtilities.cu (1.73 KB)
GPU1.cu (1.86 KB)
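Roughly, the barrier from Figure 7 of the paper looks like this (a paraphrase, not the exact attached code; the counter is marked volatile here so the spin loop re-reads it, and goalVal is assumed to be gridDim.x times the current iteration count so the counter never needs resetting inside the kernel):

    __device__ volatile int g_mutex;           // counter shared by all blocks

    // Lock-based inter-block barrier, roughly as in Figure 7 of the paper.
    // goalVal must reach gridDim.x * (iteration count) as blocks arrive.
    __device__ void __GPU_sync(int goalVal)
    {
        if (threadIdx.x == 0) {                // one thread per block takes part
            atomicAdd((int *)&g_mutex, 1);     // announce this block's arrival
            while (g_mutex != goalVal) { }     // spin until all blocks have arrived
        }
        __syncthreads();                       // then release the rest of the block
    }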

Oh dear. It’s just never a good idea to depend on such behavior. Really. You’re not even making a mutex, you’re trying to make a kernel-wide sync point which is an even deeper circle of Hell.

But… beyond that, you’re not even initializing your g_mutex value correctly, so its results are undefined regardless. You need to set it to 0 before you start your kernel. Right now, every block is setting it at a random time in its execution.
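If it helps, one way to do that from the host, assuming g_mutex is declared at file scope in Device_MatrixUtilities.cu (grid, block and d_U are placeholder names, not necessarily what your GPU1.cu uses):

    int zero = 0;
    // Reset the inter-block counter before every launch; a leftover value from a
    // previous launch makes the spin condition in __GPU_sync meaningless.
    cudaMemcpyToSymbol(g_mutex, &zero, sizeof(int));
    Test<<<grid, block>>>(d_U);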

I disagree. If making a kernel-wide sync point is faster than launching a new kernel, you should definitely go for it.

Vasily

This may not be a terrible idea if your last name is Volkov or Murray. Otherwise, you are going to be very sad when you try to do this.

Kernel-wide sync points work only when number_of_blocks == number_of_multiprocessors, and you would want to maximize threads per block to go with that. If you have more blocks than MPs, you will be waiting for a block that isn’t launched yet - and never will be, because you are deadlocked waiting for it …
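Spelled out, the rule amounts to this (a sketch; the kernel and variable names are placeholders, not code from this thread):

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // One block per multiprocessor, so every block is resident
    // before any of them enters the barrier.
    int numBlocks = prop.multiProcessorCount;
    Test<<<numBlocks, threadsPerBlock>>>(d_U);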

Running more than one block per MP (with fewer threads each) is no advantage in this case, because 1) the global sync point will undo any asynchronous smartness the scheduler can come up with anyway, and 2) you can’t get at the local results immediately from shared memory, which forces you into the (now even busier) spin loops that check for the sync at an earlier point in time.

For most applications, sync by relaunching the kernel will be the better option.
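For completeness, the relaunch variant, where the implicit ordering of launches in one stream acts as the global barrier (Step is a hypothetical per-iteration kernel; the other names are placeholders):

    for (int it = 0; it < numIterations; ++it) {
        // All blocks of iteration `it` are guaranteed to have finished
        // before any block of iteration `it + 1` starts.
        Step<<<grid, block>>>(d_U, it);
    }
    cudaDeviceSynchronize();   // wait for the final iteration before reading results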

  1. Launching a new kernel will also undo any “asynchronous smartness” of the scheduler.

  2. You might be able to exchange data between thread blocks running on different multiprocessors via the L2 cache. Possibly it is faster than going all the way to DRAM.

The key motivation for not launching a new kernel is its high overhead - a few microseconds. This may cost you millions of floating point operations when running at sub-teraflop rates.
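If you want to put a number on that overhead for your own card, timing a batch of empty launches gives a rough figure (a sketch, not a rigorous benchmark; Empty is a made-up no-op kernel defined at file scope):

    __global__ void Empty() { }

    // Average cost of a null kernel launch, measured with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < 1000; ++i)
        Empty<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    // `ms` milliseconds over 1000 launches ≈ `ms` microseconds per launch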

Vasily

Vasily,

Yes, but in 1) I am arguing - or intended to - against multiple blocks per MP combined with global sync; kernel relaunch is not under consideration at that point. In 2) I am arguing that shared memory will be faster than global, and that you can start fetching from it directly after you have written out to global, without waiting for the global sync to happen. Furthermore, the sync is more likely to have happened already if you have some work to do before checking.

The overhead for a relaunch is higher than “a few microseconds” when you have to save and restore the state of all registers - more like 30 or 40.
Whatever speed L2 may have, it ain’t faster than shared? :)

Even with fewer blocks than multiprocessors you can still get deadlocks… imagine running on Fermi with multiple kernels executing simultaneously and occupying some of the SMs. CUDA doesn’t guarantee your blocks will get all the MPs.

You are right, shared memory is faster. But that’s the point - shared memory is important because it enables fast inter-thread communication. Unfortunately, it is limited to communication within a multiprocessor. If we had fast global communication, we might get further speedups. I see two research questions here: (i) how fast is global communication on Fermi? (ii) what kernels can take advantage of fast global communication?

That’s easy. Just don’t run multiple kernels simultaneously.

If CUDA does not guarantee correct execution in this case, but you show that these guarantees are important, the guarantees will be provided in the next version of CUDA. This happens all the time. Say, two years ago we had no memory consistency model for global memory, but now we have a memory fence. GPGPU was always a bleeding-edge technology. Do you remember the first matrix multiply on a GPU? That GPU was not even programmable, nor did it have floating point support. Larsen and McAllister really did push the envelope. They did not produce practical code, but rather pushed the technology towards what we have today.
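In sketch form, that fence - the __threadfence() from the thread title - lets one block publish a result before raising a flag that other blocks poll (illustrative names, not code from this thread):

    __device__ int result;
    __device__ volatile int ready;   // polled by consumer blocks

    __global__ void Producer(int value)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            result = value;
            __threadfence();   // make the write to `result` visible device-wide...
            ready = 1;         // ...before any other block can observe ready == 1
        }
    }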

Vasily

I think there was a thread started by cygnusX1 which showed that, in some cases, a previous kernel launch determines how many active blocks the next kernel will run with. It seems that up to half the SMs may end up idle on a kernel launch.

http://forums.nvidia.com/index.php?showtop…;show=&st=0

So even when you have all resources available to your kernel, you still might not get all the SMs - and hence a deadlock when you try this kind of synchronization.

cbuchner1,

Cygnus breaks the one rule you have to follow in order to make global sync work reliably:

blocks == SM’s

… and SPWorley’s example is so out of control that we have no idea how many blocks are in flight anymore - so it will probably, if not predictably, fail.

There is no simple assumption you can make to guarantee that global sync works reliably. Believe me on that one. This is not just “oh, the execution model doesn’t support this”; I know exactly what the pitfalls are because I have looked into this at length.

Can you give an example that fails?

My algorithm requires me to sync globally, and I plan on investigating the whole ‘inter vs. intra’ question early next year, perhaps.

So is GPU-wide synchronization difficult, but possible?

It sounds like “possible, but cannot be guaranteed to work in all situations.”

It’s not feasible to implement robustly.

Could you share the details?
