GPU synchronization __threadfence()

I tried to implement the GPU synchronization method introduced by "On the Robust Mapping of Dynamic Programming onto a Graphics Processing Unit" (http://synergy.cs.vt.edu/pubs/papers/xiao-icpads2009-gpu.pdf). The method is very similar to the code sample on p. 111 of the CUDA Programming Guide Version 2.3.1.

I employed the synchronization function given in Figure 7 of the paper in my kernel, and it worked correctly when the dimension of the matrix was smaller than 256×256. However, when the dimension of the matrix is 256×256 or greater, the program seems to never terminate. So I wrote a very simple function to test the synchronization function, in which 1 is added to each element of the matrix in each iteration. It fails in the same way once the matrix dimension is increased.

I have attached my code and hope someone could help. Thanks a lot.

BTW: GPU1.cu contains the main function. Device_MatrixUtilities.cu includes the synchronization function __device__ void __GPU_sync(int goalVal), the kernel __global__ void Test(float *U), and other related functions. The file header.h contains the definitions of the matrix, block, and grid dimensions.
header.h (413 Bytes)
Device_MatrixUtilities.cu (1.73 KB)
GPU1.cu (1.86 KB)
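Roughly, the barrier from Figure 7 of the paper looks like this (a paraphrase, not the exact attached code; the counter is marked volatile here so the spin loop re-reads it, and goalVal is assumed to be gridDim.x times the current iteration count so the counter never needs resetting inside the kernel):

    __device__ volatile int g_mutex;           // counter shared by all blocks

    // Lock-based inter-block barrier, roughly as in Figure 7 of the paper.
    // goalVal must reach gridDim.x * (iteration count) as blocks arrive.
    __device__ void __GPU_sync(int goalVal)
    {
        if (threadIdx.x == 0) {                // one thread per block takes part
            atomicAdd((int *)&g_mutex, 1);     // announce this block's arrival
            while (g_mutex != goalVal) { }     // spin until all blocks have arrived
        }
        __syncthreads();                       // then release the rest of the block
    }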

Oh dear. It’s just never a good idea to depend on such behavior. Really. You’re not even making a mutex, you’re trying to make a kernel-wide sync point which is an even deeper circle of Hell.

But… beyond that, you’re not even initializing your g_mutex value correctly, so its results are undefined regardless. You need to set it to 0 before you start your kernel. Right now, every block is setting it at a random time in its execution.
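If it helps, one way to do that from the host, assuming g_mutex is declared at file scope in Device_MatrixUtilities.cu (grid, block and d_U are placeholder names, not necessarily what your GPU1.cu uses):

    int zero = 0;
    // Reset the inter-block counter before every launch; a leftover value from a
    // previous launch makes the spin condition in __GPU_sync meaningless.
    cudaMemcpyToSymbol(g_mutex, &zero, sizeof(int));
    Test<<<grid, block>>>(d_U);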

I disagree. If making a kernel-wide sync point is faster than launching a new kernel, you should definitely go for it.

Vasily

This may not be a terrible idea if your last name is Volkov or Murray. Otherwise, you are going to be very sad when you try to do this.

Kernel-wide sync points work only when number_of_blocks == number_of_multiprocessors, and you would want to maximize threads per block to go with that. If you have more blocks than MPs, you will be waiting for a block that isn’t launched yet - and never will be, because you are deadlocked waiting for it …
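Spelled out, the rule amounts to this (a sketch; the kernel and variable names are placeholders, not code from this thread):

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // One block per multiprocessor, so every block is resident
    // before any of them enters the barrier.
    int numBlocks = prop.multiProcessorCount;
    Test<<<numBlocks, threadsPerBlock>>>(d_U);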

Running more than one block per MP (with fewer threads each) is no advantage in this case, because 1) the global sync point will undo any asynchronous smartness the scheduler can come up with anyway, and 2) you can’t get at the local results immediately from shared memory, which forces you into the (now even busier) spin loops that check for the sync at an earlier point in time.

For most applications, sync by relaunching the kernel will be the better option.
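For completeness, the relaunch variant, where the implicit ordering of launches in one stream acts as the global barrier (Step is a hypothetical per-iteration kernel; the other names are placeholders):

    for (int it = 0; it < numIterations; ++it) {
        // All blocks of iteration `it` are guaranteed to have finished
        // before any block of iteration `it + 1` starts.
        Step<<<grid, block>>>(d_U, it);
    }
    cudaDeviceSynchronize();   // wait for the final iteration before reading results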

  1. Launching a new kernel will also undo any “asynchronous smartness” of the scheduler.

  2. You might be able to exchange data between thread blocks running on different multiprocessors via the L2 cache. Possibly it is faster than going all the way to DRAM.

The key motivation for not launching a new kernel is its high overhead - a few microseconds. This may cost you millions of floating point operations when running at sub-teraflop rates.
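If you want to put a number on that overhead for your own card, timing a batch of empty launches gives a rough figure (a sketch, not a rigorous benchmark; Empty is a made-up no-op kernel defined at file scope):

    __global__ void Empty() { }

    // Average cost of a null kernel launch, measured with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < 1000; ++i)
        Empty<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    // `ms` milliseconds over 1000 launches ≈ `ms` microseconds per launch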

Vasily

Vasily,

Yes, but in 1) I am arguing - or intended to - against multiple blocks per MP combined with global sync; kernel relaunch is not under consideration at that point. In 2) I am arguing that shared memory will be faster than global, and that you can start fetching from it directly after you have written out to global, without waiting for the global sync to happen. Furthermore, the sync is more likely to have happened already if you have some work to do before checking.

The overhead for a relaunch is higher than “a few microseconds” when you have to save and restore the state of all registers - more like 30 or 40.
Whatever speed L2 may have, it ain’t faster than shared? :)

Even with fewer blocks than multiprocessors you can still get deadlocks… imagine running on Fermi with multiple kernels executing simultaneously and occupying some of the SMs. CUDA doesn’t guarantee your blocks will get all the MPs.

You are right, shared memory is faster. But that’s the point - shared memory is important because it enables fast inter-thread communication. Unfortunately, it is limited to communication within a multiprocessor. If we had fast global communication, we might get further speedups. I see two research questions here: (i) how fast is global communication on Fermi? (ii) what kernels can take advantage of fast global communication?

That’s easy. Just don’t run multiple kernels simultaneously.

If CUDA does not guarantee correct execution in this case, but you show that these guarantees are important, the guarantees will be provided in the next version of CUDA. This happens all the time. Say, two years ago we had no memory consistency model for global memory, but now we have a memory fence. GPGPU was always a bleeding-edge technology. Do you remember the first matrix multiply on a GPU? That GPU was not even programmable, nor did it have floating point support. Larsen and McAllister really did push the envelope. They did not produce practical code, but rather pushed the technology towards what we have today.
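In sketch form, that fence - the __threadfence() from the thread title - lets one block publish a result before raising a flag that other blocks poll (illustrative names, not code from this thread):

    __device__ int result;
    __device__ volatile int ready;   // polled by consumer blocks

    __global__ void Producer(int value)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            result = value;
            __threadfence();   // make the write to `result` visible device-wide...
            ready = 1;         // ...before any other block can observe ready == 1
        }
    }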

Vasily

I think there was a thread started by cygnusX1 which showed that, in some cases, a previous kernel launch determines how many active blocks the next kernel will run with. It seems that up to half the SMs may end up idle on a kernel launch.

http://forums.nvidia.com/index.php?showtop…;show=&st=0

So even when you have all resources available to your kernel, you still might not get all the SMs - and hence a deadlock when you try this kind of synchronization.

cbuchner1,

Cygnus breaks the one rule you have to follow in order to make global sync work reliably:

blocks == SM’s

… and SPWorley’s example is so out of control that we have no idea how many blocks are in flight anymore - so it will probably, if not predictably, fail.

There is no simple assumption you can make to guarantee that global sync works reliably. Believe me on that one. This is not just “oh, the execution model doesn’t support this”; I know exactly what the pitfalls are because I have looked into this at length.

Can you give an example that fails?

My algorithm requires me to sync globally, and I plan on investigating the whole ‘inter vs. intra’ question early next year, perhaps.

So is GPU-wide synchronization difficult, but possible?

It sounds like “possible, but cannot be guaranteed to work in all situations.”

It’s not feasible to implement robustly.

Could you share the details?
