cudaThreadSynchronize() stalls application

Hi, here is my problem: I am backporting an application written for CUDA cards with compute capability >= 1.3 to my poor Tesla C870 (compute capability 1.0). I won’t go into why I am doing that; suffice it to say that it’s the only Tesla card I have at the moment. The CUDA SDK is release 2.3, V0.2.1221.

The code I am backporting is the CUDA GPU spiking neural network simulator by J. Moorkanikara called gpusnn2. Since their code makes heavy use of shared-memory atomic operations, I am using the shared-atomics hack published on this forum here.

The problem I found, however, seems unrelated to the hack or to the nature of the application. To put it simply: the application hangs (host CPU at 100%) at different locations, depending on whether or not I compile it with make dbg=1. I investigated, and the cause seems to be the use of the

cudaThreadSynchronize()

function. In fact, compiling in debug mode expands the CUTIL calls into macros that include a cudaThreadSynchronize() call.
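
What I mean is roughly this (a paraphrase of the idea, not the exact cutil.h source; MY_SAFE_CALL is a made-up name):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Paraphrase only, not the real cutil.h text: in debug builds the SAFE_CALL-style
   macros append a cudaThreadSynchronize() so that asynchronous errors surface at
   the call site - which is exactly where my program now hangs. */
#define MY_SAFE_CALL(call) do {                                              \
        cudaError_t err = (call);                                            \
        if (err != cudaSuccess) {                                            \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                     \
                    __FILE__, __LINE__, cudaGetErrorString(err));            \
            exit(EXIT_FAILURE);                                              \
        }                                                                    \
        err = cudaThreadSynchronize();   /* added only when dbg=1 */         \
        if (err != cudaSuccess) {                                            \
            fprintf(stderr, "CUDA sync error at %s:%d: %s\n",                \
                    __FILE__, __LINE__, cudaGetErrorString(err));            \
            exit(EXIT_FAILURE);                                              \
        }                                                                    \
    } while (0)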

As a cross-check, I tried inserting cudaThreadSynchronize() calls at random points in other code, and it does indeed hang there too, both in release and in debug mode.

Unfortunately my GPU has no hardware debugging capabilities, so I cannot run it under cuda-gdb. Furthermore, when compiling in device emulation mode, the application hangs in a totally unrelated place (reporting a failed texture fetch).

I have the sneaking suspicion that my card might have some memory or core corruption. I tried compiling Ocelot, but between Boost versioning issues and Debian package hell, that did not go well.

Does anybody have a suggestion?

Is cudaThreadSynchronize() known to fail or to stall the host CPU under specific conditions?

I have seen similar issues raised here as well, but they led nowhere. What am I doing wrong?

Hmm, so the normal behavior of cudaThreadSynchronize() is to spin at 100% CPU until all the previously queued kernels have finished. My first thought is that the shared memory atomic hack you mention is making the code so slow, it takes a long time for a kernel making heavy use of it to finish. This would give the appearance of a hang at cudaThreadSynchronize(), when in reality the GPU is just still busy.

However, I assume you would have seen the same issue without cudaThreadSynchronize(), since there is an implicit sync at the next cudaMemcpy(). Does the program run fine without cudaThreadSynchronize()?
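
To make the distinction concrete (hypothetical kernel and sizes, just to illustrate):

// Hypothetical kernel and sizes: the launch returns to the host immediately, so a
// long-running (or hung) kernel only becomes visible at the next synchronization
// point - an explicit cudaThreadSynchronize() or an implicit one such as a
// blocking cudaMemcpy().
__global__ void longKernel(int *data) { /* ... heavy work ... */ }

void runOnce(int *d_data, int *h_data)
{
    longKernel<<<128, 256>>>(d_data);   // asynchronous: returns right away

    cudaThreadSynchronize();            // spins on the host until the kernel ends

    // Even without the call above, this copy would block for the same reason:
    cudaMemcpy(h_data, d_data, 256 * sizeof(int), cudaMemcpyDeviceToHost);
}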

I’ve seen something similar in the past. Can’t remember the exact conditions though and no longer have access to the code for that project…

There’s a way to specify the cudaThreadSynchronize() behaviour (spin or sleep…). I think it is cudaSetDeviceFlags() or something similar. Check it out.
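
Something along these lines (untested sketch, flag names from memory; the flag has to be set before the context is created):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Untested sketch: the flag must be set before the first runtime call that
       creates the context. cudaDeviceScheduleSpin is the busy-wait default,
       cudaDeviceScheduleYield makes the host thread yield while waiting, and
       cudaDeviceBlockingSync (the CUDA 2.x name) blocks on a sync primitive. */
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleYield);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaSetDevice(0);

    /* ... kernel launches ... */

    cudaThreadSynchronize();   /* should no longer spin at 100% CPU */
    return 0;
}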

Interesting. I tried all reasonable flags (the API is here) but without success. Thanks for the pointer, an interesting “discovery”.

I understand. Unfortunately it does not seem to be a problem that has already been encountered and discussed here on the forum, or at least not in this form, because I cannot find any related thread. Thanks!

Yes, ‘maybe it’s so slow it just looks hung’ was indeed my first reaction too. So I let it run for a while, a few hours (it should complete in less than a couple of seconds), and nothing improved. Furthermore, although my card has no hardware debugging capabilities, I tried running the program under gdb (not cuda-gdb), and it turns out the threads are waiting in gettimeofday from libc, so it is obviously busy-waiting on a spinlock. So sad.

Regarding the question of whether the program runs fine without cudaThreadSynchronize(): I would not know. The call comes from an inline macro that expands in the include files, and I am not sure it would be healthy to go hack those, mostly because there are so many calls to cudaThreadSynchronize() that I would most likely end up chasing the wrong lead.

But thanks, I really appreciate everyone’s help. It made me feel less lonely :)

The other possibility is that the shared memory atomic implementation you are using is flawed. Concurrency primitives are notoriously hard to get right. In fact, the comments for the shared memory atomic implementation mention a limitation: “This is a fragile hack. It uses syncthreads()! So it’s not something to mix with divergent warps.” Do you call SharedAtomicAdd() inside an if-statement, or in a loop whose exit condition can differ between threads?
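
To show the failure pattern with a made-up example (this is not the actual hack, just an illustration of the hazard):

// Made-up example, not the actual shared-atomics hack: a helper that relies on
// __syncthreads() deadlocks when it is called from a divergent branch, because
// the threads that skip the call never reach the barrier.
__device__ void sharedAccumulate(int *sMem, int value)
{
    __syncthreads();                 // every thread of the block must arrive here
    sMem[threadIdx.x] += value;
    __syncthreads();                 // ...and here
}

__global__ void divergentKernel(int *out)
{
    __shared__ int sMem[256];
    sMem[threadIdx.x] = 0;
    __syncthreads();

    if (threadIdx.x < 32) {
        // Only part of the block executes the barriers inside the helper; the
        // rest never arrives, the kernel hangs, and the host then spins forever
        // in cudaThreadSynchronize().
        sharedAccumulate(sMem, 1);
    }
    out[threadIdx.x] = sMem[threadIdx.x];
}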

True, and indeed the warps were divergent, but let me explain: the problems I reported earlier started before any use of that hack, which is why I was not too concerned about it. Still, as a test I tried lowering the grid and block dimensions (down to 1x1…), and now things are different. I get past the cudaThreadSynchronize() point, and the run now fails with a Runtime API error … unspecified launch failure.

So we’re in a different scenario, and before adding anything else I need to keep investigating this.

Thanks for everything.

I am actually having a similar problem with cudaThreadSynchronize() and my program halting. In my case, I think it has to do with memory copies from device to host. In my program I have a pair of asynchronous memory copies (through host-mapped memory) and a synchronous copy from device to host. They are all reasonably sized, no more than 200 elements copied per cycle. If I call cudaThreadSynchronize() at the end of each cycle, the program will sometimes completely stall at the very first cycle. If I don’t, the program goes on for a few cycles but eventually hangs at the synchronous cudaMemcpy() at some point during execution. The only synchronous copy is a single-element transfer, so I don’t think the memcpy itself is the cause.
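
The structure of each cycle is roughly the following (names and sizes are made up, but the pattern is the same):

#include <stdio.h>
#include <cuda_runtime.h>

#define N 200   /* made-up size; the real copies are no larger than this */

int main(void)
{
    /* Mapped (zero-copy) host memory requires this flag before context creation. */
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaSetDevice(0);

    float *h_in, *h_out;
    cudaHostAlloc((void**)&h_in,  N * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)&h_out, N * sizeof(float), cudaHostAllocMapped);

    float *d_buf, h_flag;
    cudaMalloc((void**)&d_buf, N * sizeof(float));

    for (int cycle = 0; cycle < 100; ++cycle) {
        /* The pair of asynchronous copies through the pinned/mapped buffers. */
        cudaMemcpyAsync(d_buf, h_in, N * sizeof(float), cudaMemcpyHostToDevice, 0);
        /* ... kernel launches ... */
        cudaMemcpyAsync(h_out, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost, 0);

        /* The single synchronous copy: one element, device to host. */
        cudaMemcpy(&h_flag, d_buf, sizeof(float), cudaMemcpyDeviceToHost);

        /* With this call the program sometimes stalls on the very first cycle;
           without it, it eventually hangs inside the cudaMemcpy above. */
        cudaThreadSynchronize();
    }

    cudaFree(d_buf);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}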

Same here. I tried lowering the grid and block sizes to much smaller values, and it got past the stalling point.