cudaThreadSynchronize() stalls application

Hi, here is my problem: I am backporting an application written for CUDA cards with compute capability >= 1.3 to my poor Tesla C870 (compute capability 1.0). I won’t go into why I am doing that; suffice it to say that it’s the only Tesla card I have at the moment. The CUDA SDK is release 2.3, V0.2.1221.

The code I am backporting is the CUDA GPU spiking neural network simulator by J. Moorkanikara called gpusnn2. Since their code makes heavy use of shared-memory atomic operations, I am using the shared-atomics hack published on this forum here.

The problem I found, however, seems unrelated to the hack or to the nature of the application. To put it simply: the application hangs (host CPU at 100%) at different locations, depending on whether or not I compile it with make dbg=1. I investigated, and the cause seems to be the use of the

cudaThreadSynchronize()

function. In fact, compiling in debug mode expands the CUTIL calls into macros that include a cudaThreadSynchronize() call.
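
What I mean is roughly this (a paraphrase of the idea, not the exact cutil.h source; MY_SAFE_CALL is a made-up name):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Paraphrase only, not the real cutil.h text: in debug builds the SAFE_CALL-style
   macros append a cudaThreadSynchronize() so that asynchronous errors surface at
   the call site - which is exactly where my program now hangs. */
#define MY_SAFE_CALL(call) do {                                              \
        cudaError_t err = (call);                                            \
        if (err != cudaSuccess) {                                            \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                     \
                    __FILE__, __LINE__, cudaGetErrorString(err));            \
            exit(EXIT_FAILURE);                                              \
        }                                                                    \
        err = cudaThreadSynchronize();   /* added only when dbg=1 */         \
        if (err != cudaSuccess) {                                            \
            fprintf(stderr, "CUDA sync error at %s:%d: %s\n",                \
                    __FILE__, __LINE__, cudaGetErrorString(err));            \
            exit(EXIT_FAILURE);                                              \
        }                                                                    \
    } while (0)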

As a cross-check, I tried inserting cudaThreadSynchronize() calls at random points in other code, and it does indeed hang there too, both in release and in debug mode.

Unfortunately my GPU has no hardware debugging capabilities, so I cannot run it under cuda-gdb. Furthermore, when compiling in device emulation mode, the application hangs in a totally unrelated place (reporting a failed texture fetch).

I have the sneaking suspicion that my card might have some memory or core corruption. I tried compiling Ocelot, but between Boost versioning issues and Debian package hell, that did not go well.

Does anybody have a suggestion?

Is cudaThreadSynchronize() known to fail or to stall the host CPU under specific conditions?

I have seen similar issues raised here as well, but they led nowhere. What am I doing wrong?

Hmm, so the normal behavior of cudaThreadSynchronize() is to spin at 100% CPU until all the previously queued kernels have finished. My first thought is that the shared memory atomic hack you mention is making the code so slow, it takes a long time for a kernel making heavy use of it to finish. This would give the appearance of a hang at cudaThreadSynchronize(), when in reality the GPU is just still busy.

However, I assume you would have seen the same issue without cudaThreadSynchronize(), since there is an implicit sync at the next cudaMemcpy(). Does the program run fine without cudaThreadSynchronize()?
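
To make the distinction concrete (hypothetical kernel and sizes, just to illustrate):

// Hypothetical kernel and sizes: the launch returns to the host immediately, so a
// long-running (or hung) kernel only becomes visible at the next synchronization
// point - an explicit cudaThreadSynchronize() or an implicit one such as a
// blocking cudaMemcpy().
__global__ void longKernel(int *data) { /* ... heavy work ... */ }

void runOnce(int *d_data, int *h_data)
{
    longKernel<<<128, 256>>>(d_data);   // asynchronous: returns right away

    cudaThreadSynchronize();            // spins on the host until the kernel ends

    // Even without the call above, this copy would block for the same reason:
    cudaMemcpy(h_data, d_data, 256 * sizeof(int), cudaMemcpyDeviceToHost);
}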

I’ve seen something similar in the past. Can’t remember the exact conditions though and no longer have access to the code for that project…

There’s a way to specify the cudaThreadSynchronize() behaviour (spin or sleep…). I think it is cudaSetDeviceFlags() or something similar. Check it out.
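
Something along these lines (untested sketch, flag names from memory; the flag has to be set before the context is created):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Untested sketch: the flag must be set before the first runtime call that
       creates the context. cudaDeviceScheduleSpin is the busy-wait default,
       cudaDeviceScheduleYield makes the host thread yield while waiting, and
       cudaDeviceBlockingSync (the CUDA 2.x name) blocks on a sync primitive. */
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleYield);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaSetDevice(0);

    /* ... kernel launches ... */

    cudaThreadSynchronize();   /* should no longer spin at 100% CPU */
    return 0;
}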

Interesting. I tried all reasonable flags (the API is here) but without success. Thanks for the pointer, an interesting “discovery”.

I understand. Unfortunately it does not seem to be a problem that has already been encountered and discussed here on the forum, or at least not in this form, because I cannot find any related thread. Thanks!

Yes, ‘maybe it’s so slow it just looks hung’ was indeed my first reaction too. So I let it run for a while, a few hours (it should complete in less than a couple of seconds), and nothing improved. Furthermore, although my card has no hardware debugging capabilities, I tried running the program under gdb (not cuda-gdb), and it turns out the threads are waiting in gettimeofday from libc, so it is obviously busy-waiting on a spinlock. So sad.

Regarding the question of whether the program runs fine without cudaThreadSynchronize(): I would not know. The call comes from an inline macro that expands in the include files, and I am not sure it would be healthy to go hack those, mostly because there are so many calls to cudaThreadSynchronize() that I would most likely end up chasing the wrong lead.

But thanks, I really appreciate everyone’s help. It made me feel less lonely :)

The other possibility is that the shared memory atomic implementation you are using is flawed. Concurrency primitives are notoriously hard to get right. In fact, the comments for the shared memory atomic implementation mention a limitation: “This is a fragile hack. It uses syncthreads()! So it’s not something to mix with divergent warps.” Do you call SharedAtomicAdd() inside an if-statement, or in a loop whose exit condition can differ between threads?
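
To show the failure pattern with a made-up example (this is not the actual hack, just an illustration of the hazard):

// Made-up example, not the actual shared-atomics hack: a helper that relies on
// __syncthreads() deadlocks when it is called from a divergent branch, because
// the threads that skip the call never reach the barrier.
__device__ void sharedAccumulate(int *sMem, int value)
{
    __syncthreads();                 // every thread of the block must arrive here
    sMem[threadIdx.x] += value;
    __syncthreads();                 // ...and here
}

__global__ void divergentKernel(int *out)
{
    __shared__ int sMem[256];
    sMem[threadIdx.x] = 0;
    __syncthreads();

    if (threadIdx.x < 32) {
        // Only part of the block executes the barriers inside the helper; the
        // rest never arrives, the kernel hangs, and the host then spins forever
        // in cudaThreadSynchronize().
        sharedAccumulate(sMem, 1);
    }
    out[threadIdx.x] = sMem[threadIdx.x];
}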

True, and indeed the warps were divergent, but let me explain: the problems I reported earlier started before any use of that hack, which is why I was not too concerned about it. Still, as a test I tried lowering the grid and block dimensions (down to 1x1…), and now things are different. I get past the cudaThreadSynchronize() point, and the run now fails with a Runtime API error … unspecified launch failure.

So we’re in a different scenario, and before adding anything else I need to keep investigating this.

Thanks for everything.

I am actually having a similar problem with cudaThreadSynchronize() and my program halting. In my case, I think it has to do with memory copies from device to host. In my program I have a pair of asynchronous memory copies (through host-mapped memory) and a synchronous copy from device to host. They are all reasonably sized, no more than 200 elements copied per cycle. If I call cudaThreadSynchronize() at the end of each cycle, the program will sometimes completely stall at the very first cycle. If I don’t, the program goes on for a few cycles but eventually hangs at the synchronous cudaMemcpy() at some point during execution. The only synchronous copy is a single-element transfer, so I don’t think the memcpy itself is the cause.
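
The structure of each cycle is roughly the following (names and sizes are made up, but the pattern is the same):

#include <stdio.h>
#include <cuda_runtime.h>

#define N 200   /* made-up size; the real copies are no larger than this */

int main(void)
{
    /* Mapped (zero-copy) host memory requires this flag before context creation. */
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaSetDevice(0);

    float *h_in, *h_out;
    cudaHostAlloc((void**)&h_in,  N * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)&h_out, N * sizeof(float), cudaHostAllocMapped);

    float *d_buf, h_flag;
    cudaMalloc((void**)&d_buf, N * sizeof(float));

    for (int cycle = 0; cycle < 100; ++cycle) {
        /* The pair of asynchronous copies through the pinned/mapped buffers. */
        cudaMemcpyAsync(d_buf, h_in, N * sizeof(float), cudaMemcpyHostToDevice, 0);
        /* ... kernel launches ... */
        cudaMemcpyAsync(h_out, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost, 0);

        /* The single synchronous copy: one element, device to host. */
        cudaMemcpy(&h_flag, d_buf, sizeof(float), cudaMemcpyDeviceToHost);

        /* With this call the program sometimes stalls on the very first cycle;
           without it, it eventually hangs inside the cudaMemcpy above. */
        cudaThreadSynchronize();
    }

    cudaFree(d_buf);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}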

Same here. I tried lowering the grid and block sizes to much smaller values, and it got past the stalling point.