Hi, here is my problem: I am backporting an application written for CUDA cards with compute capability >= 1.3 to my poor Tesla C870 (compute capability 1.0). I won’t go into why I am doing that; suffice it to say it’s the only Tesla card I have at the moment. The CUDA SDK is release 2.3, V0.2.1221.
The code I am backporting is the CUDA GPU spiking neural network simulator by J. Moorkanikara, called gpusnn2. Since that code heavily uses shared-memory atomic operations, I am using the shared-atomics hack published on this forum here.
The problem I found, however, does not seem tied to the hack or to the nature of the application. Put simply: the application hangs (host CPU at 100%) at different locations depending on whether I compile with make dbg=1 or not. I investigated, and the cause seems to be the cudaThreadSynchronize() function. In fact, compiling in debug mode turns a CUTIL call into a macro that includes a cudaThreadSynchronize() call.
As a cross-check, I tried inserting random cudaThreadSynchronize() calls elsewhere in the code, and it does indeed hang there, in both release and debug mode.
Unfortunately my GPU does not have hardware debugging capabilities, so I cannot run it under cuda-gdb. Furthermore, when compiled in device emulation mode, the application hangs in a totally unrelated area (reporting a failed texture fetch).
I have the sneaking suspicion that my card might have some memory or core corruption. I tried compiling Ocelot, but thanks to Boost versioning issues plus Debian package hell, that did not go well.
Does anybody have a suggestion?
Is cudaThreadSynchronize() known to fail, or to stall the host CPU, under specific conditions?
I have seen similar issues raised here before, but none of them reached a resolution. What am I doing wrong?