CUDA hangs on GPU but not in emulation

My CUDA program runs fine in emulation mode. The problems (sometimes) come when I try to run it on the GPU.

SOMETIMES it just doesn’t run, or it takes too long to complete and consumes 100% of the CPU. The same kernel runs without problems in emulation. Nobody else is using the GPU, and the deviceQuery sample from the CUDA SDK executes correctly.

There are cases where, although a normal run of the kernel takes <1 s, it can take up to 1 min to get the results. BUT, if it executes once to completion, it runs normally from that point forward.

Rebooting doesn’t help.

Any ideas what could be the problem causing this behavior?

Do you have atomics? Could you be deadlocking the card? Are there race conditions?

(in other words, I can’t provide a useful answer without seeing code…)

I don’t want to explain what I am trying to do because it is a bit complicated.
Essentially I just load from global to shared memory.

BTW, it just happened: I rebooted and the first couple of times I tried to run this code I terminated it because it wouldn’t complete. Finally, one of the runs took 14 sec to complete the first time, and now it runs normally in 0.067 sec.

__global__ void
sdot2GPUSparseCAV3(float *gx, const float *ga, const float *gb, float *gres, const int n, const int Xpoints, const int XpointsYpoints, const int pitch_A, const int pitch_xbr) {

    extern __shared__ float sx[];

    unsigned int tid = threadIdx.x;
    unsigned int bid = blockIdx.x;

    // rowsPerThread is a compile-time constant defined elsewhere
    int gx_base = bid*blockDim.x*rowsPerThread - Xpoints;
    int gx_stop = (bid+1)*blockDim.x*rowsPerThread + Xpoints;

    gx_base = (gx_base >= 0) ? gx_base : 0;
    gx_stop = (gx_stop < n)  ? gx_stop : n;

    int sx_stop = (gx_base - Xpoints >= 0) ? rowsPerThread*blockDim.x + 2*Xpoints : rowsPerThread*blockDim.x + Xpoints;
    sx_stop = (sx_stop < n) ? sx_stop : n;
    int sx_base = 0;

    while (gx_base < gx_stop) {

        for (int j = 0; j < rowsPerThread; ++j) {
            if (gx_base + tid < gx_stop) {
                sx[sx_base + tid] = gx[gx_base + tid];
            }
            // printf("bid=%d, tid=%d j=%d,\t sx_index = %d, gx_index = %d\n", bid, tid, j, sx_base + tid, gx_base + tid + tmp);
            gx_base += blockDim.x;
            sx_base += blockDim.x;
        } // for loop

    } // while loop
} // kernel


However, if you want me to explain what I am trying to do let me know.


Looking at your code, I’m not seeing any obvious race conditions or deadlock sources (e.g. atomics used inside divergent branches). My next guess would be: are you sure you’re allocating the correct amount of memory, both shared and global?
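For what it’s worth, since the kernel declares `extern __shared__ float sx[]`, the shared-memory size has to be supplied as the third launch-configuration argument, and it must cover the largest index the kernel writes. A minimal sketch (the kernel and argument names are from your post; blockSize, numBlocks and the constant values are assumptions for illustration):

```cuda
// Sketch only: blockSize, rowsPerThread and Xpoints are assumed values.
const int blockSize     = 256;
const int rowsPerThread = 4;
const int Xpoints       = 32;
const int numBlocks     = 64;

// The kernel writes sx[] up to rowsPerThread*blockDim.x + 2*Xpoints
// elements, so the dynamic shared-memory allocation must be at least
// that many floats, or the stores silently go out of bounds on the GPU.
size_t smemBytes = (rowsPerThread * blockSize + 2 * Xpoints) * sizeof(float);

sdot2GPUSparseCAV3<<<numBlocks, blockSize, smemBytes>>>(
    gx, ga, gb, gres, n, Xpoints, XpointsYpoints, pitch_A, pitch_xbr);
```

In emulation mode an out-of-bounds shared-memory write often goes unnoticed, which would match “works in emulation, hangs on the GPU.”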

Also, if you have a cudaThreadSynchronize() after the kernel launch, does it return cudaSuccess or something else?
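To make the question concrete, something along these lines after the launch (a sketch using the CUDA 2.x-era runtime API; newer toolkits spell the sync call cudaDeviceSynchronize):

```cuda
// ... kernel launch goes here, e.g.
// sdot2GPUSparseCAV3<<<numBlocks, blockSize, smemBytes>>>(/* args */);

// Catches launch-configuration errors (bad grid/block/shared-mem size)
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

// Blocks until the kernel finishes and reports execution errors
err = cudaThreadSynchronize();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```

If the synchronize returns anything other than cudaSuccess, the error string usually narrows the problem down considerably.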

Now everything works fine with the same code, until the problem comes up again. Last night, for example, everything was fine, and this morning I couldn’t run my code.

When I don’t allocate the right amount of memory I get segmentation faults in emulation, so I don’t think that is related to this problem; still, the next time it comes up I will check whether this fixes it.

I was thinking of reinstalling CUDA and the drivers, because I get some problems with my screen as well. Should I install CUDA 2.0?

When you write out of bounds or allocate too little memory you might get a segmentation fault in emulation mode, but it is not particularly likely.

You could check it with valgrind or some other tool for this purpose, but then there is still the question of whether emulation mode mimics the real hardware closely enough.

I was having some hang-up issues on a GTX 280 with the beta version of CUDA and the corresponding driver. Updating to 2.0 final and the newest driver solved the issue. Did you check that you have the latest (not yet officially announced) version?