CUDA deadlock issues in emulation mode

Aditi · May 2, 2009, 7:24am

Hi,

I have a small test code that launches 1 block of 300 threads. The main step in the test kernel has every thread read (read-only access) the same memory location (the same variable) in global memory and multiply it by thread-specific variables (automatic variables held in registers, I believe). In execution mode the code gives the desired result, but in emulation mode it hangs. I suspect a deadlock, but I have the following questions/observations:

Can reading alone cause a deadlock, or only writing to the same location?
A __syncthreads() barrier on both sides of the step hasn’t helped either :(
When I step through the code in the VS debugger, it sometimes gets through in emulation mode, but with warnings such as: First-chance exception at 0x046b356b in MATLAB.exe: 0xC0000005: Access violation reading location 0x045e0100.
Each thread uses around 27 registers, and 27*300 = 8100 is within the limit for a compute capability 1.3 device; anything beyond that should be spilled to global memory anyway.
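The register arithmetic in the last point can be checked rather than estimated. The following is a minimal sketch, not the poster's code: `testKernel` is a hypothetical stand-in for the real kernel, and `fitsRegisterBudget` and `reportRegisterUse` are illustrative helper names. It relies on `cudaFuncGetAttributes` for the per-thread register count the compiler assigned and on `cudaGetDeviceProperties` for the per-block register limit (a compute capability 1.3 multiprocessor has 16384 32-bit registers).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel; only its register count matters here.
__global__ void testKernel(const double* in, double* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0 * in[i];
}

// Pure arithmetic: does a block of this size fit in the register budget?
bool fitsRegisterBudget(int regsPerThread, int threadsPerBlock, int regsPerBlock)
{
    return regsPerThread * threadsPerBlock <= regsPerBlock;
}

// Queries the compiled kernel and device 0, prints the numbers,
// and returns whether the block fits. Returns false if a query fails.
bool reportRegisterUse(int threadsPerBlock)
{
    cudaFuncAttributes attr;
    cudaDeviceProp prop;
    if (cudaFuncGetAttributes(&attr, testKernel) != cudaSuccess) return false;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;

    printf("%d registers/thread x %d threads = %d (limit per block: %d)\n",
           attr.numRegs, threadsPerBlock,
           attr.numRegs * threadsPerBlock, prop.regsPerBlock);
    return fitsRegisterBudget(attr.numRegs, threadsPerBlock, prop.regsPerBlock);
}
```

Calling `reportRegisterUse(300)` from `main` prints the budget for a 300-thread block. Compiling with `nvcc --ptxas-options=-v` reports the same per-thread register count at build time.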

What should I do about it? I believed __syncthreads() would take care of mutual exclusion and so should have helped, but it hasn’t.

Will appreciate any tips.

Thanks,

Aditi

seibert · May 3, 2009, 1:56pm

I’m not sure I understand the question. It sounds like you may have confused the concept of “race condition” with “deadlock.” Multiple threads reading and writing the same memory location can only result in non-deterministic results if you have reads after writes (or writes after reads) without an appropriate barrier. To avoid this problem, you put a __syncthreads() between the reads and the writes that need to be ordered. __syncthreads() is not a mutex, though.
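To make the ordering point concrete, here is a minimal sketch (kernel name, helper name, and block size are all hypothetical) in which every thread writes one shared-memory slot and then reads a slot written by a different thread. The barrier between the two phases is what makes the result deterministic; it orders the accesses but does not act as a lock.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256  // hypothetical block size for this sketch

// Each thread writes one slot, then reads the slot written by its neighbour.
__global__ void shiftLeft(const int* in, int* out)
{
    __shared__ int buf[BLOCK];
    int t = threadIdx.x;

    buf[t] = in[t];                 // write phase
    __syncthreads();                // every write above completes before any read below
    out[t] = buf[(t + 1) % BLOCK];  // read phase: reads another thread's slot
}

// Host helper: runs one block and checks the result against the expected shift.
bool runShiftLeft()
{
    int h_in[BLOCK], h_out[BLOCK];
    for (int i = 0; i < BLOCK; i++) h_in[i] = i;

    int *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    shiftLeft<<<1, BLOCK>>>(d_in, d_out);

    // The copy back also surfaces any error raised by the kernel.
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < BLOCK; i++)
        if (h_out[i] != (i + 1) % BLOCK) return false;
    return true;
}
```

Without the `__syncthreads()` call, a thread could read `buf[(t + 1) % BLOCK]` before its neighbour has written it, which is a race condition rather than a deadlock.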

You can run into trouble if you run __syncthreads() inside a branch so that some threads in the block skip over it. That is the only way I can think of for __syncthreads() to cause your code to hang. There may be some other problem…
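The branch hazard can be sketched as follows, with hypothetical names and sizes throughout. The commented-out form places the barrier inside a branch that only some threads take; the kernel below keeps the work conditional but the barrier unconditional, so every thread in the block reaches it.

```cuda
#include <cuda_runtime.h>

#define BLOCK 320  // hypothetical block size, larger than the 300 active threads

// Hazard (do not do this): threads with t >= n never reach the barrier,
// so the behaviour is undefined and the block may hang:
//     if (t < n) { buf[t] = in[t]; __syncthreads(); out[t] = buf[n - 1 - t]; }

// Safe form: conditional work, unconditional barrier.
__global__ void reverseFirstN(const int* in, int* out, int n)
{
    __shared__ int buf[BLOCK];
    int t = threadIdx.x;

    if (t < n) buf[t] = in[t];           // only the first n threads write
    __syncthreads();                     // all BLOCK threads reach this barrier
    if (t < n) out[t] = buf[n - 1 - t];  // only the first n threads read
}

// Host helper: reverses n = 300 values with one block of BLOCK threads.
bool runReverseFirstN()
{
    const int n = 300;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; i++) h_in[i] = i;

    int *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    reverseFirstN<<<1, BLOCK>>>(d_in, d_out, n);

    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; i++)
        if (h_out[i] != n - 1 - i) return false;
    return true;
}
```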

Aditi · May 4, 2009, 1:44pm

Thanks for your reply. Let me elaborate on my understanding: Code working in execution and not working in emulation should probably mean a deadlock since in emulation mode the threads execute sequentially. My kernel right now performs this:

[codebox]
asum = 1.0 + rowcoeffs_d[0] + rowcoeffs_d[1] + rowcoeffs_d[2] + rowcoeffs_d[3];

M1 = M2 = M3 = M4 = (double)((double)(inimage_d[Tx]<=dgrey)/asum);

for (c=0; c < numCols_d; c++) {
    M0 = (double)(inimage_d[Tx+c*numRows_d]<=dgrey) - rowcoeffs_d[0]*M1 - rowcoeffs_d[1]*M2 - rowcoeffs_d[2]*M3 - rowcoeffs_d[3]*M4;

    tmpimage_d[(Bx*numRows_d*numCols_d)+Tx+c*numRows_d] += rowcoeffs_d[4]*M0 + rowcoeffs_d[5]*M1 + rowcoeffs_d[6]*M2 + rowcoeffs_d[7]*M3;

    M4 = M3; M3 = M2; M2 = M1; M1 = M0;
}
[/codebox]

The number of blocks right now is 1 (i.e. Bx=0 always) and the number of threads is 300 (Tx is the thread ID). rowcoeffs_d[i], numRows_d, numCols_d, inimage_d and tmpimage_d are all in global memory. asum, dgrey, M0, M1, M2, M3 and M4 are all thread-local variables (declared inside the __global__ function, and hence, I assume, stored in the registers and/or local memory of each thread).
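For reference, the fragment above can be wrapped into a self-contained kernel and compared against the same recursion run on the host, which is roughly the cross-check emulation mode was being used for. This is a sketch under several assumptions that are not in the post: the image is stored as `double` in column-major order, the sizes and `dgrey` are passed by value instead of living in global memory, a single block is used (so the `Bx` offset is dropped), `tmpimage` starts zeroed, and the coefficient values and the names `filterRow`, `filterKernel` and `gpuMatchesHost` are hypothetical.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// One row of the recursion, compiled for both host and device.
// Layout assumption: element (row, c) lives at row + c * numRows.
__host__ __device__ void filterRow(const double* inimage, double* tmpimage,
                                   const double* rowcoeffs, int row,
                                   int numRows, int numCols, double dgrey)
{
    double asum = 1.0 + rowcoeffs[0] + rowcoeffs[1] + rowcoeffs[2] + rowcoeffs[3];
    double M0, M1, M2, M3, M4;
    M1 = M2 = M3 = M4 = (double)(inimage[row] <= dgrey) / asum;
    for (int c = 0; c < numCols; c++) {
        int idx = row + c * numRows;
        M0 = (double)(inimage[idx] <= dgrey) - rowcoeffs[0]*M1 - rowcoeffs[1]*M2
             - rowcoeffs[2]*M3 - rowcoeffs[3]*M4;
        tmpimage[idx] += rowcoeffs[4]*M0 + rowcoeffs[5]*M1 + rowcoeffs[6]*M2
                         + rowcoeffs[7]*M3;
        M4 = M3; M3 = M2; M2 = M1; M1 = M0;
    }
}

// One thread per row; the guard keeps surplus threads from indexing out of range.
__global__ void filterKernel(const double* inimage, double* tmpimage,
                             const double* rowcoeffs,
                             int numRows, int numCols, double dgrey)
{
    int Tx = threadIdx.x;
    if (Tx < numRows)
        filterRow(inimage, tmpimage, rowcoeffs, Tx, numRows, numCols, dgrey);
}

// Runs the recursion on the host and on the GPU and compares the two results.
bool gpuMatchesHost(int numRows, int numCols)
{
    size_t count = (size_t)numRows * numCols, bytes = count * sizeof(double);
    std::vector<double> img(count), ref(count, 0.0), out(count, 0.0);
    for (size_t i = 0; i < count; i++) img[i] = (double)((i * 37) % 256);
    const double coeffs[8] = {0.1, 0.05, 0.02, 0.01, 0.4, 0.3, 0.2, 0.1}; // made up
    const double dgrey = 128.0;                                           // made up

    for (int row = 0; row < numRows; row++)
        filterRow(img.data(), ref.data(), coeffs, row, numRows, numCols, dgrey);

    double *d_img = 0, *d_out = 0, *d_coeffs = 0;
    cudaMalloc(&d_img, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMalloc(&d_coeffs, sizeof(coeffs));
    cudaMemcpy(d_img, img.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_coeffs, coeffs, sizeof(coeffs), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);   // the kernel accumulates with +=

    filterKernel<<<1, numRows>>>(d_img, d_out, d_coeffs, numRows, numCols, dgrey);

    cudaError_t err = cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_img); cudaFree(d_out); cudaFree(d_coeffs);
    if (err != cudaSuccess) return false;

    // Small tolerance: the device may fuse multiply-adds differently from the host.
    for (size_t i = 0; i < count; i++)
        if (fabs(out[i] - ref[i]) > 1e-9) return false;
    return true;
}
```

Each thread in this sketch reads shared inputs and writes only its own row of `tmpimage`, so no barrier is needed inside the loop.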

This piece of code hangs even without any __syncthreads() barrier. Please observe that there is only simultaneous reading (and no simultaneous writing) of the same memory location in the calculation of asum, M0 and tmpimage_d. I think I was confused in believing that __syncthreads() also provides mutexes.

Looking at the code and its behavior in emulation and execution modes, can you please suggest what the problem could be and how to avoid it? I also see memory access violation warnings (like First-chance exception at 0x046b356b in MATLAB.exe: 0xC0000005: Access violation reading location 0x045e0100) when the code runs in emulation mode (MATLAB.exe because I use nvmex to compile the code).

I appreciate your help.

Thanks,

Aditi

seibert · May 5, 2009, 5:10am


Reading and writing simultaneously can’t create a deadlock, so something else has to be wrong here. Out of curiosity, what is numCols_d set to?

Aditi · May 7, 2009, 5:45pm

Thanks for the reply. It’s a 300x300 image, so numCols_d and numRows_d are both set to 300. My final code consists of eight such loops and has a total of 128 blocks of 300 threads each. It works perfectly fine in execution mode: it gives the desired output and a speed-up of over 5x with no special optimization (not even shared memory usage). The same code hangs in device emulation mode. I haven’t used a __syncthreads() barrier anywhere except after the end of each of these loops.

Can you guess any reason? If you want, I can even mail my code to you. I need to get emulation mode running so that I can exploit more advanced features of CUDA for optimization. Right now I have used only the basics.

Thanks,

Aditi

NucL23 · June 9, 2009, 11:32am

Has anybody solved this problem?

I have a very similar problem to this…

With similar code, when I read local memory (declared in a __device__ function as: float mat[3][3]), the whole kernel that executes the device function does not run.

Even when I changed the matrix to “__shared__ float mat[256][3][3]” with 256 threads per block, where each thread accesses “mat[tId][i][j]”, the kernel still does not run.

What happens?

I’m using CUDA 2.0 on Windows Vista 64-bit, with a compute capability 1.3 device.

Does CUDA really NOT support simultaneous reading?
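When a kernel appears not to run at all, the launch status is usually the first thing to check. The sketch below is not the poster's code: the kernel and helper names are hypothetical, and the per-thread 3x3 matrix is filled with made-up values. It uses the shared array shape described above (256 x 3 x 3 floats, 9216 bytes) and checks both `cudaGetLastError` after the launch and the result of `cudaDeviceSynchronize`, the current name for what CUDA 2.0 called `cudaThreadSynchronize`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 256

// Each thread owns one 3x3 slice of the shared array and sums its diagonal.
__global__ void traceKernel(float* out)
{
    __shared__ float mat[THREADS][3][3];   // 256 * 9 * 4 bytes = 9216 bytes
    int tId = threadIdx.x;

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            mat[tId][i][j] = (i == j) ? (float)tId : 0.0f;

    // No barrier needed: a thread only touches its own slice mat[tId].
    out[tId] = mat[tId][0][0] + mat[tId][1][1] + mat[tId][2][2];
}

// Host helper: launches the kernel, reports any launch or execution error,
// and verifies that every thread produced 3 * tId.
bool runTraceKernel()
{
    float h_out[THREADS];
    float* d_out = 0;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return false;

    traceKernel<<<1, THREADS>>>(d_out);
    cudaError_t launchErr = cudaGetLastError();     // bad launch configuration, etc.
    cudaError_t runErr = cudaDeviceSynchronize();   // errors raised while running

    if (launchErr != cudaSuccess || runErr != cudaSuccess) {
        cudaError_t err = (launchErr != cudaSuccess) ? launchErr : runErr;
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return false;
    }

    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    for (int t = 0; t < THREADS; t++)
        if (h_out[t] != 3.0f * (float)t) return false;
    return true;
}
```

If the launch itself is rejected (for example, because the block needs more shared memory or registers than the device offers), the printed error string says so, which separates a failed launch from a genuine hang.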
