CUDA deadlock issues in emulation mode


I have a small test code with 1 block of 300 threads. The main step in the test kernel needs all the threads to read (read-only access) the same memory location (same variable) in global memory and multiply it with thread-specific variables (which I believe are auto variables in registers). Execution mode gives the desired result, while the code hangs in emulation mode. I see a deadlock as a possible reason, but I have the following questions/observations:

  1. Can even reading cause a deadlock? Or should only writing to the same location?
  2. A __syncthreads() barrier on both sides of the step hasn’t helped either :(
  3. When I step through in VS, the code sometimes goes through in emulation mode but with certain warnings like: First-chance exception at 0x046b356b in MATLAB.exe: 0xC0000005: Access violation reading location 0x045e0100.
  4. Each thread uses around 27 registers, and 27*300 = 8100 is within the limit for a compute capability 1.3 device; any spill-over should otherwise be handled through local memory.

What should I do about it? I believed __syncthreads() would take care of mutual exclusion and so should have helped, but it hasn’t.

Will appreciate any tips.



I’m not sure I understand the question. It sounds like you may have confused the concept of “race condition” with “deadlock.” Multiple threads reading and writing the same memory location can produce non-deterministic results only if you have reads after writes (or writes after reads) without an appropriate barrier. To avoid this problem, you put a __syncthreads() between the reads and the writes that need to be ordered. __syncthreads() is not a mutex, though.
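A minimal sketch of that read-after-write ordering (hypothetical kernel and names, assuming a single block of 300 threads):

```cuda
// Hypothetical illustration: neighbour exchange through shared memory.
// Without the __syncthreads() between the write phase and the read phase,
// thread i could read buf[i+1] before its neighbour has written it --
// a race condition, not a deadlock.
__global__ void shift_left(const float *in, float *out, int n)
{
    __shared__ float buf[300];      // one slot per thread in the block
    int i = threadIdx.x;

    buf[i] = in[i];                 // write phase
    __syncthreads();                // barrier orders writes before reads
    if (i < n - 1)
        out[i] = buf[i + 1];        // read phase: neighbour's value
}
```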

You can run into trouble if you run __syncthreads() inside a branch so that some threads in the block skip over it. That is the only way I can think of for __syncthreads() to cause your code to hang. There may be some other problem…
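For reference, this is the hang pattern being described (a hypothetical sketch): if __syncthreads() sits inside a data-dependent branch, the threads that skip the branch never reach the barrier, and the rest of the block waits forever.

```cuda
__global__ void bad_barrier(float *data)
{
    int i = threadIdx.x;

    if (data[i] > 0.0f) {
        // BUG: only threads whose element is positive reach this barrier;
        // the others never arrive, so the block hangs at the barrier.
        __syncthreads();
        data[i] *= 2.0f;
    }
}

// Safe version: every thread in the block executes the barrier
// unconditionally, then branches afterwards.
__global__ void good_barrier(float *data)
{
    int i = threadIdx.x;
    float v = data[i];
    __syncthreads();                // all threads reach this point
    if (v > 0.0f)
        data[i] = v * 2.0f;
}
```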

Thanks for your reply. Let me elaborate on my understanding: code that works in execution mode but not in emulation mode should probably mean a deadlock, since in emulation mode the threads execute sequentially. My kernel right now performs this:


asum = 1.0 + rowcoeffs_d[0] + rowcoeffs_d[1] + rowcoeffs_d[2] + rowcoeffs_d[3];

M1 = M2 = M3 = M4 = (double)(inimage_d[Tx] <= dgrey) / asum;

for (c = 0; c < numCols_d; c++) {
    M0 = (double)(inimage_d[Tx+c*numRows_d] <= dgrey) - rowcoeffs_d[0]*M1 - rowcoeffs_d[1]*M2 - rowcoeffs_d[2]*M3 - rowcoeffs_d[3]*M4;
    tmpimage_d[(Bx*numRows_d*numCols_d)+Tx+c*numRows_d] += rowcoeffs_d[4]*M0 + rowcoeffs_d[5]*M1 + rowcoeffs_d[6]*M2 + rowcoeffs_d[7]*M3;
    M4 = M3; M3 = M2; M2 = M1; M1 = M0;
}


The number of blocks right now is 1 (i.e. Bx = 0 always) and the number of threads is 300 (Tx is the thread ID). rowcoeffs_d[], numRows_d, numCols_d, inimage_d, and tmpimage_d are all in global memory. asum, dgrey, M0, M1, M2, M3, M4 are all thread-local variables (declared inside the __global__ function and hence, I assume, stored in the registers and/or local memory of each thread).


This piece of code hangs even without any __syncthreads() barrier. Please observe that there is only simultaneous reading (and no simultaneous writing) of the same memory location in the calculation of asum, M0, and tmpimage_d. I think I got confused into thinking that __syncthreads() also provides mutual exclusion.
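For what it’s worth, concurrent read-only access to a single global location is safe by itself. In a sketch like the following (hypothetical names), every thread reads the same scale value, and because each thread writes only its own output element there is neither a race nor any possibility of deadlock:

```cuda
__global__ void broadcast_read(const double *scale, const double *in,
                               double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * scale[0];   // all threads read scale[0]: safe
}
```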

Looking at the code and the behavior in emulation and execution modes, can you please suggest what the problem could be and how to avoid it? I also see memory-access-violation warnings (like: First-chance exception at 0x046b356b in MATLAB.exe: 0xC0000005: Access violation reading location 0x045e0100) when the code runs in emulation mode (MATLAB.exe because I use nvmex to compile the code).

I appreciate your help.



Reading and writing simultaneously can’t create a deadlock, so something else has to be wrong here. Out of curiosity, what is numCols_d set to?

Thanks for the reply. It’s a 300x300 image, so numCols_d and numRows_d are both set to 300. My final code consists of eight such loops and has a total of 128 blocks of 300 threads each. It works perfectly fine in execution mode, gives the desired output, and achieves a speed-up of over 5x with no special optimization (not even shared memory usage). But this code hangs in device emulation mode. I haven’t used a __syncthreads() barrier anywhere except after the end of each such loop.

Can you guess any reason? If you want, I can even mail my code to you. I need emulation mode running to be able to exploit the more advanced features of CUDA for optimization. Right now I have used only the basics.



Did anybody solve this problem?

I have a very similar problem to this…

With similar code, when I read local memory (declared in a device function as: float mat[3][3]), the whole kernel that calls the device function does not run.

Even when I changed the matrix to __shared__ float mat[256][3][3] with 256 threads per block, with each thread accessing mat[tId][i][j], the kernel still does not run.

What happens?

I’m using CUDA 2.0 on Windows Vista 64, with a compute capability 1.3 device.
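For reference, a per-thread tile in shared memory along the lines described would look something like this (a sketch with assumed names). At 256 × 3 × 3 floats it needs 9216 bytes, which fits within the 16 KB of shared memory per block on a compute capability 1.3 device, so the size alone shouldn’t stop the kernel from launching:

```cuda
#define THREADS_PER_BLOCK 256

__global__ void per_thread_mats(float *out)
{
    __shared__ float mat[THREADS_PER_BLOCK][3][3];   // 9216 bytes total
    int tId = threadIdx.x;

    // Each thread touches only its own 3x3 tile -- no sharing, no races.
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            mat[tId][i][j] = (i == j) ? 1.0f : 0.0f; // identity matrix

    __syncthreads();
    out[tId] = mat[tId][0][0] + mat[tId][1][1] + mat[tId][2][2];
}
```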

Does CUDA really NOT support simultaneous reading?