Threads Stuck in Device Emulation

I am trying to run a kernel with grid (500,1,1) and block (500,1,1). When run on hardware the code completes, but I'm not getting the expected results when I copy them back from device memory. So I switched to device emulation. Now only 8 threads pass through a __syncthreads() barrier and the rest of them are stuck at __syncthreads(). There is a piece of code before __syncthreads() which copies some data from global memory to shared memory. If I comment out the __syncthreads() statement, then still only those 8 threads go forward and produce results. I can be more specific with the code, but I'm not clear on why only 8 threads complete even with __syncthreads() commented out. How many threads will CUDA generate; isn't it 500*500? In gdb I can see all those threads being created, but only 8 of them exit.

Thanks for any input.


It’s impossible to say without more detail.

But I’m going to take a wild guess and say you have a race condition. For example if you have many threads attempting to simultaneously increment a single shared or global variable, the variable may appear to get incremented only 8 times because all the threads are stepping on each other.

Just a guess. I’d really need more detail to be able to say.
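To illustrate the kind of race I mean, here is a hypothetical sketch (not your code, names made up): every thread in a block bumps the same shared counter without atomics, so increments get lost.

```cuda
// Hypothetical sketch of a racy shared-memory increment.
__global__ void racyCount(int *gOut)
{
    __shared__ int sCounter;          // one copy per block, shared by all threads

    if (threadIdx.x == 0)
        sCounter = 0;                 // a single thread initializes it
    __syncthreads();

    sCounter++;                       // RACE: unsynchronized read-modify-write
    __syncthreads();

    if (threadIdx.x == 0)
        gOut[blockIdx.x] = sCounter;  // often far less than blockDim.x
}
```

The race-free version would use atomicAdd(&sCounter, 1) instead of the plain increment.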

The other threads do not go forward at all. Each thread updates a separate location in global memory. During emulation only 8 threads complete, while the others never start for some reason. The structure of the kernel code is as follows:


extern __shared__ float sData[];

tid = blockDim.x * blockIdx.x + threadIdx.x;
arrayIndex = (2 * threadIdx.x) % nAptPos;

// gObsPoints etc. are global memory arrays populated by host code before calling the kernel
// Total shared memory used = 600*5*4 ~ 12KB
// The code below copies the same data into shared memory for every thread; later each thread
// will copy different data, and then the syncthreads will make more sense

for (i = 0; i < 600; i++)
{
    sData[i]           = gObsPoints_x[i];
    sData[600 + i]     = gObsPoints_y[i];
    sData[2 * 600 + i] = gObsPoints_z[i];
    sData[3 * 600 + i] = gFreq[i];
    sData[4 * 600 + i] = gRange[i];
}

__syncthreads(); // Wait for all threads to reach this point

gOutputFile_real[tid] = 0.0;
gOutputFile_img[tid] = 0.0;

for (i = 0; i < 600; i++)
{
    // Do some calculation using the 5 arrays in shared memory
    gOutputFile_real[tid] += result_real;
    gOutputFile_img[tid] += result_img;
}

fprintf(stderr, "\n tid:: (%d) (%f %f)", tid, gOutputFile_real[tid], gOutputFile_img[tid]);



Only the first 8 threads print the last fprintf. One thing that might be wrong here is that each thread copies the same 12KB of data from global memory to shared memory. As far as I recall, I tried copying different data for different threads and it still got stuck in __syncthreads(). I will double-check, though.

Out of curiosity, are you just using device emulation to debug? Is there any reason why you can’t use cuda-gdb?

I thought it wasn't supported on 64-bit Linux, but I can see it running now without a problem. I will definitely be using it to debug the problem.

After some debugging I found that it was due to different threads overwriting each other's shared memory. I switched to global memory for storing all the data and it produced the correct result. This brings me to a very basic question about shared memory. The input data required by the program fits into shared memory, and that's why I want it in shared memory before starting to access it. There are three basic steps for each thread to execute:

  1. Copy some part of the input data to the shared memory.
  2. Sync with all other threads to make sure all input data is in the shared memory
  3. Do some calculation on all of the shared data and populate the corresponding position of the output array
    (Each output array element calculation requires all of the input data)
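For reference, the three steps might look roughly like this (a sketch only; array names are taken from my earlier snippet, while N_OBS = 600 and the per-thread math are placeholders):

```cuda
#define N_OBS 600  // assumed input size, as in the snippet above

__global__ void computeSketch(const float *gObsPoints_x, const float *gObsPoints_y,
                              const float *gObsPoints_z, const float *gFreq,
                              const float *gRange, float *gOut_real, float *gOut_img)
{
    extern __shared__ float sData[];   // 5 * N_OBS floats, sized at launch time
    int tid = blockDim.x * blockIdx.x + threadIdx.x;

    // Step 1: cooperative copy - each thread fills a different slice,
    // so no shared location is ever written by two threads.
    for (int i = threadIdx.x; i < N_OBS; i += blockDim.x) {
        sData[i]             = gObsPoints_x[i];
        sData[N_OBS + i]     = gObsPoints_y[i];
        sData[2 * N_OBS + i] = gObsPoints_z[i];
        sData[3 * N_OBS + i] = gFreq[i];
        sData[4 * N_OBS + i] = gRange[i];
    }

    // Step 2: wait until the whole block has finished copying.
    __syncthreads();

    // Step 3: every thread reads all of sData but writes only its own output slot.
    float acc_real = 0.0f, acc_img = 0.0f;
    for (int i = 0; i < N_OBS; i++) {
        acc_real += sData[i];              // placeholder for the real calculation
        acc_img  += sData[4 * N_OBS + i];  // placeholder for the real calculation
    }
    gOut_real[tid] = acc_real;
    gOut_img[tid]  = acc_img;
}
```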

Now, the question is how do I declare the shared data array in the kernel. Assuming there are k floats, if I declare

__shared__ float sData[k];

or pass the size from the host, will it create a single array of k elements shared by all the threads in a block, or will it try to create a separate copy for each thread?


When you use __shared__ float sData[k], k must be known at compile time.

If you want a dynamic size (still limited by the 16KB limit) you should use the extern keyword (look in the SDK for a sample).

Obviously, in both cases the shared array is shared by all the threads in the block - that is the whole meaning of shared memory: all threads in a block can read it.

Hope this helps.
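In code, the two styles look roughly like this (a sketch; the size K, k, and the kernel names are made up for illustration):

```cuda
#define K 512  // compile-time size for the static form (hypothetical value)

__global__ void staticShared(float *gOut)
{
    __shared__ float sData[K];        // size fixed at compile time; one copy per block
    sData[threadIdx.x % K] = (float)threadIdx.x;
    __syncthreads();
    gOut[threadIdx.x] = sData[0];     // every thread sees the same array
}

__global__ void dynamicShared(float *gOut, int k)
{
    extern __shared__ float sData[];  // size supplied at kernel launch
    sData[threadIdx.x % k] = (float)threadIdx.x;
    __syncthreads();
    gOut[threadIdx.x] = sData[0];
}

// Host side: the third launch parameter gives the dynamic shared size in bytes.
// dynamicShared<<<grid, block, k * sizeof(float)>>>(dOut, k);
```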


I initially implemented it based on that assumption, but it seems something weird is going on. At least debugging with cuda-gdb shows that the values are not getting written to shared memory, or something is overwriting the values already stored. I'm pasting a debugging session to clarify what's actually happening:


Breakpoint 1, computeSAR () at

106 sData[i] = gObsPoints_x[i];

Current language: auto; currently c++

(cuda-gdb) n

[Current CUDA Thread <<<(0,0),(0,0,0)>>>]

computeSAR () at

107 sData[nAptPos + i] = gObsPoints_y[i];

(cuda-gdb) n

[Current CUDA Thread <<<(0,0),(0,0,0)>>>]

computeSAR () at

108 sData[2 * nAptPos + i] = gObsPoints_z[i];

(cuda-gdb) n

[Current CUDA Thread <<<(0,0),(0,0,0)>>>]

computeSAR () at

109 sData[3 * nAptPos + i] = gFreq[i];

(cuda-gdb) p gFreq[i]

$1 = 9.28808038e+09

(cuda-gdb) p sData[3 * nAptPos + i]

$2 = 5693.09619


Even though I'm assigning the value from global memory, the shared memory does not end up holding the same value. Each thread writes to a different location, and nAptPos = 587. The sData array is of size 5 * nAptPos.