Threads Stuck in devce emulation

shibdas · May 20, 2009, 6:33am

Hi,
I am trying to run a kernel with grid (500,1,1) and block (500,1,1). When ran in hardware the code completes but I’m not getting the expected results when I copy the results from the device memory. I switched to the device emulation. Now, only 8 threads can pass through a __synthreads() barrier and rest of them are stuck on synthreasds. There is a piece a code before __synthreads() which copies some stuff from global memory to shared memory. If I comment out the _synthreads() statement then only those 8 threads go forward and produce results. I can be more specific with the code but I’m not clear why only 8 threads will complete with __synthreads() commented out. How many threads will CUDA generate, isn’t it 500*500? In gdb I can all those threads being created but only 8 of’em exits.

Thanks for any input.

Shibdas

Jamie_K · May 20, 2009, 1:34pm

It’s impossible to say without more detail.

But I’m going to take a wild guess and say you have a race condition. For example if you have many threads attempting to simultaneously increment a single shared or global variable, the variable may appear to get incremented only 8 times because all the threads are stepping on each other.

Just a guess. I’d really need more detail to be able to say.

shibdas · May 20, 2009, 4:36pm

The other threads do not go forward at all. Each thread is updating a separate location in global memory. During emulation only 8 threads complete while others don’t start at all for some reason. The structure of the kernel code is as follows

[codebox]

…

extern __shared__ float sData[];

tid = blockDim.x * blockIdx.x + threadIdx.x;

arrayIndex = (2 * threadIdx.x ) % nAptPos;

// gObsPoints etc. are global memory arrays populated by host code before calling the kernel

// Total shared memory used = 600*5*4 ~ 12KB

// The code below copies the same data into shared memory for every threads but each thread will copy different data

// and then the syncthreads will make more sense

for(i = 0; i < 600; i++)

{

    sData[i]                      = gObsPoints_x[i];

    sData[600 + i]     = gObsPoints_y[i];

    sData[2 * 600 + i] = gObsPoints_z[i];

    sData[3 * 600 + i] = gFreq[i];

    sData[4 * 600 + i] = gRange[i];

}

__syncthreads(); // Wait for all threads come to this point

gOutputFile_real[tid] = 0.0;

gOutputFile_img[tid] = 0.0;

for(i = 0; i < 600; i++)

{

 // Do some caluclation usind 5 arrays in shared memory

gOutputFIle_real[tid] += result_real;

gOutputFIle_img[tid] += result_img;

}

fprintf(stderr, “\n tid:: (%d) (%f %f)” tid, gOutputFile_real[tid], gOutputFile_img[tid]);

}

[/codebox]

Only first 8 threads prints the last fprintf. One thing which might go wrong here is the fact that each thread is copying 12KB of same data from global memory to shared memory. As far as I recall I tried copying different data for different threads and then also it got stuck in __syncthreads(). I will double check again though.

tmurray · May 20, 2009, 5:18pm

Out of curiosity, are you just using device emulation to debug? Is there any reason why you can’t use cuda-gdb?

shibdas · May 20, 2009, 5:33pm

I thought it’s not supported on 64 bit linux but I can see it now running without a problem. I will definitely be using it to debug the problem.

shibdas · May 21, 2009, 7:44pm

After some debugging I found out that it was due to shared memory overwriting done by different threads. I switched to global memory for storing all data and it produced the correct result. It brings me to a very basic question about the shared memory. Input data required by the program can fit into the shared memory and that’s why I want them into shared memory before starting to access them. There are three basic steps for each thread to execute

Copy some part of the input data to the shared memory.
Sync with all other threads to make sure all input data is in the shared memory
Do some calculation on all of the shared data and populate the corresponding position of the output array
(Each output array element calculation requires all of the input data)

Now, the question how do I declare the shared data array in the kernel. Assuming there are k floats, if I declare
__shared float[k] or pass the size from the host

will it create a single array of k elements for all the threads in a block or will it try to create a different copy for each thread?

eyalhir74 · May 21, 2009, 7:53pm

Hi,

When you use shared float[k] → k should be known in compile time.

If you want dynamic size (still limited by the 16K limitation) you should use the extern keyword (look in the SDKs for a sample)

obviously in both cases the shared array will be shared by whole threads in the block - this is the whole meaning of

shared memory. so that it is shared by all threads in a block to read.

hope this helps

eyal

shibdas · May 22, 2009, 12:32am

I initially implemented based on that assumption but it seems something weird is going on. At least debugging with cuda-gdb shows the values are not getting written to the shared memory or someone is overwriting the values already stored. I’m pasting a debugging session to clarify what’s actually happening

[codebox]

Breakpoint 1, computeSAR () at SAR_kernel.cu:106

106 sData[i] = gObsPoints_x[i];

Current language: auto; currently c++

(cuda-gdb) n

[Current CUDA Thread <<<(0,0),(0,0,0)>>>]

computeSAR () at SAR_kernel.cu:107

107 sData[nAptPos + i] = gObsPoints_y[i];

(cuda-gdb) n

[Current CUDA Thread <<<(0,0),(0,0,0)>>>]

computeSAR () at SAR_kernel.cu:108

108 sData[2 * nAptPos + i] = gObsPoints_z[i];

(cuda-gdb) n

[Current CUDA Thread <<<(0,0),(0,0,0)>>>]

computeSAR () at SAR_kernel.cu:109

109 sData[3 * nAptPos + i] = gFreq[i];

b p gFreq[i]

$1 = 9.28808038e+09

(cuda-gdb) p sData[3 * nAptPos + i]

$2 = 5693.09619[/b]

[/codebox]

Even if I’m assigning the value from the global memory it’s not having the same value in shared memory. Each thread is writing onto a different location and nAptPos = 587. The sData array is of size 5 * nAptPos.

Topic		Replies	Views
One thread write into shared memory has limit? CUDA Programming and Performance	4	677	July 28, 2022
Problem when using more than 64 threads per block CUDA Programming and Performance	3	4337	May 4, 2010
Simple Thread Problem CUDA Programming and Performance	1	4077	September 24, 2009
Different result between emulation and real intractable bug CUDA Programming and Performance	2	4219	December 13, 2007
Shared Memory Problems - __syncthreads() doesn't work? CUDA Programming and Performance	5	2675	December 29, 2011
__syncthreads and shared memory CUDA Programming and Performance	21	4627	June 15, 2011
CUDA BUG? Shared memory contents differ across threads __syncthreads() not working??? CUDA Programming and Performance	1	1895	September 10, 2009
Shared Memory Problem memory shared only within blocks? CUDA Programming and Performance	4	6015	February 8, 2008
Cuda: threads over 2 warps not synchronising correctly Legacy PGI Compilers	5	6960	May 26, 2011
__syncthreads() + shared memory issue CUDA Programming and Performance	7	5704	August 26, 2008

Threads Stuck in devce emulation

Related topics