[cuda-gdb] strange local var value in different threads of the same kernel call

I’ve been having problems debugging several of my CUDA kernels with cuda-gdb. In the following code sample, numThreads and threadID don’t seem to hold the correct values in some threads.

[codebox]
__global__ void ChopScale(cufftComplex * d_RawData,
                          const unsigned int sample_cnt)
{
    // total number of threads in the grid (used as the loop stride)
    const int numThreads = blockDim.x * gridDim.x;
    // global index of this thread within the grid
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;

    // each thread scales elements threadID, threadID + numThreads, ...
    for (unsigned int i=threadID; i<sample_cnt; i+=numThreads)
    {
        d_RawData[i].x = d_RawData[i].x * 2;
        d_RawData[i].y = d_RawData[i].y * 2;
    }

    __syncthreads(); // wait for all threads
}
[/codebox]

I launch the kernel from my main program with the simple setup below; d_RawData is a 1D vector with more than 1024 elements.

[codebox]
#define FFTWIDTH 1024

dim3 mygridDim, myblkDim;
mygridDim = dim3(1);
myblkDim = dim3(128);

ChopScale<<<mygridDim, myblkDim>>>((cufftComplex *)d_RawData, FFTWIDTH);
[/codebox]

In thread <<<(0,0),(0,0,0)>>> the values look fine:

[codebox]
(cuda-gdb) n
[Current CUDA Thread <<<(0,0),(0,0,0)>>>]
ChopScale () at cudamemtest.cu:106
106 d_RawData[i].y = d_RawData[i].y * 2;
(cuda-gdb) i locals
i = 128
numThreads = 128
threadID = 0
d_RawData = (cufftComplex * const @global) 0xb66d3008
sample_cnt = 1024
[/codebox]

But if I switch to a different thread:

[codebox]
(cuda-gdb) thread <<<(0,0),(95,0,0)>>>
Switching to <<<(0,0),(95,0,0)>>> ChopScale () at cudamemtest.cu:106
[Current CUDA Thread <<<(0,0),(95,0,0)>>>]
ChopScale () at cudamemtest.cu:106
106 d_RawData[i].y = d_RawData[i].y * 2;
(cuda-gdb) i locals
i = 255
numThreads = 128
threadID = 127
d_RawData = (cufftComplex * const @global) 0xb66d3008
sample_cnt = 1024
[/codebox]

Here threadID shows a wrong value of 127 when it should be 95, and the same sometimes happens to numThreads too.

Another thing that confuses me: when I print the address of threadID in both threads, I get the same memory address. Is that how it is supposed to be?

Has anyone had a similar problem? Any suggestions or solutions? Thank you so much.

My configuration is:

MacBook Unibody (Late 2008) with GeForce 9400M

Ubuntu 9.04 32bit with kernel 2.6.28-11-generic

gcc/g++ version 4.3.3

cuda toolkit 2.3_linux_32_ubuntu9.04

cuda driver 2.3_linux_32_190.18

Merry Christmas and Happy New Year =)

Upgrade to 3.0b1. The debugger team fixed a lot of the bugs I reported with local variables not being visible or being incorrect between 2.3 and 3.0. Hopefully this is one of them.