Another emulation vs. release/debug bug?

Do you have any idea why the following code works only in emulation mode and not in release/debug?

[codebox]__global__ static void bitonicSort(int * values)
{
    extern __shared__ int shared[];

    const unsigned int tid = threadIdx.x;

    // Copy input to shared mem.
    shared[tid] = values[tid];
    shared[NUM_THREADS + tid] = values[NUM_THREADS + tid];

    __syncthreads();

    // Even threads sort their pair ascending, odd threads descending.
    bool dir = ((tid & 1) == 0);

    // Each thread owns one adjacent pair of elements.
    unsigned int startPos = tid + tid;
    unsigned int endPos = startPos + 1;

    for (unsigned int j = 1; j>0; j = (j>>1))
    {
        if ((shared[startPos] > shared[endPos]) == dir)
        {
            swap(shared[startPos], shared[endPos]);
        }

        __syncthreads();
    }

    // Write result.
    values[tid] = shared[tid];
    values[NUM_THREADS + tid] = shared[NUM_THREADS + tid];
}[/codebox]

It’s quite simple… it should just put every even-numbered pair of values in ascending order and every odd-numbered pair in descending order

so a sequence:

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

should end as:

14 15 13 12 10 11 9 8 6 7 5 4 2 3 1 0

but in release/debug it ends like:

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 (nothing is changed)

in emu, it works as expected
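For comparison, the intended result can be reproduced on the host. The following is only a minimal sketch of the single pass described above, not code from the attached project: `expectedPairOrder` is a hypothetical helper name, and it assumes that pair number `p` corresponds to thread `tid == p` in the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Host-side reference for the one pass the kernel performs.
// Pair p covers elements 2p and 2p+1; even-numbered pairs end up
// ascending, odd-numbered pairs descending.
// NOTE: expectedPairOrder is a hypothetical name used for illustration.
std::vector<int> expectedPairOrder(std::vector<int> values)
{
    // The kernel handles two elements per thread, so the size must be even.
    assert(values.size() % 2 == 0);

    for (std::size_t pair = 0; pair < values.size() / 2; ++pair) {
        const bool dir = ((pair & 1) == 0);  // same direction flag as in the kernel
        int &a = values[2 * pair];
        int &b = values[2 * pair + 1];
        if ((a > b) == dir) {
            std::swap(a, b);
        }
    }
    return values;
}
```

Feeding it the 16-element sequence above and comparing the returned vector with what the kernel writes back to `values` shows which pairs the GPU build leaves untouched.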

Note that if I manually unroll the j-loop then it works (the loop body runs only once anyway)

The code is a heavily modified and simplified version of the bitonic sort algorithm from the SDK, so it needs cutil… just copy it into a new dir in the SDK directory

I tried it with the newest CUDA 2.1 drivers + toolkit and the older 2.0 drivers + toolkit… no difference

(using Vista64 with the 32-bit toolkit)

project is attached.

Thanks for help.
bitonic.zip (4.94 KB)

I’m guessing this line:
if ((shared[startPos] > shared[endPos]) == dir)
doesn’t work the same way on a GPU as on x86. Everything else, race conditions included, looks fine.

Use decuda to see what the machine code looks like.
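A quick way to test that guess before reading the disassembly might be to replace the condition with a form that does not compare two bools. This is only a sketch: the function names are hypothetical, and inside the kernel they would need a `__device__` qualifier. On the host the two forms are logically identical, so if swapping one for the other changes the GPU result, that would point at code generation rather than at the logic.

```cpp
#include <cassert>

// Form used in the kernel: compares a comparison result against a bool.
bool shouldSwapOriginal(int a, int b, bool dir)
{
    return (a > b) == dir;
}

// Hypothetical rewrite that avoids the bool == bool comparison.
bool shouldSwapExplicit(int a, int b, bool dir)
{
    return dir ? (a > b) : (a <= b);
}

// Host-side check that both forms agree for every (a, b) in [lo, hi]
// and for both sort directions.
void checkFormsAgree(int lo, int hi)
{
    for (int a = lo; a <= hi; ++a) {
        for (int b = lo; b <= hi; ++b) {
            for (int d = 0; d < 2; ++d) {
                const bool dir = (d != 0);
                assert(shouldSwapOriginal(a, b, dir) == shouldSwapExplicit(a, b, dir));
            }
        }
    }
}
```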

Thanks for the response, but I don’t think that’s the reason it doesn’t work. As I already mentioned, it works fine if I unroll the loop. Probably just a bug… Is there an official way to report it? (I really tried to find one… unfortunately without any success.)

Did you look at the forum sticky? http://forums.nvidia.com/index.php?showtopic=28127

BTW, it is best to post a complete, compilable example (i.e., one that can be built with nvcc -o test test.cu) that demonstrates the problem.

Yes, I looked there, but I had missed this topic. Thanks