Do you have any idea why the following code works only in emulation mode and not in release/debug?
[codebox]__global__ static void bitonicSort(int * values)
{
    extern __shared__ int shared[];
    const unsigned int tid = threadIdx.x;

    // Copy input to shared mem.
    shared[tid] = values[tid];
    shared[NUM_THREADS + tid] = values[NUM_THREADS + tid];
    __syncthreads();

    bool dir = ((tid & 1) == 0);
    unsigned int startPos = tid + tid;
    unsigned int endPos = startPos + 1;

    for (unsigned int j = 1; j > 0; j = (j >> 1))
    {
        if ((shared[startPos] > shared[endPos]) == dir)
        {
            swap(shared[startPos], shared[endPos]);
        }
        __syncthreads();
    }

    // Write result.
    values[tid] = shared[tid];
    values[NUM_THREADS + tid] = shared[NUM_THREADS + tid];
}[/codebox]
It's quite simple: each thread handles one pair of adjacent values. It should sort every even-numbered pair into ascending order and every odd-numbered pair into descending order,
so the sequence:
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
should end as:
14 15 13 12 10 11 9 8 6 7 5 4 2 3 1 0
but in release/debug it ends up as:
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 (nothing has changed)
In emulation mode it works as expected.
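For comparison, the expected result can be reproduced on the host with a small sketch like the one below. The helper name `referencePairSort` is hypothetical and not part of the attached project. It only mirrors the per-pair logic of the kernel (one loop iteration standing in for one thread), so it says nothing about why the device build misbehaves.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Host-side reference (hypothetical helper, not part of the attached project).
// Pair p covers elements 2p and 2p+1, just like thread `tid` in the kernel.
// Even-numbered pairs end up ascending, odd-numbered pairs descending.
std::vector<int> referencePairSort(std::vector<int> values)
{
    for (std::size_t pair = 0; 2 * pair + 1 < values.size(); ++pair)
    {
        const bool dir = ((pair & 1) == 0);  // same role as `dir` in the kernel
        int &a = values[2 * pair];           // shared[startPos] in the kernel
        int &b = values[2 * pair + 1];       // shared[endPos] in the kernel
        if ((a > b) == dir)
        {
            std::swap(a, b);
        }
    }
    return values;
}
```

Calling `referencePairSort` on the sequence 15 down to 0 gives 14 15 13 12 10 11 9 8 6 7 5 4 2 3 1 0, which is what the emulation build produces and what the release/debug build should produce.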
Note that if I manually unroll the j-loop, it works (the loop body runs only once anyway).
The code is a heavily modified and simplified version of the bitonic sort sample from the SDK, so it needs cutil; just copy it into a new directory in the SDK directory.
I tried it with the newest CUDA 2.1 drivers + toolkit and the older 2.0 drivers + toolkit, with no difference
(using Vista 64 with the 32-bit toolkit).
The project is attached.
Thanks for any help.
bitonic.zip (4.94 KB)