Weird error recurring every 512 elements

I am trying to compact a vector (remove all zeros). To do so I plan to use a prefix scan, so I took the scan_largearray example in the SDK and modified loadSharedChunkFromMem (num_elements = 2048). I have been struggling with the following problem (using Visual Studio 2005, Windows XP, EmuDebug). In the following code, g_idata[0…2047] = 1.0f and threshold[0…2047] = 0.0f.

The surprise is that the original code gives the right result: scan_largearray returns an array with result[i] = i for every element. The modified code, however, breaks at every 512th element, where the running count falls back to a small value instead of continuing.

               was     now

result[0]        0       0
result[1]        1       1
...
result[511]    511     511
result[512]    512       1
result[513]    513       2
...
result[1023]  1023     512
result[1024]  1024       2
result[1025]  1025       3

Am I missing something silly?

__device__ void loadSharedChunkFromMem(const float *g_idata,
                                       const float *threshold,
                                       int& ai, int& bi,
                                       int& mem_ai, int& mem_bi)
{
    mem_ai = __mul24(blockIdx.x, (blockDim.x << 1)) + threadIdx.x;
    mem_bi = mem_ai + blockDim.x;

    ai = threadIdx.x;
    bi = threadIdx.x + blockDim.x;

    // compute spacing to avoid bank conflicts
    ai += (ai >> LOG_NUM_BANKS);
    bi += (bi >> LOG_NUM_BANKS);

    // was
    //TEMP(ai) = g_idata[mem_ai];
    //TEMP(bi) = g_idata[mem_bi];

    // now
    TEMP(ai) = (g_idata[mem_ai] >= threshold[mem_ai]) ? 1.0f : 0.0f;
    TEMP(bi) = (g_idata[mem_bi] >= threshold[mem_bi]) ? 1.0f : 0.0f;
}
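
For comparison, this is what I expect the modified version to produce, written as a plain host-side loop. It is only a sketch: referenceFlagScan is a name made up for this post, not something from the SDK, and it assumes the scan is exclusive (result[0] = 0), which is what the "was" column shows. With every input at 1.0f and every threshold at 0.0f, each flag is 1, so the expected output is result[i] = i over the whole array.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not part of the SDK sample.
// result[i] = number of elements j < i with data[j] >= threshold[j],
// i.e. an exclusive prefix sum over the 1/0 flags.
std::vector<float> referenceFlagScan(const std::vector<float>& data,
                                     const std::vector<float>& threshold)
{
    std::vector<float> result(data.size());
    float running = 0.0f;
    for (std::size_t i = 0; i < data.size(); ++i) {
        result[i] = running;  // count of flagged elements before index i
        running += (data[i] >= threshold[i]) ? 1.0f : 0.0f;
    }
    return result;
}
```

Comparing the device output against this loop element by element should at least show exactly where the two first diverge.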