Parallel Bit Operations on one char

Hi all,

I've got a problem in CUDA that has kept me busy for two days now. It's all about this line of code, executed on the device:

s_result[(tx+ty*bw)/8] |= 1<<tx%8;

Eight threads in parallel each shift a 1 to the left by (thread ID modulo 8) and then OR the result into one char.

It works fine in device-emulation mode, but I receive incorrect values under true parallel execution. Any ideas why? Is it even possible to access one char from several threads simultaneously?

Thanks

Picknick3r

In general you will get undefined results if multiple threads write to the same memory location at the same time. It doesn’t matter that you are using a logical OR to change only a single bit in the word: the compiler will likely implement this as a read of the word into a register, followed by the OR, followed by a write, and there is no guarantee that another thread won’t interleave with that sequence.

This is what the atomic operations in compute 1.1 were designed for.
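For reference, here is a minimal sketch of the compute 1.1 approach, assuming the result is instead kept as packed 32-bit words in global memory (g_result, bw and the kernel name are hypothetical); atomicOr on 32-bit words in global memory requires compute capability 1.1:

__global__ void setBits(unsigned int *g_result, int bw)
{
    int bit = threadIdx.x + threadIdx.y * bw;          // linear bit index for this thread
    atomicOr(&g_result[bit / 32], 1u << (bit % 32));   // race-free read-modify-write
}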

Is s_result in shared memory?

Mmh, this is bad news, since I’m developing on a GF8800GTS, which as far as I know doesn’t support 1.1.

Any advice on how I could do something like that on 1.0?

Yes s_result is in shared memory.

You could exploit the warp using some kind of tag, or do a reduction.
Or, I once got around this by rearranging my kernel so that accesses to the same element are distributed over blockIdx.y instead of threadIdx.x. Though theoretically the result is undefined, I have a very large blockIdx.x and in practice I always get the correct result. It may hurt performance a bit, though.
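Here is a minimal sketch of the reduction idea mentioned above, assuming a 1D block of at most 256 threads whose size is a multiple of 32, with each thread passing in either its bit mask or 0 (all names here are hypothetical):

__device__ void orReduce(unsigned int myBit, unsigned int *s_out)
{
    __shared__ unsigned int s_bits[256];   // one slot per thread, so no write conflicts
    int tx = threadIdx.x;

    s_bits[tx] = myBit;                    // e.g. condition ? (1u << (tx % 32)) : 0
    __syncthreads();

    // Tree reduction: OR neighbouring slots until one packed word per 32 threads remains.
    for (int stride = 16; stride > 0; stride >>= 1) {
        if ((tx % 32) < stride)
            s_bits[tx] |= s_bits[tx + stride];
        __syncthreads();
    }
    if ((tx % 32) == 0)
        s_out[tx / 32] = s_bits[tx];       // lane 0 of each warp writes its packed word
}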

More info here. There has been no hint of atomic ops for shared memory coming at any time in the future…

Eric

Hi all,

I am still working on this problem. Here is what the current status looks like:

...

#define BIT(_pos_)    (1<<(_pos_))
#define BYTEIDX(_x_)  ((_x_)/8)
#define BITIDX(_x_)   ((_x_)%8)

...

// Independent pixel index: thread thid handles pixel (thid%32)*8 + thid/32
int indPix = (thid%32)*8 + thid/32;

...

if (getPixel(s_in1data, indPix) == getPixel(s_in2data, indPix)) {
    s_index[BYTEIDX(indPix)] &= ~BIT(BITIDX(indPix));   // clear the pixel's bit
} else {
    s_index[BYTEIDX(indPix)] |= BIT(BITIDX(indPix));    // set the pixel's bit
}

...

By introducing the indPix index I tried to avoid parallel writes to the same char by using the warp size of 32 to spread out the pixel accesses: only threads whose IDs are 32 apart write to the same char, and since only 32 threads are executed truly in parallel, it should work.

Sadly it doesn’t :no: , I am still getting different results between device emulation and native execution.

Any help is appreciated.

Thx Stefan

How about:

for (int i = 0; i < 32; i++) {
    __syncthreads();
    if (i == threadIdx.x % 32) { /* perform the "atomic" op here */ }
}
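A minimal sketch of how this loop could wrap the original statement, as a slight restatement that uses a single linear index (the hypothetical linId) for both the byte and the bit, with tx, ty, bw and s_result as in the first post; with that indexing only threads fewer than 8 apart ever share a byte, so they always land in different iterations:

int linId = tx + ty * bw;                // hypothetical linear thread index
for (int i = 0; i < 32; i++) {
    __syncthreads();                     // separates iterations across warps
    if (i == (linId % 32)) {
        // the threads active in this iteration are 32 apart, so they never
        // touch the same byte and the read-modify-write cannot interleave
        s_result[linId / 8] |= 1 << (linId % 8);
    }
}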

I think that will work, but it will take 32 times longer than the other version. I mainly want to understand why my version doesn’t work, so I can avoid errors like that in the future.

Thanks anyways, alex.

Picknick3r

Though only the threads in one warp are executed truly concurrently, warps are time-sliced.
The &= and |= each translate to several instructions, including a load and a store. If a time slice happens between the load and the store, the result is corrupted.
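As a minimal illustration (a hypothetical expansion for clarity, not the compiler's actual output), a single s_index[b] |= mask effectively becomes:

__device__ void orIntoByte(unsigned char *s_index, int b, unsigned char mask)
{
    unsigned char r = s_index[b];  // load  -- another warp may run between here...
    r = r | mask;                  // modify
    s_index[b] = r;                // ...and here, so this store can overwrite a bit
                                   //    that the other warp set in the meantime
}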

I did something similar and used a function much like this one. Whether it’s the fastest alternative I don’t know, but at least it solved my problem at sufficient speed.

#define BLKWID 160

__shared__ unsigned int buffer[BLKWID];

const int tx = threadIdx.x;
int indPix = 32*blockIdx.y*width + (BLKWID*blockIdx.x + tx);

unsigned int value = 0;                       // unsigned, so >> doesn't smear the sign bit
for (int y = 0; y < 32; y++) {
    value >>= 1;                              // make room for this row's bit
    if (s_in1data[indPix] == s_in2data[indPix])
        value |= 0x80000000;                  // set the top bit for the current row
    indPix += width;                          // step to the next row
}
buffer[tx] = value;                           // each thread owns its own slot: no conflicts

My function may be slower than another solution that’s specific to your algorithm, but it should be much faster than an atomic op to global memory. If you need per-block or per-warp atomics, I’d suggest implementing them in this round-robin way using shared memory.

Thanks for all your help, this is a great forum. I implemented alex’s function and am getting correct results now.