atomicAnd on shared unsigned int slower than for r1.64

wlangdon · October 16, 2010, 1:27pm

Any ideas why/when:

[codebox] //compress D[8][29] into Dw[8]

if(threadIdx.x < 8) shared_Dw[threadIdx.x] = ~0;

if(shared_D[threadIdx.x]==0) 

  atomicAnd(&shared_Dw[threadIdx.x/29],notpow2[j]);

__syncthreads();

[/codebox]

would be slower than the equivelent for loop:

[codebox] if(threadIdx.x < 8) {

unsigned int maskj = 1;

int Dindex = threadIdx.x;

shared_Dw[Dindex] = ~0;

for(int j=0;j<29;j++) {

   if(shared_D[threadIdx.x*29+j]==0) shared_Dw[Dindex] &= ~maskj;

   if(maskj==0x80000000) { maskj = 1; Dindex++; shared_Dw[Dindex] = ~0;

   } else {                maskj <<= 1; }

}

}

[/codebox]

block size = 232 in both cases.

notpow2 is a _constant array of 32 unsigned ints starting with ~0x1

wlangdon · October 16, 2010, 1:27pm

Any ideas why/when:

[codebox] //compress D[8][29] into Dw[8]

if(threadIdx.x < 8) shared_Dw[threadIdx.x] = ~0;

if(shared_D[threadIdx.x]==0) 

  atomicAnd(&shared_Dw[threadIdx.x/29],notpow2[j]);

__syncthreads();

[/codebox]

would be slower than the equivelent for loop:

[codebox] if(threadIdx.x < 8) {

unsigned int maskj = 1;

int Dindex = threadIdx.x;

shared_Dw[Dindex] = ~0;

for(int j=0;j<29;j++) {

   if(shared_D[threadIdx.x*29+j]==0) shared_Dw[Dindex] &= ~maskj;

   if(maskj==0x80000000) { maskj = 1; Dindex++; shared_Dw[Dindex] = ~0;

   } else {                maskj <<= 1; }

}

}

[/codebox]

block size = 232 in both cases.

notpow2 is a _constant array of 32 unsigned ints starting with ~0x1

wlangdon · October 16, 2010, 2:48pm

opps turns out to be a bug
Sorry
Bill
ps: is there a way to remove erronous posts?

wlangdon · October 16, 2010, 2:48pm

opps turns out to be a bug
Sorry
Bill
ps: is there a way to remove erronous posts?

SPWorley · October 16, 2010, 6:28pm

You might also look at the __ballot() intrinsic, which can do what you want much faster and easier… it accumulates 32 bitflags per warp into one word using a single instruction.

SPWorley · October 16, 2010, 6:28pm

You might also look at the __ballot() intrinsic, which can do what you want much faster and easier… it accumulates 32 bitflags per warp into one word using a single instruction.

wlangdon · October 16, 2010, 7:48pm

Dear Steve,
thanks for the suggestion.
At present I am running GeForce GTX 295 which I think means its a 1.3 device
and so does not support __ballot() which I think needs 2.0
However the C2050 should do.
Thanks again
Bill

wlangdon · October 16, 2010, 7:48pm

Dear Steve,
thanks for the suggestion.
At present I am running GeForce GTX 295 which I think means its a 1.3 device
and so does not support __ballot() which I think needs 2.0
However the C2050 should do.
Thanks again
Bill