Any ideas why/when:
[codebox] //compress D[8][29] into Dw[8]
if(threadIdx.x < 8) shared_Dw[threadIdx.x] = ~0;
if(shared_D[threadIdx.x]==0)
atomicAnd(&shared_Dw[threadIdx.x/29],notpow2[j]);
__syncthreads();
[/codebox]
would be slower than the equivelent for loop:
[codebox] if(threadIdx.x < 8) {
unsigned int maskj = 1;
int Dindex = threadIdx.x;
shared_Dw[Dindex] = ~0;
for(int j=0;j<29;j++) {
if(shared_D[threadIdx.x*29+j]==0) shared_Dw[Dindex] &= ~maskj;
if(maskj==0x80000000) { maskj = 1; Dindex++; shared_Dw[Dindex] = ~0;
} else { maskj <<= 1; }
}
}
[/codebox]
block size = 232 in both cases.
notpow2 is a _constant array of 32 unsigned ints starting with ~0x1
Any ideas why/when:
[codebox] //compress D[8][29] into Dw[8]
if(threadIdx.x < 8) shared_Dw[threadIdx.x] = ~0;
if(shared_D[threadIdx.x]==0)
atomicAnd(&shared_Dw[threadIdx.x/29],notpow2[j]);
__syncthreads();
[/codebox]
would be slower than the equivelent for loop:
[codebox] if(threadIdx.x < 8) {
unsigned int maskj = 1;
int Dindex = threadIdx.x;
shared_Dw[Dindex] = ~0;
for(int j=0;j<29;j++) {
if(shared_D[threadIdx.x*29+j]==0) shared_Dw[Dindex] &= ~maskj;
if(maskj==0x80000000) { maskj = 1; Dindex++; shared_Dw[Dindex] = ~0;
} else { maskj <<= 1; }
}
}
[/codebox]
block size = 232 in both cases.
notpow2 is a _constant array of 32 unsigned ints starting with ~0x1
opps turns out to be a bug
Sorry
Bill
ps: is there a way to remove erronous posts?
opps turns out to be a bug
Sorry
Bill
ps: is there a way to remove erronous posts?
You might also look at the __ballot() intrinsic, which can do what you want much faster and easier… it accumulates 32 bitflags per warp into one word using a single instruction.
You might also look at the __ballot() intrinsic, which can do what you want much faster and easier… it accumulates 32 bitflags per warp into one word using a single instruction.
Dear Steve,
thanks for the suggestion.
At present I am running GeForce GTX 295 which I think means its a 1.3 device
and so does not support __ballot() which I think needs 2.0
However the C2050 should do.
Thanks again
Bill
Dear Steve,
thanks for the suggestion.
At present I am running GeForce GTX 295 which I think means its a 1.3 device
and so does not support __ballot() which I think needs 2.0
However the C2050 should do.
Thanks again
Bill