Any ideas why/when:

[codebox]

```
//compress D[8][29] into Dw[8]
if(threadIdx.x < 8) shared_Dw[threadIdx.x] = ~0;
if(shared_D[threadIdx.x]==0)
    atomicAnd(&shared_Dw[threadIdx.x/29],notpow2[j]);
__syncthreads();
```

[/codebox]

would be slower than the equivalent for loop:

[codebox]

```
if(threadIdx.x < 8) {
    unsigned int maskj = 1;
    int Dindex = threadIdx.x;
    shared_Dw[Dindex] = ~0;
    for(int j=0;j<29;j++) {
        if(shared_D[threadIdx.x*29+j]==0) shared_Dw[Dindex] &= ~maskj;
        if(maskj==0x80000000) {
            maskj = 1; Dindex++; shared_Dw[Dindex] = ~0;
        } else {
            maskj <<= 1;
        }
    }
}
```

[/codebox]

block size = 232 in both cases.

notpow2 is a `__constant__` array of 32 unsigned ints starting with ~0x1
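
For anyone trying to reproduce this, here is a rough host-side model of what the two variants are meant to compute. It is only a sketch: it assumes `j` in the first snippet is `threadIdx.x % 29` (the post does not show where `j` comes from), and the names `compressPerElement`, `compressPerRow`, `notPow2At` and `variantsAgree` are made up for illustration. It is plain C++, so it says nothing about timing; it only pins down the expected contents of Dw.

```cpp
#include <cstdint>

constexpr int ROWS = 8;   // D[8][29], one output word per row
constexpr int COLS = 29;

// Stand-in for the notpow2 table: entry j is ~(1 << j), so entry 0 is ~0x1.
inline uint32_t notPow2At(int j) { return ~(uint32_t(1) << j); }

// Model of the atomicAnd variant: one "thread" t per element of D.
// ASSUMPTION: j = t % COLS (not shown in the original snippet).
void compressPerElement(const int* D, uint32_t* Dw) {
    for (int w = 0; w < ROWS; ++w) Dw[w] = ~uint32_t(0);
    // On the GPU, a __syncthreads() would be needed at this point so that
    // no warp issues its atomicAnd before the words are initialised.
    for (int t = 0; t < ROWS * COLS; ++t)
        if (D[t] == 0) Dw[t / COLS] &= notPow2At(t % COLS);
}

// Model of the for-loop variant: one "thread" per row, 29 iterations each.
void compressPerRow(const int* D, uint32_t* Dw) {
    for (int row = 0; row < ROWS; ++row) {
        uint32_t word = ~uint32_t(0);
        uint32_t maskj = 1;
        for (int j = 0; j < COLS; ++j) {
            if (D[row * COLS + j] == 0) word &= ~maskj;
            maskj <<= 1;  // never wraps: only 29 of the 32 bits are used
        }
        Dw[row] = word;
    }
}

// True when both models produce the same 8 words for the given D.
bool variantsAgree(const int* D) {
    uint32_t a[ROWS], b[ROWS];
    compressPerElement(D, a);
    compressPerRow(D, b);
    for (int w = 0; w < ROWS; ++w)
        if (a[w] != b[w]) return false;
    return true;
}
```

Copying shared_Dw back to the host and comparing it against either function is a cheap way to confirm that both kernels really do the same work before comparing their timings.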