Hi,
I'm doing a 256-bin histogram calculation and I'm having problems with emulating atomic operations in shared memory.
I tried an approach similar to the example in the SDK, but with a different layout of the subhistograms in shared memory.
I work with compute capability 1.1.
Here is the basic idea:
- 1 subhistogram per bank in shared memory
- each thread of a half-warp is responsible for 1 subhistogram
- blocksize is 32x8; 8 warps → 16 half-warps
- uint16 bins → 4-bit tag, 12-bit counter: msb- tttt cccccccccccc -lsb
- different half-warp → different tag
- access to shared memory via uint32, masking of subbins (uint16)
memory layout:

      bank0                 bank1                        bank15
 ------uint32------    ------uint32------          ------uint32------
 -uint16-  -uint16-    -uint16-  -uint16-   .....  -uint16-  -uint16-
 -   0  -  - 128  -    -   0  -  - 128  -          -   0  -  - 128  -
 -   1  -  - 129  -    -   1  -  - 129  -          -   1  -  - 129  -
    ...       ...         ...       ...               ...       ...
 - 126  -  - 254  -    - 126  -  - 254  -          - 126  -  - 254  -
 - 127  -  - 255  -    - 127  -  - 255  -          - 127  -  - 255  -
__shared__ uint32 pSubHistograms[(HISTOGRAM_BINS / 2) * NUMSMBANKS]; // 128 x 16 uint32 words

// clear shared memory (code omitted)
uint32 bankId = threadIdx.x & 0xF;                            // threadIdx.x (0-31) % 16
uint8  pixel  = pImage[globalThreadId];
uint32 tag    = (threadIdx.y * 2 + (threadIdx.x >> 4)) << 28; // halfwarpId (0-15) in the 4 most significant bits of the uint32
uint8  subbin = (pixel & 0x80) >> 7;                          // pixel <= 127 -> 0, pixel > 127 -> 1
uint16 binId  = (pixel & 0x7F) * NUMSMBANKS + bankId;
uint32 bins;
do {
    bins = pSubHistograms[binId] & 0x0FFFFFFF;                  // strip the 4 most significant bits (old tag)
    bins = (bins + (1 << (subbin * sizeof(uint16) * 8))) | tag; // increment subbin 0 or 1, re-apply my tag
    pSubHistograms[binId] = bins;
} while (pSubHistograms[binId] != bins);                        // retry until my tagged write survives
As each thread within a half-warp accesses one bank, there are no bank conflicts; write conflicts can only occur between threads with the same threadIdx.x & 0xF but different halfwarpIds.
I thought tagging each half-warp would solve this. Nevertheless, the code produces correct results in emulation mode but fails in release.
Thanks for any ideas,
downforme