shared mem atomics with cc1.1

Hi,

I'm doing a 256-bin histogram calculation and I'm having problems with emulating atomic operations in shared memory.

I tried an approach similar to the example in the SDK, but with a different layout of the subhistograms in shared memory (SM).
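For reference, the tagged increment from the SDK's histogram256 sample looks roughly like this (quoted from memory, so check it against your SDK version); note that it uses one subhistogram per warp and a 5-bit per-thread tag, so collisions only occur inside a single warp:

__device__ void addData256(volatile unsigned int* s_WarpHist, unsigned int data, unsigned int threadTag)
{
    unsigned int count;
    do {
        count = s_WarpHist[data] & 0x07FFFFFFU; // 27-bit counter, 5-bit tag
        count = threadTag | (count + 1);        // threadTag = thread index within the warp, shifted to the top 5 bits
        s_WarpHist[data] = count;
    } while (s_WarpHist[data] != count);
}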

I am working with compute capability 1.1.

Here is the basic idea:

  • 1 subhistogram per bank in shared memory

  • each thread of a halfwarp is responsible for 1 subhistogram

  • block size is 32x8; 8 warps → 16 halfwarps

  • uint16 bins → 4-bit tag, 12-bit counter: msb- ttttcccccccccccc -lsb

  • different halfwarp → different tag

  • access to shared memory via uint32, masking of subbins (uint16) (the types and constants I use are sketched right after this list)
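Since uint8/uint16/uint32 are not CUDA built-ins, here is a minimal sketch of the typedefs and constants all the snippets below assume (the typedef definitions are my guesses, the constants come from the code):

typedef unsigned char  uint8;
typedef unsigned short uint16;
typedef unsigned int   uint32;

#define HISTOGRAM_BINS 256         // bins in the final histogram
#define NUMSMBANKS     16          // shared memory banks on CC 1.x hardware
#define TAGMASK32      0x0FFF0FFFu // keeps both 12-bit counters, clears both 4-bit tag fields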

memory layout:

bank0               bank1                     bank15
-----uint32------   -----uint32------         -----uint32------
-uint16- -uint16-   -uint16- -uint16-   ...   -uint16- -uint16-
-   0  - -  128 -   -   0  - -  128 -         -   0  - -  128 -
-   1  - -  129 -   -   1  - -  129 -         -   1  - -  129 -
      ...                 ...                       ...
-  126 - -  254 -   -  126 - -  254 -         -  126 - -  254 -
-  127 - -  255 -   -  127 - -  255 -         -  127 - -  255 -
__shared__ uint32 pSubHistograms[(HISTOGRAM_BINS / 2) * NUMSMBANKS]; // 128 x 16

// ... clear shared memory (zero the subhistograms) ...

uint32 bankId = threadIdx.x & 0xF;                            // threadIdx.x (0-31) mod 16
uint8  pixel  = pImage[globalThreadId];
uint32 tag    = (threadIdx.y * 2 + (threadIdx.x >> 4)) << 28; // halfwarpId (0-15) in the 4 most significant bits
uint8  subbin = (pixel & 0x80) >> 7;                          // pixel <= 127 -> 0, pixel > 127 -> 1
uint16 binId  = (pixel & 0x7F) * NUMSMBANKS + bankId;

uint32 bins;
// the read-back in the while() needs a volatile access to force a real re-read
// from shared memory (the isolated function below uses a volatile pointer for that)
do {
    bins = pSubHistograms[binId] & 0x0FFFFFFF;                  // strip the 4 most significant bits (old tag)
    bins = (bins + (1 << (subbin * sizeof(uint16) * 8))) | tag; // increment subbin 0 or 1, then stamp own tag
    pSubHistograms[binId] = bins;
} while (pSubHistograms[binId] != bins);
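For completeness, the final reduction I have in mind is: each of the 256 bins is the sum of its 16 per-bank 12-bit counters after the tags are masked off. A rough sketch under the layout above (my naming, not tested):

__device__ void MergeSubHistograms(uint32* pSubHistograms, uint32* pHistogram)
{
    // assumes the 32x8 block, so threads cover bins 0-255; call after __syncthreads()
    uint32 bin   = threadIdx.y * blockDim.x + threadIdx.x;
    uint32 row   = bin & 0x7F;       // bins 0-127 live in the lower uint16, 128-255 in the upper
    uint32 shift = (bin >> 7) * 16;
    uint32 sum   = 0;
    for (uint32 bank = 0; bank < NUMSMBANKS; ++bank)
        sum += (pSubHistograms[row * NUMSMBANKS + bank] >> shift) & 0x0FFF; // 12-bit counter, tag stripped
    pHistogram[bin] = sum;
}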

Since each thread within a halfwarp accesses its own bank, there are no bank conflicts; write conflicts can only occur between threads with the same threadIdx.x & 0xF but different halfwarpIds.

I thought tagging each halfwarp would solve this; nevertheless, the code produces correct results in emulation mode but fails in release.

Thanks for any ideas,

downforme

I have isolated the relevant part into a function. If I replace shared memory with global memory and use atomicAdd(), everything works fine, so the error has to be here.

inline __device__ void SharedMemAtomic(volatile uint32* pSubHistograms, uint16 binId, uint32 add, uint32 tag)
{
    uint32 bins; // 2 uint16 bins: ttttcccccccccccc xxxxcccccccccccc (t...tag, c...counter, x...unused, msb->lsb)

    do {
        bins = pSubHistograms[binId] & TAGMASK32; // TAGMASK32 = 0x0FFF0FFF, clears the tag bits
        bins = tag | (bins + add);                // add = 0x00000001 or 0x00010000
        pSubHistograms[binId] = bins;
    } while (pSubHistograms[binId] != bins);

    //atomicAdd((uint32*)(&pSubHistograms[binId]), add); // works when the array is in global memory
}
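For reference (CC 1.1 has atomicAdd() for 32-bit words in global memory; shared memory atomics only arrive with CC 1.2, hence the emulation), the working global memory variant boils down to this sketch (g_pSubHistograms is just my name for the same array placed in global memory):

inline __device__ void GlobalMemAtomic(uint32* g_pSubHistograms, uint16 binId, uint32 add)
{
    atomicAdd(&g_pSubHistograms[binId], add); // hardware serializes the update, no tag needed
}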

The tag (uint32) contains the number of the halfwarp (0-15) in the 4 most significant bits.

add (uint32) is a 1 in either the lower or the upper uint16 half (0x00000001 or 0x00010000).
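Put together, a call site looks roughly like this (a sketch using the variables from the kernel above):

uint32 add = 1u << (subbin * 16); // 0x00000001 for bins 0-127, 0x00010000 for bins 128-255
SharedMemAtomic(pSubHistograms, binId, add, tag);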

I have a 32x8 block, giving me 8 warps, i.e. 16 halfwarps. Each thread in a halfwarp accesses its own bank, so there should be no bank conflicts within a halfwarp.

If several threads from different halfwarps try to write to the same address, each takes a local copy, increments it, tags it with its own halfwarpId, and writes it back. As only one write can succeed, the others read back a value different from their local copy and repeat the loop until all have succeeded.

At least that's what I think should happen, but it apparently doesn't. Any ideas what I got wrong?

downforme