shared mem atomics with cc1.1

Hi,

I'm doing a 256-bin histogram calculation and I'm having problems with emulating atomic operations in shared memory.

I tried an approach similar to the example in the SDK, but with a different layout of the subhistograms in shared memory (SM).
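For reference, the tagged increment from the SDK's histogram256 sample looks roughly like this (quoted from memory, so check it against your SDK version); note that it uses one subhistogram per warp and a 5-bit per-thread tag, so collisions only occur inside a single warp:

__device__ void addData256(volatile unsigned int* s_WarpHist, unsigned int data, unsigned int threadTag)
{
    unsigned int count;
    do {
        count = s_WarpHist[data] & 0x07FFFFFFU; // 27-bit counter, 5-bit tag
        count = threadTag | (count + 1);        // threadTag = thread index within the warp, shifted to the top 5 bits
        s_WarpHist[data] = count;
    } while (s_WarpHist[data] != count);
}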

I am working with compute capability 1.1.

Here is the basic idea:

  • 1 subhistogram per bank in shared memory

  • each thread of a halfwarp is responsible for 1 subhistogram

  • block size is 32x8; 8 warps → 16 halfwarps

  • uint16 bins → 4-bit tag, 12-bit counter: msb- ttttcccccccccccc -lsb

  • different halfwarp → different tag

  • access to shared memory via uint32, masking of subbins (uint16) (the types and constants I use are sketched right after this list)
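Since uint8/uint16/uint32 are not CUDA built-ins, here is a minimal sketch of the typedefs and constants all the snippets below assume (the typedef definitions are my guesses, the constants come from the code):

typedef unsigned char  uint8;
typedef unsigned short uint16;
typedef unsigned int   uint32;

#define HISTOGRAM_BINS 256         // bins in the final histogram
#define NUMSMBANKS     16          // shared memory banks on CC 1.x hardware
#define TAGMASK32      0x0FFF0FFFu // keeps both 12-bit counters, clears both 4-bit tag fields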

memory layout:

bank0               bank1                     bank15
-----uint32------   -----uint32------         -----uint32------
-uint16- -uint16-   -uint16- -uint16-   ...   -uint16- -uint16-
-   0  - -  128 -   -   0  - -  128 -         -   0  - -  128 -
-   1  - -  129 -   -   1  - -  129 -         -   1  - -  129 -
      ...                 ...                       ...
-  126 - -  254 -   -  126 - -  254 -         -  126 - -  254 -
-  127 - -  255 -   -  127 - -  255 -         -  127 - -  255 -
__shared__ uint32 pSubHistograms[(HISTOGRAM_BINS / 2) * NUMSMBANKS]; // 128 x 16

// ... clear shared memory (zero the subhistograms) ...

uint32 bankId = threadIdx.x & 0xF;                            // threadIdx.x (0-31) mod 16
uint8  pixel  = pImage[globalThreadId];
uint32 tag    = (threadIdx.y * 2 + (threadIdx.x >> 4)) << 28; // halfwarpId (0-15) in the 4 most significant bits
uint8  subbin = (pixel & 0x80) >> 7;                          // pixel <= 127 -> 0, pixel > 127 -> 1
uint16 binId  = (pixel & 0x7F) * NUMSMBANKS + bankId;

uint32 bins;
// the read-back in the while() needs a volatile access to force a real re-read
// from shared memory (the isolated function below uses a volatile pointer for that)
do {
    bins = pSubHistograms[binId] & 0x0FFFFFFF;                  // strip the 4 most significant bits (old tag)
    bins = (bins + (1 << (subbin * sizeof(uint16) * 8))) | tag; // increment subbin 0 or 1, then stamp own tag
    pSubHistograms[binId] = bins;
} while (pSubHistograms[binId] != bins);
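For completeness, the final reduction I have in mind is: each of the 256 bins is the sum of its 16 per-bank 12-bit counters after the tags are masked off. A rough sketch under the layout above (my naming, not tested):

__device__ void MergeSubHistograms(uint32* pSubHistograms, uint32* pHistogram)
{
    // assumes the 32x8 block, so threads cover bins 0-255; call after __syncthreads()
    uint32 bin   = threadIdx.y * blockDim.x + threadIdx.x;
    uint32 row   = bin & 0x7F;       // bins 0-127 live in the lower uint16, 128-255 in the upper
    uint32 shift = (bin >> 7) * 16;
    uint32 sum   = 0;
    for (uint32 bank = 0; bank < NUMSMBANKS; ++bank)
        sum += (pSubHistograms[row * NUMSMBANKS + bank] >> shift) & 0x0FFF; // 12-bit counter, tag stripped
    pHistogram[bin] = sum;
}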

Since each thread within a halfwarp accesses its own bank, there are no bank conflicts; write conflicts can only occur between threads with the same threadIdx.x & 0xF but different halfwarpIds.

I thought tagging each halfwarp would solve this; nevertheless, the code produces correct results in emulation mode but fails in release.

Thanks for any ideas,

downforme

I have isolated the relevant part into a function. If I replace shared memory with global memory and use atomicAdd(), everything works fine, so the error has to be here.

inline __device__ void SharedMemAtomic(volatile uint32* pSubHistograms, uint16 binId, uint32 add, uint32 tag)
{
    uint32 bins; // 2 uint16 bins: ttttcccccccccccc xxxxcccccccccccc (t...tag, c...counter, x...unused, msb->lsb)

    do {
        bins = pSubHistograms[binId] & TAGMASK32; // TAGMASK32 = 0x0FFF0FFF, clears the tag bits
        bins = tag | (bins + add);                // add = 0x00000001 or 0x00010000
        pSubHistograms[binId] = bins;
    } while (pSubHistograms[binId] != bins);

    //atomicAdd((uint32*)(&pSubHistograms[binId]), add); // works when the array is in global memory
}
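For reference (CC 1.1 has atomicAdd() for 32-bit words in global memory; shared memory atomics only arrive with CC 1.2, hence the emulation), the working global memory variant boils down to this sketch (g_pSubHistograms is just my name for the same array placed in global memory):

inline __device__ void GlobalMemAtomic(uint32* g_pSubHistograms, uint16 binId, uint32 add)
{
    atomicAdd(&g_pSubHistograms[binId], add); // hardware serializes the update, no tag needed
}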

The tag (uint32) contains the number of the halfwarp (0-15) in the 4 most significant bits.

add (uint32) is a 1 in either the lower or the upper uint16 half (0x00000001 or 0x00010000).
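Put together, a call site looks roughly like this (a sketch using the variables from the kernel above):

uint32 add = 1u << (subbin * 16); // 0x00000001 for bins 0-127, 0x00010000 for bins 128-255
SharedMemAtomic(pSubHistograms, binId, add, tag);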

I have a 32x8 block, giving me 8 warps, i.e. 16 halfwarps. Each thread in a halfwarp accesses its own bank, so there should be no bank conflicts within a halfwarp.

If several threads from different halfwarps try to write to the same address, each takes a local copy, increments it, tags it with its own halfwarpId, and writes it back. As only one write can succeed, the others read back a value different from their local copy and repeat the loop until all have succeeded.

At least that's what I think should happen, but it apparently doesn't. Any ideas what I got wrong?

downforme