Shared Memory Problems ... Conflict free access

Hi,

I am writing a histogram kernel, for which I need to write to an unsigned integer array of length 512 in shared memory. Every thread in a warp has its own 16-uint subhistogram within that array to write to. In order to achieve write-conflict- and bank-conflict-free access within a warp, I compute the indices with the following function:

__device__ inline unsigned int getWarpHistoIndex(unsigned int tid, unsigned int bin)
{
	unsigned int offset = (((tid & 0x1F) >> 4) << 8) + (tid & 0x0F);
	return (offset + (bin << 4));
}

So if a thread wants to increase a bin, it uses the function like this:

__shared__ unsigned int warp_histograms[512];
unsigned int tid = threadIdx.y * blockDim.x + threadIdx.x;
warp_histograms[getWarpHistoIndex(tid, (bin >> 2))] += (0x01000000 >> ((bin & 0x03) << 3));

This works fine in device emulation mode, but in release mode I have a race condition. It seems like threads are trying to write to the same location. Can I really assume that 32 threads are always executed in parallel? On the other hand, two threads writing to the same location should provoke a warp serialize, but I don't see any when profiling. So what am I doing wrong?

Thanks in advance,

Kwyjibo

Memory accesses are handled per half-warp on pre-Fermi cards and per warp on Fermi cards. You won't get any warp serialize on Fermi cards if two threads access the same word in the same bank, so that might be why you don't see any bank conflicts. As for emu mode, don't rely on it. It is no real emulation of a CUDA device, so many problems you would have in device mode won't show up in emu. It just runs device threads on the host, and it was removed in CUDA 3.1. I haven't looked at your bit-shifting magic, though :)

Just to rule out the obvious: Is it guaranteed that bin < 16 and tid < 32 (block size of 32)?

ONeill: The bit-shifting magic maybe looks a bit less magical if you write it as unsigned int offset = ((tid & 0x10) << 4) + (tid & 0x0F); ;)

Yes, bin < 16 is guaranteed. My block size is 16 x 16, but I don't see why that is a problem. Isn't (tid & 0x1F) doing a modulo 32?

I think I have to explain my “bit shift magic” a bit. To my understanding, the uints in shared memory are distributed round-robin across the 16 memory banks, i.e. word i lives in bank i mod 16.

So the only way to guarantee bank-conflict-free access is that

Thread 0 of the warp writes its histogram to bank 0 (indices 0 to 240),

Thread 1 of the warp writes its histogram to bank 1 (indices 1 to 241),

…

Thread 16 of the warp writes its histogram to bank 0 (indices 256 to 496),

Thread 17 of the warp writes its histogram to bank 1 (indices 257 to 497),

…

Thread 31 of the warp writes its histogram to bank 15 (indices 271 to 511).

Half warps don’t collide, so this should be fine.

Hope this helps,

Kwyjibo

Yes, but then you get a race between different warps. You might expect to get away with this on pre-Fermi hardware if the increment compiles to a single instruction, but the 24 cycle latency opens a rather large window for the race.

On compute 2.0 it will also fail, since two warps execute in parallel there (using (tid & 0x3F) and a 1024-entry table could fix that). The GTX 460’s superscalar architecture means further trouble.

So I can’t assume the multiprocessor always executes 32 threads at once? How can I get around this? The first solution I can think of would be a histogram for every thread, but that would consume much more memory, which is bad for my occupancy :(

Can you explain how the histo256 SDK sample deals with that problem? To my understanding, they use quite a similar approach to mine in that sample.

Keep the 16 subhistograms in 9 registers + 7 shared, and increment using a 10-way switch with indexing on the 7 shared?

I guess this would lead to heavy serialization of the kernel because of the switch.

I have just played around with the occupancy calculator, assuming 64 bytes of shared memory for every thread. The maximum occupancy I get is 33% for compute capability 1.1 and 50% for compute capability 2.0. Not very encouraging.

Has anybody here ever profiled the SDK samples? I’m interested in their occupancy, but I can’t profile them myself, because I use VS2008 and the SDK samples are for VS2005.

Occupancy is often overrated. It is important to have at least 6 warps per SM for compute 1.x and 12 for 2.0, which corresponds to 25% occupancy on compute 1.0/1.1 and 2.0, and 18.75% on compute 1.2/1.3 devices. Higher occupancy merely serves to hide memory latency. In the histogram case the memory access is very regular, and the latency can just as well be hidden through software pipelining/prefetching. Since the whole kernel is bandwidth bound anyway, it might not even be necessary to hide latency at all.

By the way, if you need more than 16 bins, atomic adds in shared memory will probably work well: as the kernel is bandwidth bound, they basically come for free.

Unfortunately I can’t have atomic adds in shared memory with CC 1.1.

I’d say we can close this thread. Thank you for your tips and your time!