Do not understand bank conflicts, please help

Hi,

On page 89 of the CUDA Programming Guide 2.2 there are two examples of bank conflicts:

__shared__ struct type shared[32];

struct type data = shared[BaseIndex+tid];

It says that if the data is accessed based on tid and the type is declared like this, it will result in NO bank conflict:

struct type {
	float x, y, z;
};

and if the type is declared like this, it WILL have a bank conflict:

struct type {
	float x, y;
};

What I do not understand is the difference between the two. Why do 3 consecutive floats in a struct not cause a bank conflict while 2 do? Since the width of a bank is 32 bits, each float will use a separate bank. So where is the conflict?

Thanks in advance, as always.

In the 1st case the compiler should be reading the data as:
x.x = ((float*) shared)[tid * 3 + 0];
x.y = ((float*) shared)[tid * 3 + 1];
x.z = ((float*) shared)[tid * 3 + 2];

Imagine the threads of the 1st half-warp, where tid ranges from 0 to 15. That means we address the following banks: [((0…15)*3 + 0) % 16] = [0,3,6,9,12,15,2,5,8,11,14,1,4,7,10,13]. As you can see, the threads address different banks when reading ‘x’, and the same goes for ‘y’ and ‘z’.

Now the 2nd case:
x.x = ((float*) shared)[tid * 2 + 0];
x.y = ((float*) shared)[tid * 2 + 1];
Reading ‘x’ addresses the following banks: [((0…15)*2 + 0) % 16] = [0,2,4,6,8,10,12,14,0,2,4,6,8,10,12,14]. As you can see, every other bank is referenced twice, giving you a 2-way bank conflict. The same happens when reading ‘y’.
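If it helps to see the arithmetic, here is a minimal host-side C sketch (my own, not from the guide) that reproduces those bank indices for both struct sizes, assuming 16 banks and a 16-thread half-warp:

#include <stdio.h>

#define NUM_BANKS      16   /* compute capability 1.x */
#define HALF_WARP_SIZE 16

/* Print which bank each thread of a half-warp hits when reading one
   float member of a struct that contains 'floats_per_struct' floats. */
static void print_banks(int floats_per_struct)
{
    int hits[NUM_BANKS] = {0};
    printf("%d floats per struct: ", floats_per_struct);
    for (int tid = 0; tid < HALF_WARP_SIZE; ++tid) {
        int bank = (tid * floats_per_struct) % NUM_BANKS;  /* member 'x' */
        hits[bank]++;
        printf("%d ", bank);
    }
    int worst = 0;
    for (int b = 0; b < NUM_BANKS; ++b)
        if (hits[b] > worst) worst = hits[b];
    printf("-> at most %d thread(s) per bank (%s)\n",
           worst, worst > 1 ? "bank conflict" : "no conflict");
}

int main(void)
{
    print_banks(3);  /* float x, y, z: every bank hit once, no conflict  */
    print_banks(2);  /* float x, y:    every other bank hit twice, 2-way */
    return 0;
}

Running it prints exactly the two bank sequences above, with a worst case of one thread per bank for the 3-float struct and two threads per bank for the 2-float struct.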

thanks!

I fail to understand why it is (0…15) instead of (0…31) in “[((0…15)*3 + 0) % 16]”. After all, in “__shared__ struct type shared[32];” there are 32 structs. Am I understanding this correctly?

Thanks.

Deryk.

At the time the original question was posted, all CUDA-capable GPUs accessed shared memory in half-warps, i.e. threads 0…15 used one cycle and threads 16…31 used a second cycle, so there could never be any bank conflicts between the lower and the upper half-warp.

This is still true for devices of compute capability 1.x. On devices of compute capability 2.x however a full warp accesses shared memory in one cycle, so your understanding is correct there.

Why the ‘modulo 16’ operation? Why the number 16? Is it related to the number of threads in a half-warp, which is equal to 16? Or does it have anything to do with the total number of memory banks (is that equal to 16?)? The “[((0…15)*3 + 0) % 16]” expression suggests a memory layout equivalent to a matrix with 16 columns, with each column representing a memory bank: addresses 0 to 15 belong to row 1, addresses 16 to 31 belong to row 2, and so forth (thus addresses 0, 16, 32, … belong to column 0 and hence memory bank 0, addresses 1, 17, 33, … belong to memory bank 1, and so on). Is this understanding of memory banks correct?

You are right, it relates to the number of memory banks, which is 32 on current devices (NVIDIA CUDA Programming Guide, F.4.3: “Shared memory has 32 banks…”); on the compute capability 1.x devices discussed above it was 16, which is why the example uses modulo 16.
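To make the “matrix of columns” picture from the previous post concrete, here is a tiny sketch of the word-address-to-bank mapping; NUM_BANKS is 16 for compute capability 1.x and 32 for 2.x and later:

#include <stdio.h>

/* Successive 32-bit words fall into successive banks, wrapping around,
   so word addresses 0, NUM_BANKS, 2*NUM_BANKS, ... all live in bank 0. */
#define NUM_BANKS 16   /* change to 32 for compute capability 2.x+ */

int main(void)
{
    for (int word_addr = 0; word_addr < 3 * NUM_BANKS; ++word_addr)
        printf("word %2d -> bank %2d\n", word_addr, word_addr % NUM_BANKS);
    return 0;
}

Each pass over the banks in the printout corresponds to one row of the matrix you described.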

Major thanks to sergeyn.

I searched everywhere on the net for several hours and found no good answer. Stack Overflow, university lecture materials, GPU Gems 3: not a single one of them clearly explained how exactly a bank conflict occurs. Yet in one small post you illustrated the answer perfectly. Thank you SO much.

I always wonder why it’s so difficult for highly paid professional authors and professors to convey the most simple concepts.