Don't understand bank conflicts for shared mem

I am having a problem with bank conflicts with respect to shared memory. I am reading the NVIDIA CUDA Programming Guide, and in section 5.1.2.5 it is mentioned that shared memory is divided into equally sized banks, each 32 bits wide.

OK, so this was easy… each bank is 32 bits, no-brainer. Now read the next part, which says a shared memory request is split per half-warp, so each half-warp of 16 threads accesses the 16 banks.

So this means that at any one time, 16 threads access 16 banks, and conflicts, if they happen, occur only within these half-warps.

Now pay attention, gentlemen:

This is where my problems start. I am aware that if the stride is odd, then no two threads in the half-warp will share the same bank (an odd stride is coprime to 16, so the 16 addresses map to 16 distinct banks), and so there are NO conflicts.

But let us look at this with respect to the struct data. Consider a half-warp of 16 threads that wants to access the struct.
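
To be concrete, here is the kind of access I mean. I am assuming the three-float struct from the guide's example (it matches the 3 × 32-bit element size below); the kernel wrapper is just mine for context, launched with one half-warp of 16 threads:

struct type {
    float x, y, z;                            // 3 x 32-bit words per element
};

__global__ void readStructs(float *out, int BaseIndex) {
    __shared__ struct type shared[32];
    int tid = threadIdx.x;                    // 0..15 within the half-warp
    shared[tid].x      = shared[tid].y      = shared[tid].z      = (float)tid;
    shared[tid + 16].x = shared[tid + 16].y = shared[tid + 16].z = (float)(tid + 16);
    __syncthreads();
    struct type data = shared[BaseIndex + tid];  // the access in question
    out[tid] = data.x + data.y + data.z;
}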

So shared[BaseIndex + 0] starts at bank 0
shared[BaseIndex + 1] starts at bank 3
shared[BaseIndex + 2] starts at bank 6
…
shared[BaseIndex + 5] starts at bank 15

But one struct element takes 3 × 32 bits, so shared[BaseIndex + 5] is spread across banks 15 (x), 16 (y), and 17 (z). Yet a half-warp can only access 16 banks (0-15). So how do threads with IDs 5-15 access their data, when for them the data lies beyond bank 15?

Data stored in shared memory is “striped” across the 16 banks, so that bank 0 holds words {0, 16, 32, 48, 64, …}, bank 1 holds words {1, 17, 33, 49, 65, …}, bank 2 holds words {2, 18, 34, 50, 66, …}, etc. So a typical bank-conflict scenario on compute 1.0/1.1/1.2/1.3 hardware occurs when successive threads in a half-warp read a combination of type and pitch that makes accesses within the same half-warp hit the same shared memory bank: reading 32-bit types with a pitch of 16 within a half-warp, or 64-bit types with a pitch of 8 within a half-warp, etc.
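
A minimal sketch of those patterns, assuming compute 1.x (16 banks, bank = 32-bit word address mod 16); the kernel and names here are hypothetical, just to illustrate:

__global__ void conflictDemo(float *out) {
    __shared__ float smem[256];
    int tid = threadIdx.x;                 // threads 0..15 of a half-warp
    for (int i = tid; i < 256; i += blockDim.x)
        smem[i] = (float)i;                // fill shared memory with something
    __syncthreads();

    float a = smem[tid];                   // pitch 1: banks 0..15, conflict-free
    float b = smem[tid * 3];               // odd pitch 3: still 16 distinct banks, conflict-free
    float c = smem[tid * 16];              // pitch 16: every thread hits bank 0,
                                           // a 16-way conflict, fully serialized
    out[tid] = a + b + c;
}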

“s” denotes shared memory

bank:         0        1        2        3        4        5        6        7        8        9        10       11       12       13       14       15
words 0-15:   s[0].x   s[0].y   s[0].z   s[1].x   s[1].y   s[1].z   s[2].x   s[2].y   s[2].z   s[3].x   s[3].y   s[3].z   s[4].x   s[4].y   s[4].z   s[5].x
words 16-31:  s[5].y   s[5].z   s[6].x   s[6].y   s[6].z   s[7].x   s[7].y   s[7].z   s[8].x   s[8].y   s[8].z   s[9].x   s[9].y   s[9].z   s[10].x  s[10].y
words 32-47:  s[10].z  s[11].x  s[11].y  s[11].z  s[12].x  s[12].y  s[12].z  s[13].x  s[13].y  s[13].z  s[14].x  s[14].y  s[14].z  s[15].x  s[15].y  s[15].z

“s[k].x for k = 0, 1, 2, …, 15” would access

banks 0, 3, 6, 9, 12, 15, 2, 5, 8, 11, 14, 1, 4, 7, 10, 13

i.e., each of the 16 banks exactly once, so no bank conflict.
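
If you want to convince yourself, a quick host-side check of that bank list (plain C, just computing bank = word index mod 16):

#include <stdio.h>

int main(void) {
    int hits[16] = {0};
    for (int k = 0; k < 16; ++k) {
        int bank = (3 * k) % 16;           // s[k].x lives at word 3k
        hits[bank]++;
        printf("s[%2d].x -> bank %2d\n", k, bank);
    }
    for (int b = 0; b < 16; ++b)
        if (hits[b] != 1)
            printf("bank %d hit %d times\n", b, hits[b]);
    return 0;
}

It prints the mapping; the second loop prints nothing, because every bank is hit exactly once.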

Thanks for the explanation, aviday and LSChien! Also, LS, thanks for that nice illustration :)

I too thought it was striped, but there was another issue bugging me: if it is striped, then how do threads 0, 5, and 10 get access to, e.g.,
s[0].x
s[5].y
s[10].z
respectively? Bank 0 is only 32 bits wide and can't store all three. So at bank 0, only one of s[0].x, s[5].y, or s[10].z can be stored, right? So how do all threads concurrently access their data in 2 clock cycles without swapping?

Each bank holds 256 32-bit words. Just as in LSChien's excellent diagram, each 3-word structure is written across 3 sequential banks, so s[0].x, s[5].y, and s[10].z all live in bank 0, just at different word offsets within it. Each thread in a half-warp can read from one of the 16 banks every two clock cycles without conflicts, with each half-warp thread requiring 3 transactions to read the complete structure.
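
To make the “3 transactions” concrete, here is a sketch of what the struct read amounts to when shared memory is viewed as raw 32-bit words (hypothetical kernel; with a 3-word element, the compiler issues one half-warp-wide load per member):

__global__ void threeTransactions(float *out, int BaseIndex) {
    __shared__ float words[96];            // same storage as struct type shared[32]
    int tid = threadIdx.x;                 // threads 0..15 of a half-warp
    for (int i = tid; i < 96; i += blockDim.x)
        words[i] = (float)i;
    __syncthreads();

    int base = 3 * (BaseIndex + tid);      // first word of this thread's element
    float x = words[base + 0];             // transaction 1: stride-3 addresses, 16 distinct banks
    float y = words[base + 1];             // transaction 2: same pattern shifted by one bank
    float z = words[base + 2];             // transaction 3: likewise
    out[tid] = x + y + z;
}

Each transaction on its own is conflict-free, so the whole struct read costs 3 conflict-free transactions rather than one conflicted one.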

OK, I had assumed 1 bank's storage size = 32 bits, whereas 1 bank is actually 256 × 32 bits (storing that in my brain bank!). Could I know in which PDF this info is present?

Ah, 3 transactions. So how do you define a transaction? I mean, is it a read/write operation per bank per thread in 2 clock cycles?

They are just doing the math: size of shared memory in bytes ÷ 4 bytes per 32-bit word ÷ 16 banks:

16384 bytes / 4 bytes/word / 16 banks = 256 words per bank

The size of shared memory (along with other useful information) is given in Appendix A of the CUDA Programming Guide.

Great, thanks, got it: 16 KB / 16 banks = 1024 bytes per bank = 256 × 4 bytes.