I am having a problem with bank conflicts in shared memory. I am reading the NVIDIA CUDA Programming Guide, and in section 5.1.2.5 it is mentioned
OK, so this was easy… each bank is 32 bits wide, a no-brainer. Now read this next
So this means that 16 threads at a time get access to the 16 banks, so conflicts, if they happen, occur only within a half-warp.
Now pay attention, gentlemen
This is where my problems start. I am aware that if the stride is odd, no two threads of a half-warp share the same bank, so there are NO conflicts.
But let us look at this with respect to this struct data. Consider a half-warp of 16 threads that wants to access the struct.
So shared[BaseIndex + 0] will go to bank 0
shared[BaseIndex + 1] will go to bank 3
shared[BaseIndex + 2] will go to bank 6
.
.
shared[BaseIndex + 5] will go to bank 15
But one element of the struct takes 3 x 32 bits, so it is spread across
banks 15 (x), 16 (y) and 17 (z). But a half-warp can only access 16 banks (0-15). So how do the threads with IDs 5 to 15 access their data, given that for them the data lies beyond bank 15?
Data stored in shared memory is “striped” across the 16 banks, so that bank 0 holds words {0, 16, 32, 48, 64, …}, bank 1 holds words {1, 17, 33, 49, 65, …}, bank 2 holds words {2, 18, 34, 50, 66, …}, etc. So a typical bank conflict on compute 1.0/1.1/1.2/1.3 occurs when successive threads in a half-warp read a combination of type and pitch that makes accesses within the same half-warp hit the same shared memory bank: for example, reading 32-bit types with a pitch of 16 within a half-warp, or 64-bit types with a pitch of 8 within a half-warp.
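The striping rule can be checked with a few lines of arithmetic. Below is a minimal sketch in Python (plain host-side arithmetic, not device code), assuming 16 banks of 32-bit words as on compute 1.x. The helper names `bank_of` and `conflict_degree` are made up for illustration and are not part of any CUDA API.

```python
from collections import Counter

NUM_BANKS = 16   # assumed: compute 1.x, 16 banks, each one 32-bit word wide
HALF_WARP = 16   # threads that are serviced together


def bank_of(word_index):
    """Bank that holds a given 32-bit word index of shared memory."""
    return word_index % NUM_BANKS


def conflict_degree(stride_words, base=0):
    """Largest number of half-warp threads that hit the same bank when
    thread t reads the word at base + t * stride_words.
    A result of 1 means the access is conflict-free."""
    banks = Counter(bank_of(base + t * stride_words) for t in range(HALF_WARP))
    return max(banks.values())


for stride in (1, 2, 3, 8, 16):
    print(f"stride {stride:2d} words -> {conflict_degree(stride)} thread(s) per bank")
```

A stride of 3 words gives one thread per bank because 3 and 16 share no common factor, while a stride of 16 words (32-bit types with a pitch of 16, or 64-bit types with a pitch of 8) puts all 16 threads on the same bank.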
thanks for the explanation aviday and lschen! also ls thanks for that nice illustration :)
I too thought it was striped, but there was another issue that was bugging me. Namely, if it is striped, then how do, e.g.,
s[0].x
s[5].y
s[10].z
threads 0, 5 and 10 get access to x, y and z? Bank 0 is only 32 bits wide and can't store all three, so bank 0 can hold only one of s[0].x, s[5].y or s[10].z, right? So how do all the threads concurrently access their data in 2 clock cycles without swapping?
Each bank holds 256 32-bit words. Just as in LSChien’s excellent diagram, each 3-word structure is written across 3 sequential banks. So each thread of a half-warp can read from one of the 16 banks every two clock cycles without conflicts, and each thread of the half-warp needs 3 transactions to read the complete structure.
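To make the three-transaction picture concrete, here is a small sketch (Python arithmetic only, not device code). It assumes 16 banks, a struct of three 32-bit fields named x, y and z as in the question, and a base index of 0; `field_banks` is a hypothetical helper. It lists which bank each thread of the half-warp touches for each field, then shows where s[0].x, s[5].y and s[10].z end up.

```python
NUM_BANKS = 16
HALF_WARP = 16
FIELDS = ("x", "y", "z")   # assumed layout: three 32-bit words per struct


def field_banks(field_offset, base_word=0):
    """Banks hit by threads 0..15 when each reads one field of its own struct."""
    return [(base_word + 3 * t + field_offset) % NUM_BANKS
            for t in range(HALF_WARP)]


for offset, name in enumerate(FIELDS):
    banks = field_banks(offset)
    # 16 distinct banks means this one-word-per-thread read has no conflict
    print(name, banks, "conflict-free:", len(set(banks)) == HALF_WARP)

# s[0].x, s[5].y and s[10].z are three different words stored in the same bank
for t, offset in ((0, 0), (5, 1), (10, 2)):
    word = 3 * t + offset
    print(f"s[{t}].{FIELDS[offset]} -> word {word}, bank {word % NUM_BANKS}")
```

Under these assumptions, thread 5's x lands in bank 15 and its y and z wrap around to banks 0 and 1. Words 0, 16 and 32 all sit in bank 0, one below the other, and they are read in three separate passes (x, then y, then z), so each pass touches 16 distinct banks.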
OK, I had assumed that the storage size of one bank was 32 bits, so one bank actually holds 256 x 32 bits (storing that in my brain bank!). Could I know which PDF contains this info?
Ah, 3 transactions. So how do you define a transaction? I mean, is it one read/write operation per bank per thread in 2 clock cycles?