Don't understand bank conflicts for shared mem

I am having a problem with bank conflicts with respect to shared memory. I am reading the NVIDIA CUDA Programming Guide, and in section 5.1.2.5 it is mentioned that shared memory is divided into equally sized banks, each 32 bits wide.

OK, so this was easy… each bank is 32 bits, no-brainer. Now read the next part, which says a shared memory request is split per half-warp, so each half-warp of 16 threads accesses the 16 banks.

So this means that at any one time, 16 threads access 16 banks, and conflicts, if they happen, occur only within these half-warps.

Now pay attention, gentlemen:

This is where my problems start. I am aware that if the stride is odd, then no two threads in the half-warp will share the same bank (an odd stride is coprime to 16, so the 16 addresses map to 16 distinct banks), and so there are NO conflicts.

But let us look at this with respect to the struct data. Consider a half-warp of 16 threads that wants to access the struct.
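
To be concrete, here is the kind of access I mean. I am assuming the three-float struct from the guide's example (it matches the 3 × 32-bit element size below); the kernel wrapper is just mine for context, launched with one half-warp of 16 threads:

struct type {
    float x, y, z;                            // 3 x 32-bit words per element
};

__global__ void readStructs(float *out, int BaseIndex) {
    __shared__ struct type shared[32];
    int tid = threadIdx.x;                    // 0..15 within the half-warp
    shared[tid].x      = shared[tid].y      = shared[tid].z      = (float)tid;
    shared[tid + 16].x = shared[tid + 16].y = shared[tid + 16].z = (float)(tid + 16);
    __syncthreads();
    struct type data = shared[BaseIndex + tid];  // the access in question
    out[tid] = data.x + data.y + data.z;
}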

So shared[BaseIndex + 0] starts at bank 0
shared[BaseIndex + 1] starts at bank 3
shared[BaseIndex + 2] starts at bank 6
…
shared[BaseIndex + 5] starts at bank 15

But one struct element takes 3 × 32 bits, so shared[BaseIndex + 5] is spread across banks 15 (x), 16 (y), and 17 (z). Yet a half-warp can only access 16 banks (0-15). So how do threads with IDs 5-15 access their data, when for them the data lies beyond bank 15?

Data stored in shared memory is “striped” across the 16 banks, so that bank 0 holds words {0, 16, 32, 48, 64, …}, bank 1 holds words {1, 17, 33, 49, 65, …}, bank 2 holds words {2, 18, 34, 50, 66, …}, etc. So a typical bank-conflict scenario on compute 1.0/1.1/1.2/1.3 hardware occurs when successive threads in a half-warp read a combination of type and pitch that makes accesses within the same half-warp hit the same shared memory bank: reading 32-bit types with a pitch of 16 within a half-warp, or 64-bit types with a pitch of 8 within a half-warp, etc.
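
A minimal sketch of those patterns, assuming compute 1.x (16 banks, bank = 32-bit word address mod 16); the kernel and names here are hypothetical, just to illustrate:

__global__ void conflictDemo(float *out) {
    __shared__ float smem[256];
    int tid = threadIdx.x;                 // threads 0..15 of a half-warp
    for (int i = tid; i < 256; i += blockDim.x)
        smem[i] = (float)i;                // fill shared memory with something
    __syncthreads();

    float a = smem[tid];                   // pitch 1: banks 0..15, conflict-free
    float b = smem[tid * 3];               // odd pitch 3: still 16 distinct banks, conflict-free
    float c = smem[tid * 16];              // pitch 16: every thread hits bank 0,
                                           // a 16-way conflict, fully serialized
    out[tid] = a + b + c;
}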

“s” denotes shared memory

bank:         0        1        2        3        4        5        6        7        8        9        10       11       12       13       14       15
words 0-15:   s[0].x   s[0].y   s[0].z   s[1].x   s[1].y   s[1].z   s[2].x   s[2].y   s[2].z   s[3].x   s[3].y   s[3].z   s[4].x   s[4].y   s[4].z   s[5].x
words 16-31:  s[5].y   s[5].z   s[6].x   s[6].y   s[6].z   s[7].x   s[7].y   s[7].z   s[8].x   s[8].y   s[8].z   s[9].x   s[9].y   s[9].z   s[10].x  s[10].y
words 32-47:  s[10].z  s[11].x  s[11].y  s[11].z  s[12].x  s[12].y  s[12].z  s[13].x  s[13].y  s[13].z  s[14].x  s[14].y  s[14].z  s[15].x  s[15].y  s[15].z

“s[k].x for k = 0, 1, 2, …, 15” would access

banks 0, 3, 6, 9, 12, 15, 2, 5, 8, 11, 14, 1, 4, 7, 10, 13

i.e., each of the 16 banks exactly once, so no bank conflict.
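
If you want to convince yourself, a quick host-side check of that bank list (plain C, just computing bank = word index mod 16):

#include <stdio.h>

int main(void) {
    int hits[16] = {0};
    for (int k = 0; k < 16; ++k) {
        int bank = (3 * k) % 16;           // s[k].x lives at word 3k
        hits[bank]++;
        printf("s[%2d].x -> bank %2d\n", k, bank);
    }
    for (int b = 0; b < 16; ++b)
        if (hits[b] != 1)
            printf("bank %d hit %d times\n", b, hits[b]);
    return 0;
}

It prints the mapping; the second loop prints nothing, because every bank is hit exactly once.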

Thanks for the explanation, aviday and LSChien! Also, LS, thanks for that nice illustration :)

I too thought it was striped, but there was another issue bugging me: if it is striped, then how do threads 0, 5, and 10 get access to, e.g.,
s[0].x
s[5].y
s[10].z
respectively? Bank 0 is only 32 bits wide and can't store all three. So at bank 0, only one of s[0].x, s[5].y, or s[10].z can be stored, right? So how do all threads concurrently access their data in 2 clock cycles without swapping?

Each bank holds 256 32-bit words. Just as in LSChien's excellent diagram, each 3-word structure is written across 3 sequential banks, so s[0].x, s[5].y, and s[10].z all live in bank 0, just at different word offsets within it. Each thread in a half-warp can read from one of the 16 banks every two clock cycles without conflicts, with each half-warp thread requiring 3 transactions to read the complete structure.
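
To make the “3 transactions” concrete, here is a sketch of what the struct read amounts to when shared memory is viewed as raw 32-bit words (hypothetical kernel; with a 3-word element, the compiler issues one half-warp-wide load per member):

__global__ void threeTransactions(float *out, int BaseIndex) {
    __shared__ float words[96];            // same storage as struct type shared[32]
    int tid = threadIdx.x;                 // threads 0..15 of a half-warp
    for (int i = tid; i < 96; i += blockDim.x)
        words[i] = (float)i;
    __syncthreads();

    int base = 3 * (BaseIndex + tid);      // first word of this thread's element
    float x = words[base + 0];             // transaction 1: stride-3 addresses, 16 distinct banks
    float y = words[base + 1];             // transaction 2: same pattern shifted by one bank
    float z = words[base + 2];             // transaction 3: likewise
    out[tid] = x + y + z;
}

Each transaction on its own is conflict-free, so the whole struct read costs 3 conflict-free transactions rather than one conflicted one.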

OK, I had assumed 1 bank's storage size = 32 bits, whereas 1 bank is actually 256 × 32 bits (storing that in my brain bank!). Could I know in which PDF this info is present?

Ah, 3 transactions. So how do you define a transaction? I mean, is it a read/write operation per bank per thread in 2 clock cycles?

They are just doing the math: size of shared memory in bytes ÷ 4 bytes per 32-bit word ÷ 16 banks:

16384 bytes / 4 bytes/word / 16 banks = 256 words per bank

The size of shared memory (along with other useful information) is given in Appendix A of the CUDA Programming Guide.

Great, thanks, got it: 16 KB / 16 banks = 1024 bytes per bank = 256 × 4 bytes.