Shared memory banks usage: How to spread the data among banks?

Romant · July 3, 2008, 9:24pm

The programming guide says that shared memory is split into 16 banks, 1 KB each. A half-warp can read data 16 times faster when each thread of the half-warp reads from a different bank.

I’m trying to make use of this, but without success: the kernel runs 6 (!) times slower when I spread the data among the shared memory banks than when I lay it out contiguously.

In my task, each thread in the block has its own set of data (in particular, a set of stacks) that resides in shared memory. Say the total shared memory required for each thread is 80 bytes. Also, my kernel consumes 48 bytes.

I try to put the 80-byte data chunks of successive threads into successive memory banks. So the data of thread 0 in the block resides in bank 0 with no offset, the data of thread 1 in bank 1 with no offset, …, the data of thread 16 in bank 0 with an offset of 80, etc.

The kernel works and calculates correctly, but it is 6 (!) times slower than when all the 80-byte data sets were placed one after another with no thought given to banks.

How should I deal with banks correctly? I’m really surprised by the current results and would like to get as much speed out of shared memory as possible…

Thanks in advance!

curryml · July 3, 2008, 9:52pm

Index calculation time is significant compared to the access time of a bank of shared memory. You can think of it this way:

SHMEM can be accessed in four cycles if there are no conflicts, or N*4 cycles for an N-way conflict on a bank. Four cycles is the same amount of time required for the quickest arithmetic operations (add, subtract, __umul24). Other operations that may be required for indexing take much longer (integer division and modulus are more than sixteen cycles). If you’re not careful, you can spend a whole lot longer calculating the conflict-free address than waiting for a serialized memory access. Another potential problem is fragmentation of the address space, if you’re splitting into single-byte segments instead of 4-byte segments. This could lower the occupancy of your kernel.
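The cost model above can be written down as a small host-side sketch. It assumes the 16-bank arrangement of this hardware generation, with successive 32-bit words falling into successive banks, and it takes the 4-cycle figure from the post as given. `conflict_degree` and `estimated_cycles` are hypothetical helper names, not part of the CUDA API.

```cpp
#include <algorithm>
#include <array>

constexpr int kBanks = 16;     // assumption: 16 banks, as quoted from the programming guide
constexpr int kHalfWarp = 16;  // threads whose shared memory accesses are issued together

// Largest number of half-warp threads that hit the same bank when thread t
// reads the 32-bit word at index t * stride_words + word_offset.
// Simplification: the broadcast case (all threads reading the same word) is ignored.
int conflict_degree(int stride_words, int word_offset) {
    std::array<int, kBanks> hits{};
    for (int t = 0; t < kHalfWarp; ++t) {
        int word = t * stride_words + word_offset;
        ++hits[word % kBanks];  // assumption: successive words map to successive banks
    }
    return *std::max_element(hits.begin(), hits.end());
}

// Rough cost from the post: 4 cycles per serialized pass over a bank.
int estimated_cycles(int stride_words, int word_offset) {
    return 4 * conflict_degree(stride_words, word_offset);
}
```

Under these assumptions, `estimated_cycles(1, 0)` is 4 (no conflict), while a stride of 20 words (80 bytes per thread) gives 16, which is in the same range as the cost quoted for a single integer division.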

In closing: Even if you have a bit of conflict between threads, shmem operations are quite fast! However, if you make any headway on this problem, be sure to let us know… I’ve similarly wondered about reducing conflict on kernel-private structures in shared memory.

Reimar · July 4, 2008, 6:32am

Huh? In my idea of “continuously” there would not be any bank conflicts, in which case you actually replaced a conflict-free access pattern with one full of conflicts (probably that’s not how you meant it, but you never know…).
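One possible reading of the layout in the first post is that each bank was taken to be a contiguous 1 KB region, so that thread t starts at byte offset 1024 * (t % 16) + 80 * (t / 16). That reading is an assumption; the post does not give the exact offsets. The sketch below, with hypothetical helper names, maps those offsets to banks under the word-interleaved scheme (bank = word index mod 16) and counts how many distinct banks the first half-warp touches.

```cpp
#include <set>

// Assumed reading of the first post: banks treated as contiguous 1 KB regions.
int spread_byte_offset(int t) {
    return 1024 * (t % 16) + 80 * (t / 16);
}

// Word-interleaved mapping: successive 32-bit words go to successive banks.
int bank_of_byte(int byte_offset) {
    return (byte_offset / 4) % 16;
}

// Number of distinct banks touched by threads 0..15 when each thread
// reads the first word of its own data.
int distinct_banks_spread() {
    std::set<int> banks;
    for (int t = 0; t < 16; ++t) {
        banks.insert(bank_of_byte(spread_byte_offset(t)));
    }
    return static_cast<int>(banks.size());
}
```

If the layout really was built this way, `distinct_banks_spread()` returns 1: all 16 threads land in bank 0, which would be a 16-way conflict rather than a conflict-free pattern.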

curryml · July 4, 2008, 7:06am

As the data structure he’s referring to is wider than a bank, there would be bank conflicts where the structure extends over 3-4 banks if it’s placed into shared memory consecutively.
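To make this concrete for the 80-byte (20-word) structures discussed here, the sketch below compares three candidate layouts on the host. It assumes 16 word-interleaved banks, that all threads of a half-warp access the same element k of their own structure at the same time (which may not hold for stacks of different depths), and a block of 64 threads. The padded and interleaved layouts are common workarounds, not something proposed in the thread, and all names are hypothetical.

```cpp
#include <algorithm>
#include <array>

constexpr int kBanks = 16;            // assumption: 16 word-interleaved banks
constexpr int kWordsPerThread = 20;   // 80 bytes / 4 bytes per word
constexpr int kThreadsPerBlock = 64;  // assumption: illustrative block size

// Word index of element k belonging to thread t, for three layouts.
int contiguous_index(int t, int k) { return t * kWordsPerThread + k; }
// One unused pad word per thread makes the stride odd (21 words).
int padded_index(int t, int k) { return t * (kWordsPerThread + 1) + k; }
// Element k of all threads is stored together; needs only a multiply and an add.
int interleaved_index(int t, int k) { return k * kThreadsPerBlock + t; }

// Worst conflict degree over all elements k for the first half-warp.
template <typename IndexFn>
int worst_conflict(IndexFn index) {
    int worst = 0;
    for (int k = 0; k < kWordsPerThread; ++k) {
        std::array<int, kBanks> hits{};
        for (int t = 0; t < 16; ++t) {
            ++hits[index(t, k) % kBanks];
        }
        worst = std::max(worst, *std::max_element(hits.begin(), hits.end()));
    }
    return worst;
}
```

Under these assumptions, `worst_conflict(contiguous_index)` reports a 4-way conflict, because a 20-word stride makes threads t and t + 4 share a bank, while `worst_conflict(padded_index)` and `worst_conflict(interleaved_index)` both report 1. Neither alternative needs a division or modulus in the index calculation, which matters given the instruction costs mentioned earlier in the thread.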
