Shared memory bank usage: how to spread data among banks?

The programming guide says that shared memory is split into 16 banks of 1 KB each. A half-warp can read data 16 times faster when each thread of the half-warp reads from a different bank.

I’m trying to take advantage of this, but without success: the kernel runs 6 (!) times slower when I spread the data among the shared memory banks rather than laying it out contiguously.

In my task, each thread in a block has its own set of data (in particular, a set of stacks) that resides in shared memory. Say the total shared memory required per thread is 80 bytes. In addition, my kernel itself consumes 48 bytes.

I tried to put the 80-byte data chunks of successive threads into successive memory banks. So the data of thread 0 in the block resides in bank 0 with no offset, the data of thread 1 in bank 1 with no offset, …, the data of thread 16 back in bank 0 with an offset of 80, etc.

The kernel works and computes correctly, but 6 (!) times slower than when the 80-byte data sets were simply laid out one after another without any thought about banks.

How do I deal with banks correctly? I’m really surprised by these results and would like to get as much speed out of shared memory as possible…

Thanks in advance!

Index calculation time is significant compared to the access time of a bank of shared memory. You can think of it this way:

SHMEM can be accessed in four cycles if there are no conflicts, or N*4 cycles for an N-way conflict on a bank. This is the same amount of time required for the quickest arithmetic operations (add, subtract, __umul24). Other operations that may be required for indexing take much longer (integer division and modulus are more than sixteen cycles). If you’re not careful, you can spend far longer calculating the conflict-free address than you would have spent waiting for a serialized memory access. Another potential problem is fragmentation of the address space if you split into single-byte segments instead of 4-byte segments; this could lower your kernel’s occupancy.

In closing: Even if you have a bit of conflict between threads, shmem operations are quite fast! However, if you make any headway on this problem, be sure to let us know… I’ve similarly wondered about reducing conflict on kernel-private structures in shared memory.

Huh? In my idea of “contiguously” there would not be any bank conflicts, in which case you would actually have replaced a conflict-free access pattern with one full of conflicts (probably that’s not how you meant it, but you never know…).

As the data structure he’s referring to is wider than a bank, there would be bank conflicts where the structures overlap in the banks if they are placed into shared memory consecutively.