The Programming Guide says that shared memory is split into 16 banks, 1 KB each. A half-warp can read data 16 times faster when each thread of the half-warp reads from a different bank.
I’m trying to take advantage of this, but without success — the kernel runs 6 (!) times slower when I spread the data across shared memory banks rather than laying it out contiguously.
In my task, each thread in a block has its own set of data (specifically, a set of stacks) that resides in shared memory. Say the total shared memory required per thread is 80 bytes. In addition, my kernel itself consumes 48 bytes of shared memory.
I try to place the 80-byte data chunks of successive threads in successive memory banks. So the data of thread 0 in the block resides in bank 0 with no offset, the data of thread 1 in bank 1 with no offset, …, the data of thread 16 back in bank 0 with an offset of 80, etc.
The kernel runs and computes correctly, yet it is 6 (!) times slower than when the 80-byte data sets were placed one after another with no thought about banks.
How do I deal with banks correctly? I’m genuinely surprised by these results and would like to squeeze as much speed out of shared memory as possible.
Thanks in advance!