Next topic is the speed of the load/store operations themselves. This depends on the GPU generation, so I will talk about Maxwell/Pascal. Each SM here has 4 dispatchers. Each dispatcher has a 32-wide ALU and an 8-wide LD/ST engine, so each cycle a dispatcher can start an ALU operation, but it can start an LD/ST operation only once per 4 cycles (32 lanes / 8 units). AFAIK, the bandwidth of shared memory and L1 cache access is limited only by this value. So, copying data from L1$ to shared memory requires 8 GPU cycles per warp (4 to issue the load plus 4 to issue the store). I.e., simultaneously with such copying, up to 8 ALU operations can be performed.
But the delays are pretty high - AFAIR, the minimum delay of a shmem/L1$ access is 30-50 cycles. In general, an operation's delay defines when the next operation on its result can be started. I.e., if you only load data into a register, the thread will continue execution until this register is used in another operation. When you perform a "shmem = global" copy, this means we get 4 cycles to issue the load into a register, then 30-50 cycles until the data arrives, then 4 cycles to issue the store into shmem. If this operation is followed by syncthreads(), we may need to wait another 30-50 cycles for the data to actually be stored - I don't know exactly (maybe txbob, njuffa or greg can clarify that?).
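For reference, here is a minimal sketch of the kind of code under discussion (the kernel name, `tile` and `src` are my naming, not from any particular codebase):

```
__global__ void copy_then_compute(const float *src)
{
    __shared__ float tile[256];            // one element per thread

    // "shmem = global" compiles to a global load into a register
    // followed by a store to shared memory; the thread only stalls
    // when a later instruction actually consumes the loaded value.
    tile[threadIdx.x] = src[blockIdx.x * blockDim.x + threadIdx.x];

    __syncthreads();                       // wait until the whole tile is in shmem

    // ... calculations using tile[] go here ...
}
```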
So, the code we are discussing will take at least 40 cycles to execute, with only a few LD/ST operations, and especially few ALU operations, actually performed in that time. And here extra thread blocks come to the rescue! While one thread block wastes time waiting for its data to move, threads in another thread block can perform the calculation part of their code. When the second thread block starts moving data, the first one can start performing calculations. That's why it's crucial for this style of kernel (load data into shmem → syncthreads() → perform calculations) to have at least 2 thread blocks per SM, and maybe even more.
Of course, the performance increase as a function of the number of thread blocks depends on how much time you spend moving data. If it's only 10% of overall time, a second block will increase speed by about 9%, and a third block by less than 1%. OTOH, if data moving takes 50% of overall time, adding more than two blocks can add another 10-20% of performance.
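If you want to check how many blocks can actually be resident per SM for your launch configuration, the runtime provides an occupancy query; a quick sketch (the kernel here is just a placeholder):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(const float *src) { /* placeholder */ }

int main()
{
    int blocksPerSM = 0;
    // How many blocks of my_kernel fit on one SM for a 256-thread
    // block and zero bytes of dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel, 256, 0);
    printf("resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```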
But there is another way to increase performance - delayed moving. If you have enough spare registers, you can overlap loading into register set #1 with storing from register set #2, and swap their roles on the next iteration (see the CUDA sketch after the pseudocode):
…
store set #2
load set #1
syncthreads()
calculations()
store set #1
load set #2
syncthreads()
calculations()
…
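A minimal CUDA sketch of this register double buffering, assuming a 256-thread block streaming over `ntiles` tiles (all names here are mine; the "calculations" are just a toy neighbor sum standing in for real work):

```
__global__ void reg_double_buffer(const float *src, float *dst, int ntiles)
{
    __shared__ float tile[256];
    const int tid  = threadIdx.x;
    const int base = blockIdx.x * ntiles * 256;

    float r = src[base + tid];                   // preload the first set
    for (int t = 0; t < ntiles; ++t) {
        tile[tid] = r;                           // store the set loaded last round
        if (t + 1 < ntiles)
            r = src[base + (t + 1) * 256 + tid]; // issue the next load now; its
                                                 // latency hides behind the compute
        __syncthreads();

        // calculations() - a toy neighbor sum over the shared tile
        dst[base + t * 256 + tid] = tile[tid] + tile[(tid + 1) % 256];

        __syncthreads();                         // my addition: with a single
                                                 // buffer, don't overwrite tile[]
                                                 // while neighbors still read it
    }
}
```

Note the second __syncthreads() at the bottom of the loop is my addition: with a single shared memory buffer it is needed so that no thread overwrites tile[] while a neighbor is still reading it. The double-buffered scheme below gets rid of it.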
Going further, you should schedule stores as far as possible before the next syncthreads() in order to give them more time to finish, so even better code:
…
store set #1 - point 1
load set #2
calculations()
syncthreads()
store set #2 - point 2
load set #1
calculations() using data stored at point 1
syncthreads()
…
Of course, this requires doubling the shared memory usage, since you need separate places for the data stored at points 1 and 2 (see the sketch below). You may select your strategy (increasing the number of thread blocks, using two register sets, using two shared memory buffers) depending on what limits your kernel most - registers, shared memory, or the number of threads.
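Here is the same toy kernel rewritten in this second style, with shared memory doubled so the store at point 1 has a whole compute phase to finish before its data is consumed (again a sketch under the same assumptions, not production code):

```
__global__ void shmem_double_buffer(const float *src, float *dst, int ntiles)
{
    __shared__ float buf[2][256];                // doubled shared memory
    const int tid  = threadIdx.x;
    const int base = blockIdx.x * ntiles * 256;

    float r = src[base + tid];                   // preload tile 0
    buf[0][tid] = r;                             // "point 1" store for tile 0
    if (ntiles > 1)
        r = src[base + 256 + tid];               // start loading tile 1
    __syncthreads();

    for (int t = 1; t < ntiles; ++t) {
        buf[t & 1][tid] = r;                     // "point 2" store for tile t
        if (t + 1 < ntiles)
            r = src[base + (t + 1) * 256 + tid]; // issue the next load early
        // calculations() on the buffer filled in the PREVIOUS iteration;
        // its store has had a full compute phase to complete by now
        int p = (t - 1) & 1;
        dst[base + (t - 1) * 256 + tid] = buf[p][tid] + buf[p][(tid + 1) % 256];
        __syncthreads();                         // only one sync per iteration
    }

    // epilogue: consume the last tile
    int p = (ntiles - 1) & 1;
    dst[base + (ntiles - 1) * 256 + tid] = buf[p][tid] + buf[p][(tid + 1) % 256];
}
```

Each iteration stores into the buffer that nobody reads during that iteration and computes from the one stored an iteration earlier, which is why a single syncthreads() per iteration is enough here.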