about shared memory's contribution to performance when global memory access is coalesced

Hi everyone:

I has two kernels, which are almost the same except an operation of loading from the global memory and then making a multiplication(bold text). I timed this 2 kernels, which displayed that the kernel using the shared memory is 1/4 faster than the other. The following is the defination of these 2 kernels.

According to the CUDA docs, if the global memory access is coalesced, 16 independent memory transactions will be merged into 1 memory transaction, which will result in highly efficient memory access. But what is the shared memory’s contribution to the performance if the memory access has coalesced? Under such a condition, why is there a performance difference between using shared memory and absence of shared memory? These are my puzzle, can anybody help me figure them out?

Thanks in advance!