About compute accuracy

Really check whether the input data to mma is the same. You should have found the bug with the double indirection, and perhaps there is another bug.

I’m still confused about this. I know access to global memory should be coalesced to 32B/64B/128B, but will the three choices lead to a significant difference in efficiency?

If you just copy from global memory to shared memory, there should not be much difference, with slightly better efficiency for the larger sizes.
(16-bit accesses, on the other hand, really are slower, or at least cost more L1 bandwidth.)
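
A minimal sketch (hypothetical kernel names and tile size) contrasting a 32-bit and a 128-bit global-to-shared copy. Per warp, both widths coalesce into full cache-line transactions, so the copy itself performs about the same; the wider load just issues fewer instructions:

```cpp
#include <cuda_fp16.h>

__global__ void copy_32bit(const __half2* __restrict__ src, float* sink)
{
    __shared__ __half2 tile[1024];                 // 2048 halves = 4 KiB
    int t = threadIdx.x;
    for (int i = t; i < 1024; i += blockDim.x)
        tile[i] = src[blockIdx.x * 1024 + i];      // 4-byte load per thread
    __syncthreads();
    if (t == 0) sink[blockIdx.x] = __low2float(tile[0]);  // keep the copy live
}

__global__ void copy_128bit(const float4* __restrict__ src, float* sink)
{
    __shared__ float4 tile[256];                   // same 4 KiB tile
    int t = threadIdx.x;
    for (int i = t; i < 256; i += blockDim.x)
        tile[i] = src[blockIdx.x * 256 + i];       // 16-byte load per thread
    __syncthreads();
    if (t == 0) sink[blockIdx.x] = tile[0].x;      // keep the copy live
}
```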

However, if you want to reorder the data with a kernel before storing to shared memory, it can make a difference, as the data is read by different threads in each of the three cases:

The lowest coalescing size is 32 bytes.
E.g. you can use 128-bit accesses so that each pair of neighbouring threads loads 32 consecutive bytes (16 bytes, i.e. 128 bits, per thread), and then do a shuffle between the two threads to move the data from thread 1 to thread 0. That way consecutive memory locations end up in the same thread, which can be helpful for combining two specific half values into one 32-bit value before storing to shared memory, as sketched below.
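
A minimal sketch (hypothetical layout, not tied to any particular mma fragment) of that pairing idea: each pair of neighbouring lanes loads 32 consecutive bytes with two 128-bit accesses, then `__shfl_xor_sync` with lane mask 1 exchanges the halves so each lane also sees its partner's 16 bytes and can pack two chosen half values into one 32-bit word before writing it to shared memory. The index arithmetic is purely illustrative:

```cpp
#include <cuda_fp16.h>

__global__ void reorder_pairs(const __half* __restrict__ src, unsigned* dst)
{
    __shared__ unsigned smem[32];                       // one packed word per lane

    int lane = threadIdx.x & 31;

    // 128-bit load: each lane grabs 8 consecutive halves (16 bytes).
    // Lanes 2k and 2k+1 together cover 32 consecutive bytes.
    int4 mine = reinterpret_cast<const int4*>(src)[blockIdx.x * 32 + lane];

    // Exchange with the neighbouring lane (lane ^ 1). Afterwards each lane
    // holds both its own 16 bytes and its partner's 16 bytes.
    int4 partner;
    partner.x = __shfl_xor_sync(0xffffffffu, mine.x, 1);
    partner.y = __shfl_xor_sync(0xffffffffu, mine.y, 1);
    partner.z = __shfl_xor_sync(0xffffffffu, mine.z, 1);
    partner.w = __shfl_xor_sync(0xffffffffu, mine.w, 1);

    // Example recombination: take the first half from "mine" and the first
    // half from "partner" (16 bytes further along in memory, for even lanes)
    // and pack them into one 32-bit value.
    __half a = reinterpret_cast<const __half*>(&mine)[0];
    __half b = reinterpret_cast<const __half*>(&partner)[0];
    __half2 packed = __halves2half2(a, b);

    smem[lane] = *reinterpret_cast<unsigned*>(&packed);
    __syncthreads();

    if (lane == 0) dst[blockIdx.x] = smem[0];           // keep the result live
}
```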