Matrix Transposition: Why are the number of Shared Store/Load Transactions different in a CUDA Kernel?

I am experimenting with the cuda kernel for task 2 from cuda-training-series (btw: many thanks to NVIDIA and everyone else making this series available).

In the solution, we have

task2

[...]
smemArray[threadIdx.x][threadIdx.y] = \
    a[INDX(tileX+threadIdx.x, tileY+threadIdx.y, m)];
[...]
c[INDX(tileY+threadIdx.x, tileX+threadIdx.y, m)] = \
    smemArray[threadIdx.y][threadIdx.x];
[...]

This is the output I got from the profiler tool

smem_cuda_transpose(int, const double *, double *), 2022-Jun-08 17:30:31, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio    sector/request                              8
    l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_st.ratio    sector/request                              8
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum                                                           88
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum                                                   16,012,462
    l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum                                                        1,048,664
    l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum                                                       17,061,038
    l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                request                        524,288
    l1tex__t_requests_pipe_lsu_mem_global_op_st.sum                                request                        524,288
    l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                  sector                      4,194,304
    l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum                                  sector                      4,194,304
    smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct                        %                            100
    smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct                        %                            100
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                           %                          11.76
    ---------------------------------------------------------------------- --------------- ------------------------------

Additionally, I am experimenting with a variation of this solution that has a different indexing for the shared memory.

task2b

[...]
smemArray[threadIdx.y][threadIdx.x] = \
    a[INDX(tileX+threadIdx.x, tileY+threadIdx.y, m)];
[...]
c[INDX(tileY+threadIdx.x, tileX+threadIdx.y, m)] = \
    smemArray[threadIdx.x][threadIdx.y];
[...]

Let me put in a nicer format the extra metrics added for task 2:

Kernel task2 task2b
Shared Load Bank Conflicts 88 15,830,121
Shared Store Bank Conflicts 16,012,462 272,746
Shared Load Transactions 1,048,664 16,878,697
Shared Store Transactions 17,061,038 1,321,322
Shared Memory (efficiency %) 11.76 11.76

Questions

  1. If we have 4096 x 4096 matrix, total of 16,777,216 elements, for which each of them is stored and loaded only once to, and from, the shared memory. How come we have more transactions than elements (e.g., 17M+ Store transactions for task2, 18.8M+ Load Transactions for task2b)?

  2. Why the number of store and load transactions differ? Shouldn’t it be a 1:1 ratio in these cases?

  3. Why are the conflicts mostly in one direction (task2: Store Bank Conflicts; task2b Load )?

  4. I noticed these metrics slightly when profiling over and over the same kernel. How come the number of Shared Load Bank Conflicts vary based on run time? Is that due to the property of not-locked instructions within a warp that recent architectures have?