Matrix Transposition: Why do the numbers of Shared Store/Load Transactions differ in a CUDA Kernel?

I am experimenting with the CUDA kernel for task 2 from the cuda-training-series (by the way: many thanks to NVIDIA and everyone else for making this series available).

In the solution, we have

task2

[...]
smemArray[threadIdx.x][threadIdx.y] = \
    a[INDX(tileX+threadIdx.x, tileY+threadIdx.y, m)];
[...]
c[INDX(tileY+threadIdx.x, tileX+threadIdx.y, m)] = \
    smemArray[threadIdx.y][threadIdx.x];
[...]
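
For reference, this is roughly how I think those excerpts fit into the full kernel. This is only a sketch, not the exact solution code; the TILE_DIM value, the column-major INDX macro, and the launch configuration are assumptions on my part:

#define TILE_DIM 32
#define INDX( row, col, ld ) ( ( (col) * (ld) ) + (row) )   // column-major indexing (assumed)

// Assumed launch configuration: blockDim = dim3(TILE_DIM, TILE_DIM),
// gridDim = dim3(m / TILE_DIM, m / TILE_DIM) for an m x m matrix.
__global__ void smem_cuda_transpose( const int m, const double *a, double *c )
{
    __shared__ double smemArray[TILE_DIM][TILE_DIM];

    const int tileX = blockIdx.x * TILE_DIM;
    const int tileY = blockIdx.y * TILE_DIM;

    // Stage one tile of 'a' in shared memory (task2 indexing).
    smemArray[threadIdx.x][threadIdx.y] =
        a[INDX( tileX + threadIdx.x, tileY + threadIdx.y, m )];

    __syncthreads();

    // Write the transposed tile to 'c'.
    c[INDX( tileY + threadIdx.x, tileX + threadIdx.y, m )] =
        smemArray[threadIdx.y][threadIdx.x];
}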

This is the output I got from the profiler tool (Nsight Compute):

smem_cuda_transpose(int, const double *, double *), 2022-Jun-08 17:30:31, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio    sector/request                              8
    l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_st.ratio    sector/request                              8
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum                                                           88
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum                                                   16,012,462
    l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum                                                        1,048,664
    l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum                                                       17,061,038
    l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                request                        524,288
    l1tex__t_requests_pipe_lsu_mem_global_op_st.sum                                request                        524,288
    l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                  sector                      4,194,304
    l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum                                  sector                      4,194,304
    smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct                        %                            100
    smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct                        %                            100
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                           %                          11.76
    ---------------------------------------------------------------------- --------------- ------------------------------

Additionally, I am experimenting with a variation of this solution that uses different indexing for the shared memory.

task2b

[...]
smemArray[threadIdx.y][threadIdx.x] = \
    a[INDX(tileX+threadIdx.x, tileY+threadIdx.y, m)];
[...]
c[INDX(tileY+threadIdx.x, tileX+threadIdx.y, m)] = \
    smemArray[threadIdx.x][threadIdx.y];
[...]

Let me put the relevant metrics for both kernels in a nicer format:

Kernel                        task2         task2b
Shared Load Bank Conflicts    88            15,830,121
Shared Store Bank Conflicts   16,012,462    272,746
Shared Load Transactions      1,048,664     16,878,697
Shared Store Transactions     17,061,038    1,321,322
Shared Memory Efficiency (%)  11.76         11.76

Questions

  1. We have a 4096 x 4096 matrix, 16,777,216 elements in total, and each element is stored to and loaded from shared memory exactly once. How come we have more transactions than elements (e.g., 17M+ store transactions for task2, 16.8M+ load transactions for task2b)?

  2. Why do the numbers of store and load transactions differ? Shouldn't the ratio be 1:1 in these cases?

  3. Why do the conflicts occur mostly in one direction (task2: store bank conflicts; task2b: load bank conflicts)?

  4. I noticed these metrics vary slightly when profiling the same kernel over and over. How come the number of Shared Load Bank Conflicts varies from run to run? Is that due to recent architectures no longer executing the threads of a warp in strict lockstep?

What is the GPU you are using?

Can you attach the NCU report for:

ncu --set full -o task2.ncu-rep ./task2

The GPU profiler should not be considered a numerically precise measure of activity when compared to the numbers you would produce from code analysis. The metrics may be captured in ways that result in slight mismatches between the measured and the actual results. For example, some metrics are measured on a single SM and then scaled by the number of SMs on the GPU. With an uneven distribution of blocks, this can lead to slightly incorrect measurements.

A bank conflict, as we are seeing here, results in serialization of data retrieval or storage. In one case the bank conflicts occur on the load operations, so the number of load transactions increases dramatically due to the 16-way bank conflict introduced (a single request is serialized into roughly 16 times as many transactions as a conflict-free request). In the other case the bank conflicts occur on the store operations, with a corresponding increase in the store transactions.
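
As a quick back-of-the-envelope check (the 32x32 tile of doubles, the 32 four-byte banks, and the 128-byte shared-memory wavefront size are assumptions on my part, not values taken from your report), the ideal and fully-conflicted wavefront counts land close to your measured numbers:

#include <cstdio>

int main( void )
{
    const long long elements       = 4096LL * 4096LL;                         // 16,777,216 doubles
    const long long warp_requests  = elements / 32;                           // 524,288 shared requests
    const long long bytes_per_warp = 32 * 8;                                  // 32 doubles = 256 bytes
    const long long ideal          = warp_requests * ( bytes_per_warp / 128 );// 2 wavefronts per request
    const long long conflicted     = ideal * 16;                              // 16-way serialization

    printf( "ideal wavefronts      : %lld\n", ideal );       // 1,048,576  (~ measured 1,048,664)
    printf( "conflicted wavefronts : %lld\n", conflicted );  // 16,777,216 (~ measured 17,061,038)
    return 0;
}

This also explains why the transaction count exceeds the element count: each warp request is counted as multiple wavefronts once it has to be replayed across banks, even though each element is only touched once.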

This is due to the "orientation" of the data being requested from/to shared memory. Shared memory works efficiently when the data being requested (per thread) comes from different banks. As indicated in the training material, we can think of banks as "columns" in shared memory when shared memory is imagined as a 2D array. The "efficient" direction is the row-wise, "horizontal" direction, where (in this case) adjacent threads are reading from adjacent banks. The "inefficient" direction in this mental model is the column-wise, "vertical" direction, where multiple threads are reading from the same bank. Due to the nature of the transpose operation, the code as written must read "row-wise" and write "column-wise" (or vice versa), since that is what transposing the data requires. Therefore one of the operations will be much more heavily encumbered by bank conflicts, namely the one doing the "column-wise" access.
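
To make the bank mapping concrete, here is a small host-side sketch (again assuming 32 banks of 4 bytes and a 32x32 tile of 8-byte doubles; these are the usual values, not something taken from your report). It prints the bank that the first few threads of a warp start on in each orientation:

#include <cstdio>

int main( void )
{
    const int TILE = 32, BANKS = 32, BANK_BYTES = 4, ELEM_BYTES = 8;

    for ( int t = 0; t < 4; t++ )    // first few threads of a warp
    {
        // Row-wise: smemArray[0][t] -> consecutive doubles, so the starting
        // bank advances with the thread index.
        int rowwise_bank = ( t * ELEM_BYTES / BANK_BYTES ) % BANKS;

        // Column-wise: smemArray[t][0] -> a stride of TILE doubles (256 bytes)
        // between adjacent threads, so every thread starts on the same bank.
        int colwise_bank = ( t * TILE * ELEM_BYTES / BANK_BYTES ) % BANKS;

        printf( "thread %d: row-wise bank %2d, column-wise bank %2d\n",
                t, rowwise_bank, colwise_bank );
    }
    return 0;
}

The column-wise accesses all start on bank 0, which is exactly the serialization described above.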

See my response to the first question. The measurement process is not exact, and things like measurement scaling can be perturbed by run-to-run variation in the distribution of blocks by the GPU block scheduler, for example.