Matrix Transposition: Why do the numbers of Shared Store/Load Transactions differ in a CUDA Kernel?

I am experimenting with the CUDA kernel for task 2 from the cuda-training-series (by the way: many thanks to NVIDIA and everyone else for making this series available).

In the solution, we have

task2

[...]
smemArray[threadIdx.x][threadIdx.y] = \
    a[INDX(tileX+threadIdx.x, tileY+threadIdx.y, m)];
[...]
c[INDX(tileY+threadIdx.x, tileX+threadIdx.y, m)] = \
    smemArray[threadIdx.y][threadIdx.x];
[...]
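
For reference, this is roughly how I think those excerpts fit into the full kernel. This is only a sketch, not the exact solution code; the TILE_DIM value, the column-major INDX macro, and the launch configuration are assumptions on my part:

#define TILE_DIM 32
#define INDX( row, col, ld ) ( ( (col) * (ld) ) + (row) )   // column-major indexing (assumed)

// Assumed launch configuration: blockDim = dim3(TILE_DIM, TILE_DIM),
// gridDim = dim3(m / TILE_DIM, m / TILE_DIM) for an m x m matrix.
__global__ void smem_cuda_transpose( const int m, const double *a, double *c )
{
    __shared__ double smemArray[TILE_DIM][TILE_DIM];

    const int tileX = blockIdx.x * TILE_DIM;
    const int tileY = blockIdx.y * TILE_DIM;

    // Stage one tile of 'a' in shared memory (task2 indexing).
    smemArray[threadIdx.x][threadIdx.y] =
        a[INDX( tileX + threadIdx.x, tileY + threadIdx.y, m )];

    __syncthreads();

    // Write the transposed tile to 'c'.
    c[INDX( tileY + threadIdx.x, tileX + threadIdx.y, m )] =
        smemArray[threadIdx.y][threadIdx.x];
}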

This is the output I got from the profiler tool (Nsight Compute):

smem_cuda_transpose(int, const double *, double *), 2022-Jun-08 17:30:31, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio    sector/request                              8
    l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_st.ratio    sector/request                              8
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum                                                           88
    l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum                                                   16,012,462
    l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum                                                        1,048,664
    l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum                                                       17,061,038
    l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                request                        524,288
    l1tex__t_requests_pipe_lsu_mem_global_op_st.sum                                request                        524,288
    l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                  sector                      4,194,304
    l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum                                  sector                      4,194,304
    smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct                        %                            100
    smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct                        %                            100
    smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct                           %                          11.76
    ---------------------------------------------------------------------- --------------- ------------------------------

Additionally, I am experimenting with a variation of this solution that uses different indexing for the shared memory.

task2b

[...]
smemArray[threadIdx.y][threadIdx.x] = \
    a[INDX(tileX+threadIdx.x, tileY+threadIdx.y, m)];
[...]
c[INDX(tileY+threadIdx.x, tileX+threadIdx.y, m)] = \
    smemArray[threadIdx.x][threadIdx.y];
[...]

Let me put the relevant metrics for both kernels in a nicer format:

Kernel                        task2         task2b
Shared Load Bank Conflicts    88            15,830,121
Shared Store Bank Conflicts   16,012,462    272,746
Shared Load Transactions      1,048,664     16,878,697
Shared Store Transactions     17,061,038    1,321,322
Shared Memory Efficiency (%)  11.76         11.76

Questions

  1. We have a 4096 x 4096 matrix, 16,777,216 elements in total, and each element is stored to and loaded from shared memory exactly once. How come we have more transactions than elements (e.g., 17M+ store transactions for task2, 16.8M+ load transactions for task2b)?

  2. Why do the numbers of store and load transactions differ? Shouldn't the ratio be 1:1 in these cases?

  3. Why do the conflicts occur mostly in one direction (task2: store bank conflicts; task2b: load bank conflicts)?

  4. I noticed these metrics vary slightly when profiling the same kernel over and over. How come the number of Shared Load Bank Conflicts varies from run to run? Is that due to recent architectures no longer executing the threads of a warp in strict lockstep?

What is the GPU you are using?

Can you attach the NCU report for:

ncu --set full -o task2.ncu-rep ./task2

The GPU profiler should not be considered a numerically precise measure of activity when compared to the numbers you would produce from code analysis. The metrics may be captured in ways that result in slight mismatches between the measured and the actual results. For example, some metrics are measured on a single SM and then scaled by the number of SMs on the GPU. With an uneven distribution of blocks, this can lead to slightly incorrect measurements.

A bank conflict, as we are seeing here, results in serialization of data retrieval or storage. In one case the bank conflicts occur on the load operations, so the number of load transactions increases dramatically due to the 16-way bank conflict introduced (a single request is serialized into roughly 16 times as many transactions as a conflict-free request). In the other case the bank conflicts occur on the store operations, with a corresponding increase in the store transactions.
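
As a quick back-of-the-envelope check (the 32x32 tile of doubles, the 32 four-byte banks, and the 128-byte shared-memory wavefront size are assumptions on my part, not values taken from your report), the ideal and fully-conflicted wavefront counts land close to your measured numbers:

#include <cstdio>

int main( void )
{
    const long long elements       = 4096LL * 4096LL;                         // 16,777,216 doubles
    const long long warp_requests  = elements / 32;                           // 524,288 shared requests
    const long long bytes_per_warp = 32 * 8;                                  // 32 doubles = 256 bytes
    const long long ideal          = warp_requests * ( bytes_per_warp / 128 );// 2 wavefronts per request
    const long long conflicted     = ideal * 16;                              // 16-way serialization

    printf( "ideal wavefronts      : %lld\n", ideal );       // 1,048,576  (~ measured 1,048,664)
    printf( "conflicted wavefronts : %lld\n", conflicted );  // 16,777,216 (~ measured 17,061,038)
    return 0;
}

This also explains why the transaction count exceeds the element count: each warp request is counted as multiple wavefronts once it has to be replayed across banks, even though each element is only touched once.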

This is due to the "orientation" of the data being requested from/to shared memory. Shared memory works efficiently when the data being requested (per thread) comes from different banks. As indicated in the training material, we can think of banks as "columns" in shared memory when shared memory is imagined as a 2D array. The "efficient" direction is the row-wise, "horizontal" direction, where (in this case) adjacent threads are reading from adjacent banks. The "inefficient" direction in this mental model is the column-wise, "vertical" direction, where multiple threads are reading from the same bank. Due to the nature of the transpose operation, the code as written must read "row-wise" and write "column-wise" (or vice versa), since that is what transposing the data requires. Therefore one of the operations will be much more heavily encumbered by bank conflicts, namely the one doing the "column-wise" access.
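
To make the bank mapping concrete, here is a small host-side sketch (again assuming 32 banks of 4 bytes and a 32x32 tile of 8-byte doubles; these are the usual values, not something taken from your report). It prints the bank that the first few threads of a warp start on in each orientation:

#include <cstdio>

int main( void )
{
    const int TILE = 32, BANKS = 32, BANK_BYTES = 4, ELEM_BYTES = 8;

    for ( int t = 0; t < 4; t++ )    // first few threads of a warp
    {
        // Row-wise: smemArray[0][t] -> consecutive doubles, so the starting
        // bank advances with the thread index.
        int rowwise_bank = ( t * ELEM_BYTES / BANK_BYTES ) % BANKS;

        // Column-wise: smemArray[t][0] -> a stride of TILE doubles (256 bytes)
        // between adjacent threads, so every thread starts on the same bank.
        int colwise_bank = ( t * TILE * ELEM_BYTES / BANK_BYTES ) % BANKS;

        printf( "thread %d: row-wise bank %2d, column-wise bank %2d\n",
                t, rowwise_bank, colwise_bank );
    }
    return 0;
}

The column-wise accesses all start on bank 0, which is exactly the serialization described above.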

See my response to the first question. The measurement process is not exact, and things like measurement scaling can be perturbed by run-to-run variation in the distribution of blocks by the GPU block scheduler, for example.