I am experimenting with the CUDA kernel for task 2 from the cuda-training-series (btw: many thanks to NVIDIA and everyone else making this series available).
In the solution, we have
task2
[...]
smemArray[threadIdx.x][threadIdx.y] = \
a[INDX(tileX+threadIdx.x, tileY+threadIdx.y, m)];
[...]
c[INDX(tileY+threadIdx.x, tileX+threadIdx.y, m)] = \
smemArray[threadIdx.y][threadIdx.x];
[...]
This is the output I got from the profiler tool (Nsight Compute):
smem_cuda_transpose(int, const double *, double *), 2022-Jun-08 17:30:31, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio sector/request 8
l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_st.ratio sector/request 8
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum 88
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum 16,012,462
l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum 1,048,664
l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum 17,061,038
l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum request 524,288
l1tex__t_requests_pipe_lsu_mem_global_op_st.sum request 524,288
l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum sector 4,194,304
l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum sector 4,194,304
smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct % 100
smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct % 100
smsp__sass_average_data_bytes_per_wavefront_mem_shared.pct % 11.76
---------------------------------------------------------------------- --------------- ------------------------------
Additionally, I am experimenting with a variation of this solution that uses a different indexing for the shared memory.
task2b
[...]
smemArray[threadIdx.y][threadIdx.x] = \
a[INDX(tileX+threadIdx.x, tileY+threadIdx.y, m)];
[...]
c[INDX(tileY+threadIdx.x, tileX+threadIdx.y, m)] = \
smemArray[threadIdx.x][threadIdx.y];
[...]
Let me put the extra metrics for both kernels in a nicer format:
| Kernel | task2 | task2b |
|---|---|---|
| Shared Load Bank Conflicts | 88 | 15,830,121 |
| Shared Store Bank Conflicts | 16,012,462 | 272,746 |
| Shared Load Transactions | 1,048,664 | 16,878,697 |
| Shared Store Transactions | 17,061,038 | 1,321,322 |
| Shared Memory Efficiency (%) | 11.76 | 11.76 |
Questions

1. For a 4096 x 4096 matrix, a total of 16,777,216 elements, each element is stored to and loaded from shared memory exactly once. How come we have more transactions than elements (e.g., 17M+ store transactions for `task2`, 16.8M+ load transactions for `task2b`)?
2. Why do the numbers of store and load transactions differ? Shouldn't the ratio be 1:1 in these cases?
3. Why are the conflicts mostly in one direction (`task2`: store bank conflicts; `task2b`: load bank conflicts)?
4. I noticed these metrics vary slightly when profiling the same kernel over and over. How come the number of Shared Load Bank Conflicts varies between runs? Is that due to the independent thread scheduling (threads within a warp no longer executing in lock-step) that recent architectures have?