Understanding “shared_ld_transactions” in context of vector loads

v.miheer · September 19, 2020, 12:25am

We are trying to understand the meaning of the “shared_ld_transactions” nvprof event in the context of vector shared memory loads on sm_60.

The CUDA C++ Programming Guide says:

Shared memory has 32 banks that are organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per clock cycle.

We have a kernel that performs the following access:

reinterpret_cast<float2*>(reg)[0] = reinterpret_cast<float2*>(sbuf)[threadIdx.x%Input_Side_Width];

Suppose Input_Side_Width is 4:

The total amount of unique data accessed by the warp from shared memory is: 4 (distinct float2 elements) * 2 (floats per float2) * 4 (sizeof(float)) = 32 bytes.
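
The arithmetic above can be checked with a small host-side model of the bank mapping. This is only a sketch under stated assumptions: it assumes sbuf starts at bank 0 and holds 4-byte floats, it uses the 32-bank, 32-bit-word layout quoted from the programming guide, and the names Footprint and model_float2_load are hypothetical helpers made up for this illustration. It models addresses only and does not measure or predict what the hardware or the profiler reports.

```cpp
#include <cassert>
#include <map>
#include <set>

// Host-side model of the float2 access pattern (not a hardware measurement).
struct Footprint {
    int unique_bytes;    // distinct bytes of shared memory touched by the warp
    int banks_touched;   // number of distinct banks touched
    bool bank_conflict;  // true if any bank is asked for two different words
};

// Hypothetical helper: models sbuf[lane % input_side_width] read as float2,
// assuming sbuf is aligned to bank 0.
Footprint model_float2_load(int input_side_width) {
    assert(input_side_width > 0);
    const int kWarpSize = 32;       // threads per warp
    const int kBanks = 32;          // shared memory banks
    const int kWordBytes = 4;       // each bank word is 32 bits
    const int kWordsPerFloat2 = 2;  // a float2 spans two 32-bit words

    std::set<int> words;                           // distinct words touched
    std::map<int, std::set<int>> words_per_bank;   // bank -> distinct words

    for (int lane = 0; lane < kWarpSize; ++lane) {
        int elem = lane % input_side_width;        // float2 index for this lane
        for (int w = 0; w < kWordsPerFloat2; ++w) {
            int word = elem * kWordsPerFloat2 + w; // 32-bit word index
            words.insert(word);
            words_per_bank[word % kBanks].insert(word);
        }
    }

    Footprint f{};
    f.unique_bytes = static_cast<int>(words.size()) * kWordBytes;
    f.banks_touched = static_cast<int>(words_per_bank.size());
    f.bank_conflict = false;
    for (const auto& entry : words_per_bank) {
        // Two different words in the same bank would be a conflict;
        // several lanes reading the same word is a broadcast, not a conflict.
        if (entry.second.size() > 1) f.bank_conflict = true;
    }
    return f;
}
```

Under these assumptions, model_float2_load(4) gives 32 unique bytes spread over 8 banks with no bank holding two different words, which matches the 32-byte figure above. The model says nothing about how the load is split into transactions.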

The access pattern conforms to the broadcast mechanism in Figure 18.

$ nvprof --query-events says:

shared_ld_transactions: Number of transactions for shared load accesses. Maximum transaction size in maxwell is 128 bytes, any warp accessing more that 128 bytes will cause multiple transactions for a shared load instruction. This also includes extra transactions caused by shared bank conflicts.

The request size in the access above is clearly less than 128 bytes (32 bytes). Furthermore, the nvprof event “shared_ld_bank_conflict” is 0 in this case.

Based on the programming guide excerpt and the nvprof event description above, we would expect a single transaction to satisfy the request, but the profiler shows two transactions.
We are not able to figure out what we are missing.
Thank you,
+miheer
