We are trying to understand meaning of “shared_ld_transaction” nvprof event in context of vector shared memory loads on sm_60.
Shared memory has 32 banks that are organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per clock cycle.
We have a kernel which is does following access:
reinterpret_cast<float2*>(reg) = reinterpret_cast<float2*>(sbuf)[threadIdx.x%Input_Side_Width];
Input_Side_Width is 4:
The total amount of unique data accessed by warp from shared memory is: 4 (distinct threads) * 2 (float 2) * 4 (sizeof(float)) = 32 bytes.
- The access pattern confirms to the broadcast mechanism in Figure 18.
- $ nvprof --query-events says:
shared_ld_transactions: Number of transactions for shared load accesses. Maximum transaction size in maxwell is 128 bytes, any warp accessing more that 128 bytes will cause multiple transactions for a shared load instruction. This also includes extra transactions caused by shared bank conflicts.
Request size in above access is clearly less than 128 bytes (32 bytes), furthermore nvprof event “shared_ld_bank_conflict” is 0 in this case.
According to 1 and 2, we would expect single transaction to satisfy the request, but profiler shows two transactions.
We are not able to figure out what we are missing.