Understanding “shared_ld_transactions” in context of vector loads


We are trying to understand meaning of “shared_ld_transaction” nvprof event in context of vector shared memory loads on sm_60.

Looking at https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-6-x which points to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-5-x,

Shared memory has 32 banks that are organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per clock cycle.

We have a kernel which is does following access:

reinterpret_cast<float2*>(reg)[0] = reinterpret_cast<float2*>(sbuf)[threadIdx.x%Input_Side_Width];

Consider Input_Side_Width is 4:

The total amount of unique data accessed by warp from shared memory is: 4 (distinct threads) * 2 (float 2) * 4 (sizeof(float)) = 32 bytes.

  1. The access pattern confirms to the broadcast mechanism in Figure 18.
  2. $ nvprof --query-events says:

    shared_ld_transactions: Number of transactions for shared load accesses. Maximum transaction size in maxwell is 128 bytes, any warp accessing more that 128 bytes will cause multiple transactions for a shared load instruction. This also includes extra transactions caused by shared bank conflicts.

Request size in above access is clearly less than 128 bytes (32 bytes), furthermore nvprof event “shared_ld_bank_conflict” is 0 in this case.

According to 1 and 2, we would expect single transaction to satisfy the request, but profiler shows two transactions.
We are not able to figure out what we are missing.
Thanking you,