Perplexed by Global Load Transactions Per Request in P100

Hi,

I am trying to make sense of “Global Load Transactions Per Request” in P100. The short version of the main question is what exactly is the size of global memory load transactions (is it 8/16/32 bytes) and if this is variable how is it determined. According to the tuning guide https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html#l1-cache both should be 32 bytes. (sorry in advance for the very long question).

The experiments were ran on a P100 GPU and is compiled with

nvcc -arch=sm_60 test.cu -O3

Here is the kernel (just a simple copy kernel). In all experiments the number of thread blocks is 1 and the threads per block is varied.

__global__ void testkernel(float *a, float *b, float *c, float n, float stride)
{
    int id = threadIdx.x; //I only have 1 thread block
    c[id] = a[id];
}

The kernel is launched as follows

//conf 1
testkernel<<<1, 8>>>(d_a, d_b, d_c, n, stride);

and nvprof returns

~
          1                          gld_transactions                  Global Load Transactions          10          10          10
          1                          gst_transactions                 Global Store Transactions           1           1           1
          1              gld_transactions_per_request      Global Load Transactions Per Request   10.000000   10.000000   10.000000
          1              gst_transactions_per_request     Global Store Transactions Per Request    1.000000    1.000000    1.000000

I was expecting “gld_transactions” to be 1 (just like ‘gst_transactions’). And 10 for gld_transactions was weired. Hence I modified the kernel as follows

__global__ void vecAdd(float *a, float *b, float *c, float n, float stride)
{
    int id = threadIdx.x;

    if(id>1000) // since we are launching 8 threads the `if' evaluates to false for all threads
       c[id] = a[id];
}

//conf 2
testkernel<<<1, 8>>>(d_a, d_b, d_c, n, stride);

and nvprof returns

.
          1                          gld_transactions                  Global Load Transactions           2           2           2
          1                          gst_transactions                 Global Store Transactions           0           0           0
          1              gld_transactions_per_request      Global Load Transactions Per Request    0.000000    0.000000    0.000000
          1              gst_transactions_per_request     Global Store Transactions Per Request    0.000000    0.000000    0.000000

From this experiment I think for some reason, the minimum number of “gld_transactions” is 2. Now this means that the actual number of gld_transactions in “conf1” is 8, which implies read transaction was of size 4 bytes. For the rest of the experiments I removed the if condition.

Things get even more confusing. if the above was true we should expect 16+2 transaction when there are 16 threads. But nvprof tells there are only 8+2 transactions

//conf 3 
testkernel<<<1, 16>>>(d_a, d_b, d_c, n, stride);
.
          1                          gld_transactions                  Global Load Transactions          10          10          10
          1                          gst_transactions                 Global Store Transactions           2           2           2
          1              gld_transactions_per_request      Global Load Transactions Per Request   10.000000   10.000000   10.000000
          1              gst_transactions_per_request     Global Store Transactions Per Request    2.000000    2.000000    2.000000

May be the transactions were of 8 bytes now. So what should we expect for 17 threads per block?. Well 16 threads may collectively perform 8 * 8 bytes transactions and the last thread may do a 4 byte transaction. So total should be (8+1+2). And that’s not what happens

//conf 4
testkernel<<<1, 17>>>(d_a, d_b, d_c, n, stride);
          1                          gld_transactions                  Global Load Transactions          14          14          14
          1                          gst_transactions                 Global Store Transactions           3           3           3
          1              gld_transactions_per_request      Global Load Transactions Per Request   14.000000   14.000000   14.000000
          1              gst_transactions_per_request     Global Store Transactions Per Request    3.000000    3.000000    3.000000

I dont know how to explain the above? But “gld_transactions” remains as 14 till 20 threads and it changes to 18 from 21 threads to 32 threads. And for 32 threads nvprof is as follows

//conf 5
testkernel<<<1, 32>>>(d_a, d_b, d_c, n, stride);

          1                          gld_transactions                  Global Load Transactions          18          18          18
          1                          gst_transactions                 Global Store Transactions           4           4           4
          1              gld_transactions_per_request      Global Load Transactions Per Request   18.000000   18.000000   18.000000
          1              gst_transactions_per_request     Global Store Transactions Per Request    4.000000    4.000000    4.000000

As you can see gld_transactions is 4 times gst_transactions (ignoring the 2 unexplained transactions). This suggests that the size of store transactions is 32 bytes but load is 8 bytes ( or some combination of 8 and 4 bytes). Does anyone know exactly how load transaction size is determined.

Additional: Volta

I also ran this code on Volta V100. There the number of transactions for load and store is as expected

//V100 data for 
testkernel<<<1, 32>>>(d_a, d_b, d_c, n, stride);
)
          1                          gld_transactions                  Global Load Transactions           4           4           4
          1                          gst_transactions                 Global Store Transactions           4           4           4
          1              gld_transactions_per_request      Global Load Transactions Per Request    4.000000    4.000000    4.000000
          1              gst_transactions_per_request     Global Store Transactions Per Request    4.000000    4.000000    4.000000

I have the same problem,I can’t figure it out.