Perplexed by Global Load Transactions Per Request in P100

aravindsr · September 3, 2018, 5:03pm

Hi,

I am trying to make sense of “Global Load Transactions Per Request” in P100. The short version of the main question is what exactly is the size of global memory load transactions (is it 8/16/32 bytes) and if this is variable how is it determined. According to the tuning guide https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html#l1-cache both should be 32 bytes. (sorry in advance for the very long question).

The experiments were ran on a P100 GPU and is compiled with

nvcc -arch=sm_60 test.cu -O3

Here is the kernel (just a simple copy kernel). In all experiments the number of thread blocks is 1 and the threads per block is varied.

__global__ void testkernel(float *a, float *b, float *c, float n, float stride)
{
    int id = threadIdx.x; //I only have 1 thread block
    c[id] = a[id];
}

The kernel is launched as follows

//conf 1
testkernel<<<1, 8>>>(d_a, d_b, d_c, n, stride);

and nvprof returns

~
          1                          gld_transactions                  Global Load Transactions          10          10          10
          1                          gst_transactions                 Global Store Transactions           1           1           1
          1              gld_transactions_per_request      Global Load Transactions Per Request   10.000000   10.000000   10.000000
          1              gst_transactions_per_request     Global Store Transactions Per Request    1.000000    1.000000    1.000000

I was expecting “gld_transactions” to be 1 (just like ‘gst_transactions’). And 10 for gld_transactions was weired. Hence I modified the kernel as follows

__global__ void vecAdd(float *a, float *b, float *c, float n, float stride)
{
    int id = threadIdx.x;

    if(id>1000) // since we are launching 8 threads the `if' evaluates to false for all threads
       c[id] = a[id];
}

//conf 2
testkernel<<<1, 8>>>(d_a, d_b, d_c, n, stride);

and nvprof returns

.
          1                          gld_transactions                  Global Load Transactions           2           2           2
          1                          gst_transactions                 Global Store Transactions           0           0           0
          1              gld_transactions_per_request      Global Load Transactions Per Request    0.000000    0.000000    0.000000
          1              gst_transactions_per_request     Global Store Transactions Per Request    0.000000    0.000000    0.000000

From this experiment I think for some reason, the minimum number of “gld_transactions” is 2. Now this means that the actual number of gld_transactions in “conf1” is 8, which implies read transaction was of size 4 bytes. For the rest of the experiments I removed the if condition.

Things get even more confusing. if the above was true we should expect 16+2 transaction when there are 16 threads. But nvprof tells there are only 8+2 transactions

//conf 3 
testkernel<<<1, 16>>>(d_a, d_b, d_c, n, stride);
.
          1                          gld_transactions                  Global Load Transactions          10          10          10
          1                          gst_transactions                 Global Store Transactions           2           2           2
          1              gld_transactions_per_request      Global Load Transactions Per Request   10.000000   10.000000   10.000000
          1              gst_transactions_per_request     Global Store Transactions Per Request    2.000000    2.000000    2.000000

May be the transactions were of 8 bytes now. So what should we expect for 17 threads per block?. Well 16 threads may collectively perform 8 * 8 bytes transactions and the last thread may do a 4 byte transaction. So total should be (8+1+2). And that’s not what happens

//conf 4
testkernel<<<1, 17>>>(d_a, d_b, d_c, n, stride);
          1                          gld_transactions                  Global Load Transactions          14          14          14
          1                          gst_transactions                 Global Store Transactions           3           3           3
          1              gld_transactions_per_request      Global Load Transactions Per Request   14.000000   14.000000   14.000000
          1              gst_transactions_per_request     Global Store Transactions Per Request    3.000000    3.000000    3.000000

I dont know how to explain the above? But “gld_transactions” remains as 14 till 20 threads and it changes to 18 from 21 threads to 32 threads. And for 32 threads nvprof is as follows

//conf 5
testkernel<<<1, 32>>>(d_a, d_b, d_c, n, stride);

          1                          gld_transactions                  Global Load Transactions          18          18          18
          1                          gst_transactions                 Global Store Transactions           4           4           4
          1              gld_transactions_per_request      Global Load Transactions Per Request   18.000000   18.000000   18.000000
          1              gst_transactions_per_request     Global Store Transactions Per Request    4.000000    4.000000    4.000000

As you can see gld_transactions is 4 times gst_transactions (ignoring the 2 unexplained transactions). This suggests that the size of store transactions is 32 bytes but load is 8 bytes ( or some combination of 8 and 4 bytes). Does anyone know exactly how load transaction size is determined.

Additional: Volta

I also ran this code on Volta V100. There the number of transactions for load and store is as expected

//V100 data for 
testkernel<<<1, 32>>>(d_a, d_b, d_c, n, stride);
)
          1                          gld_transactions                  Global Load Transactions           4           4           4
          1                          gst_transactions                 Global Store Transactions           4           4           4
          1              gld_transactions_per_request      Global Load Transactions Per Request    4.000000    4.000000    4.000000
          1              gst_transactions_per_request     Global Store Transactions Per Request    4.000000    4.000000    4.000000

chengye200503 · January 9, 2020, 5:19am

I have the same problem,I can’t figure it out.

Topic		Replies	Views
Trying to understand Transactions per request for P100 CUDA Programming and Performance	2	1521	February 26, 2018
what does gld_request really mean? CUDA Programming and Performance	5	2259	January 18, 2010
Global load transaction count when in coalesced memory access Visual Profiler and nvprof	3	2268	July 7, 2017
Why reading one byte produces multiple global load l2 transactions? CUDA Programming and Performance	3	1268	August 30, 2018
What does gld_transaction mean in nvprof metrics? Visual Profiler and nvprof	8	2919	August 4, 2016
What is a memory transaction and a request? CUDA Programming and Performance	1	1163	March 6, 2020
coalesced reads CUDA Programming and Performance	3	1433	May 3, 2014
local memory loads limited to 64b gld transfers for CC>1.1 <2.0? CUDA Programming and Performance	3	811	November 20, 2010
Memory transactions CUDA Programming and Performance	1	1092	May 28, 2019
What does gld_transaction mean in nvprof metrics? CUDA Programming and Performance	0	533	July 21, 2016

Perplexed by Global Load Transactions Per Request in P100

Related topics