Hi,
I am trying to make sense of “Global Load Transactions Per Request” in P100. The short version of the main question is what exactly is the size of global memory load transactions (is it 8/16/32 bytes) and if this is variable how is it determined. According to the tuning guide https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html#l1-cache both should be 32 bytes. (sorry in advance for the very long question).
The experiments were ran on a P100 GPU and is compiled with
nvcc -arch=sm_60 test.cu -O3
Here is the kernel (just a simple copy kernel). In all experiments the number of thread blocks is 1 and the threads per block is varied.
__global__ void testkernel(float *a, float *b, float *c, float n, float stride)
{
int id = threadIdx.x; //I only have 1 thread block
c[id] = a[id];
}
The kernel is launched as follows
//conf 1
testkernel<<<1, 8>>>(d_a, d_b, d_c, n, stride);
and nvprof returns
~
1 gld_transactions Global Load Transactions 10 10 10
1 gst_transactions Global Store Transactions 1 1 1
1 gld_transactions_per_request Global Load Transactions Per Request 10.000000 10.000000 10.000000
1 gst_transactions_per_request Global Store Transactions Per Request 1.000000 1.000000 1.000000
I was expecting “gld_transactions” to be 1 (just like ‘gst_transactions’). And 10 for gld_transactions
was weired. Hence I modified the kernel as follows
__global__ void vecAdd(float *a, float *b, float *c, float n, float stride)
{
int id = threadIdx.x;
if(id>1000) // since we are launching 8 threads the `if' evaluates to false for all threads
c[id] = a[id];
}
//conf 2
testkernel<<<1, 8>>>(d_a, d_b, d_c, n, stride);
and nvprof returns
.
1 gld_transactions Global Load Transactions 2 2 2
1 gst_transactions Global Store Transactions 0 0 0
1 gld_transactions_per_request Global Load Transactions Per Request 0.000000 0.000000 0.000000
1 gst_transactions_per_request Global Store Transactions Per Request 0.000000 0.000000 0.000000
From this experiment I think for some reason, the minimum number of “gld_transactions” is 2. Now this means that the actual number of gld_transactions in “conf1” is 8, which implies read transaction was of size 4 bytes. For the rest of the experiments I removed the if condition.
Things get even more confusing. if the above was true we should expect 16+2 transaction when there are 16 threads. But nvprof tells there are only 8+2 transactions
//conf 3
testkernel<<<1, 16>>>(d_a, d_b, d_c, n, stride);
.
1 gld_transactions Global Load Transactions 10 10 10
1 gst_transactions Global Store Transactions 2 2 2
1 gld_transactions_per_request Global Load Transactions Per Request 10.000000 10.000000 10.000000
1 gst_transactions_per_request Global Store Transactions Per Request 2.000000 2.000000 2.000000
May be the transactions were of 8 bytes now. So what should we expect for 17 threads per block?. Well 16 threads may collectively perform 8 * 8 bytes transactions and the last thread may do a 4 byte transaction. So total should be (8+1+2). And that’s not what happens
//conf 4
testkernel<<<1, 17>>>(d_a, d_b, d_c, n, stride);
1 gld_transactions Global Load Transactions 14 14 14
1 gst_transactions Global Store Transactions 3 3 3
1 gld_transactions_per_request Global Load Transactions Per Request 14.000000 14.000000 14.000000
1 gst_transactions_per_request Global Store Transactions Per Request 3.000000 3.000000 3.000000
I dont know how to explain the above? But “gld_transactions” remains as 14 till 20 threads and it changes to 18 from 21 threads to 32 threads. And for 32 threads nvprof is as follows
//conf 5
testkernel<<<1, 32>>>(d_a, d_b, d_c, n, stride);
1 gld_transactions Global Load Transactions 18 18 18
1 gst_transactions Global Store Transactions 4 4 4
1 gld_transactions_per_request Global Load Transactions Per Request 18.000000 18.000000 18.000000
1 gst_transactions_per_request Global Store Transactions Per Request 4.000000 4.000000 4.000000
As you can see gld_transactions is 4 times gst_transactions (ignoring the 2 unexplained transactions). This suggests that the size of store transactions is 32 bytes but load is 8 bytes ( or some combination of 8 and 4 bytes). Does anyone know exactly how load transaction size is determined.
Additional: Volta
I also ran this code on Volta V100. There the number of transactions for load and store is as expected
//V100 data for
testkernel<<<1, 32>>>(d_a, d_b, d_c, n, stride);
)
1 gld_transactions Global Load Transactions 4 4 4
1 gst_transactions Global Store Transactions 4 4 4
1 gld_transactions_per_request Global Load Transactions Per Request 4.000000 4.000000 4.000000
1 gst_transactions_per_request Global Store Transactions Per Request 4.000000 4.000000 4.000000