Trying to understand Transactions per request for P100

Hi,

I am trying to understand transactions per request for P100.

My kernel is

__global__ void vecAdd(char *a, char *b, char *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Guard against threads past the end of the arrays
    if (id < n)
        c[id] = a[id] + 1;
}

As I understand it, each DRAM transaction is 32 bytes. nvprof shows:

Global Load Transactions Per Request 4.0
Global Store Transactions Per Request 1.000000
Device Memory Read Transactions 2097165
Device Memory Write Transactions 2105237

My questions are:

  1. Why is “Global Load Transactions Per Request” 4? Shouldn't it be 1? (The data is aligned.)
  2. Assuming “Global Load Transactions Per Request” is 4, the number of “Device Memory Read Transactions” should be 4 * “Device Memory Write Transactions”. But according to nvprof, the numbers of DRAM read and write transactions are similar. Why is that?

======= part 2 ========

When I change the datatype to double, “Global Load Transactions Per Request” (and the store equivalent) is 8.0. Why is it 8? Shouldn’t it be 2.0?


global != device. By “global” transactions it means operations in the global memory space (as opposed to the local, shared, texture, and constant spaces). So it’s the number of transactions going to the L1 or L2 cache.

My wild guess is that each warp processes 32*4=128 bytes, which is 4 read transactions (because they are 32 bytes long) but 1 write transaction (so those are 128 bytes long).

Then transactions are issued from the L2 cache to DRAM, and these are a different kind of transaction, so they may be measured in entirely different units.

Thanks for the reply.

In the first case the datatype of a, b, and c is char, so a warp only needs 32 bytes. Hence there should be only 1 read transaction (the write count seems correct). Since this benchmark basically streams the values of “c” and “a”, the L2 hit rate is minimal (nvprof shows similar transaction counts for both L2 and DRAM).

"then transactions are emitted from L2 cache to DRAM, and these are OTHER sort of transactions, so they may be measured in entirely different units"

On Kepler, it made sense to look at transactions per request to get an idea of the number of uncoalesced accesses. My questions are then:

  1. How can I quantify the amount of uncoalesced accesses?
  2. What is the exact definition of Transactions Per Request on Pascal? If it is composed of multiple transaction types, how can I decompose it?