L1 above ideal transactions conflicting with L1 transfer overhead?

The Nsight visual profiler, under source view, has a column called ‘L1 above ideal transactions’ which I’m having trouble understanding. I’m not an expert, but I have a good understanding of efficient global memory access, coalescing, et cetera, but this has me stumped. The documentation defines it as ‘Number of 128-byte transactions between the SM and L1 that exceeded the ideal count for the made request,’ but ‘count’ is not defined. Count of what? And the description ends with ‘Use the Instruction Count source-level experiment to identify instructions which execute with less than a full warp,’ so this statistic must have something to do with execution of partial warps. I thought I understood this until I found a line in my code in which this statistic is zero, but ‘L1 transfer overhead’ is 3! How can this be? I realize now that I do not understand it. Can anyone elucidate? Thanks!

“L1 above ideal transactions”

could this not also be interpreted as ‘least possible’ transactions?

if x bytes must be transferred, and a transaction can move y bytes, then under ideal circumstances it should take z transactions, which is also the minimum / least possible number of transactions

under non-ideal circumstances, non-coalescence, etc, it may take c transactions, with c > z

the profiler knows full well the minimum/ least possible transactions, and can thus compare this to the actual transactions

but, i am no expert either

could this not also be interpreted as ‘least possible’ transactions? <<

Little Jimmy - If I’m reading your comment right, then your interpretation is just how I assumed it was defined. But then I saw a line of my code in which this statistic was zero (the usual, as I am careful about memory coalescing), but this same line had an L1 transfer overhead of 3, when the overhead should have been 1, the perfect ratio. I don’t see how I can get these two apparently conflicting stats in the same line of code. So I must be misunderstanding something.

Tim

The definition in the Nsight VSE Analysis Tools documentation states (http://docs.nvidia.com/gameworks/index.html#developertools/desktop/analysis/report/cudaexperiments/sourcelevel/memorytransactions.htm)
L1 Above-Ideal Transactions

Number of 128-byte transactions between the SM and L1 that exceeded the ideal count for the made request. The calculation of ideal transactions is based on aligned, sequential data and takes the predication mask and the active mask into account. For an access pattern where multiple threads access the same data, it is possible to achieve better than ideal transactions. Note that this would result in a value for Above Ideal Transactions that is lower than expected, or even negative. Use the Instruction Count source-level experiment to identify instructions which execute with less than a full warp.

On each instruction the experiment looks at the warp active mask, the instruction predicate, and the per-thread addresses, and calculates the number of SM <-> L1 transactions that are required due to address conflicts. The ideal value is calculated as if each active, not-predicated-off thread accessed a linear address based upon its lane ID. If all lanes were active then the ideal transaction count is

  • 32-bit ld/st requires 1 transaction
  • 64-bit ld/st requires 2 transactions
  • 128-bit ld/st requires 4 transactions

The worst case, if all 32 threads are active, is 32 transactions. This occurs if all 32 threads access a different 128-byte cache line. For a 32-bit ld this would mean a value of 31 “L1 Above-Ideal Transactions” for that instruction.

Without testing the equation I would say that the following code would produce

L1 Above-Ideal Transactions = 0
L1 Transfer Overhead = 3

#include <cstdint>

__global__ void test(uint8_t* pbytes)
{
    int id = threadIdx.x;
    pbytes[id] = 0;
}

int main(void)
{
    uint8_t* pbytes = 0;
    cudaMalloc((void**)&pbytes, 128);   // allocate 128 bytes
    test<<<1, 32>>>(pbytes);
    cudaDeviceSynchronize();
    cudaFree(pbytes);
    return 0;
}

This will access 32 consecutive bytes in 1 transaction. The number of ideal transactions is 1. The number of bytes requested is 32 bytes. The number of bytes read is 128 bytes: (128 bytes − 32 bytes) / 32 bytes = 3.

I do not think the requested-bytes experiment calculates the number of unique bytes in the cache line that were accessed, so the same result may be given if all 32 threads access the same byte in the cache line.

__global__ void test2(uint8_t* pbytes)
{
    *pbytes = 0;
}

Greg - Thank you! That clarifies the situation.

Tim