P100, relationship between global transactions and texture reads inside the Unified L1/Texture memor...

Hello everybody!
I’m currently working on a “Tesla P100-SXM2-16GB” and having a tough time trying to understand how the Unified L1/texture memory works. I’ve run the cuda stream-benchmark and analyzed the Copy kernel using the Visual Profiler. The Copy kernel is a simple code that transfer the data from array A to array B.

template <typename T>
__global__ void STREAM_Copy(T const * __restrict__ const a, T * __restrict__ const b, int len)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < len)
        b[idx] = a[idx];
}

Each array has 2^26 points of double precision, a total of 512MiB reads and writes. As a result, I got from the Visual Profiler the follow

Unified Cache
--------------------------------------------+
              | Transactions |    Bandwidth |
--------------+--------------+--------------+
Global Loads  |     67108866 | 271.383 GB/s |
--------------+--------------+--------------+
Global Stores |     16777216 | 271.383 GB/s |
--------------+--------------+--------------+
Texture Reads |     16777216 | 271.383 GB/s |
--------------+--------------+--------------+
Unified Total |   100663298  | 814.149 GB/s |
--------------+--------------+--------------+

What I want to know is:

  1. What is the relationship between global loads and texture reads?
  2. Why I always have the follow result: global_load_transactions = 4 * texture_read_transactions + 2? (this is not my only application)
  3. If there is a close relationship between global loads and texture reads, why the total bandwidth is the sum of both?

I’m asking those questions because the Unified L1/Texture memory bandwidth is the bottleneck of my main application and it requires optimization. The Visual Profiler is even saying that “For this kernel the limiting factor in the memory system is the bandwidth of the Texture Memory”, which leads me to my final question: there is a limit for the texture memory bandwidth smaller then the Unified total bandwidth? If that’s so, what is the limit?

Thanks!

What is the type T used in your specific benchmark?

Christian

The type T is double and each array has 2^26 points, which gives a size of 512MiB for each array. Pay attention at the number of transactions:

Global Loads:  67108866 (2^26 +2)
Global Stores: 16777216 (2^24)
Tecture Reads: 16777216 (2^24)

If we suppose that the Texture Reads returns 32B, it is giving us 512MiB as we expected, but it doesn’t explains the value of Global Reads. If we suppose that each Global Reads returns a double, it gives us 512MiB but we cannot explain the Texture Read transactions.

This is a bit foggy but from the top of my head:

The profilers way of reporting # of transactions an be a little confusing, they are not always the same size and as I can remember the documentation is a bit thin.

The doc states that:
The L2 services cache lines of 128B at a time while the L2->L1/Tex works at 32Bytes.

But as you’ve noted: :
To read 512 MiB of double values into the L2 cache you have exactly 67108866 8-byte reads to make (51210241024/67108866 = 8 bytes / transaction).

Now in the case of “on-chip” transactions the profile appears to count the number of 32-byte transactions:

32-bytes/trans * 16777216 transactions => 512 MiB.