Hello everybody!
I’m currently working on a Tesla P100-SXM2-16GB and having a tough time understanding how the unified L1/texture cache works. I’ve run the CUDA STREAM benchmark and analyzed the Copy kernel with the Visual Profiler. The Copy kernel is a simple kernel that transfers the data from array A to array B:
template <typename T>
__global__ void STREAM_Copy(T const * __restrict__ const a, T * __restrict__ const b, int len)
{
    // One thread per element: copy a[idx] into b[idx].
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < len)
        b[idx] = a[idx];
}
Each array has 2^26 double-precision elements, so the kernel reads 512 MiB from A and writes 512 MiB to B. From the Visual Profiler I got the following:
Unified Cache
+---------------+--------------+--------------+
|               | Transactions | Bandwidth    |
+---------------+--------------+--------------+
| Global Loads  |     67108866 | 271.383 GB/s |
+---------------+--------------+--------------+
| Global Stores |     16777216 | 271.383 GB/s |
+---------------+--------------+--------------+
| Texture Reads |     16777216 | 271.383 GB/s |
+---------------+--------------+--------------+
| Unified Total |    100663298 | 814.149 GB/s |
+---------------+--------------+--------------+
What I want to know is:
- What is the relationship between global loads and texture reads?
- Why do I always get global_load_transactions = 4 * texture_read_transactions + 2? (This is not limited to this one application.)
- If global loads and texture reads are so closely related, why is the total bandwidth the sum of both?
I’m asking these questions because the unified L1/texture memory bandwidth is the bottleneck in my main application, and I need to optimize around it. The Visual Profiler even says: “For this kernel the limiting factor in the memory system is the bandwidth of the Texture Memory”, which leads to my final question: is there a limit on texture memory bandwidth that is smaller than the unified total bandwidth? If so, what is that limit?
Thanks!