Hello,
I am trying to understand the effects of L1 and L2 caching in CUDA and am having trouble matching my profiling results with the statements in the programming guide. Specifically, I cannot produce results consistent with the 128-byte memory transactions that should be issued when L1 caching of global memory accesses is enabled (using ‘-Xptxas -dlcm=ca’).
For profiling, I use the following (obviously inefficient) code:
#define NUM_ELEMENTS (1000000)

__device__ int array[NUM_ELEMENTS];
__device__ int sum = 0;

__global__ void kernel_forward()
{
    int s = 0;
    for (int i = 0; i < NUM_ELEMENTS; i++)
        s += array[i];
    sum = s;
}

int main(int argc, char **argv)
{
    kernel_forward<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
The results with L2-only caching (‘-Xptxas -dlcm=cg’ or no such argument) are as one might expect, i.e., a cache miss for every 8th 4-byte word:
Invocations  Metric Name                 Metric Description                    Min      Max      Avg
1            l2_tex_read_hit_rate        L2 Hit Rate (Texture Reads)        87.50%   87.50%   87.50%
1            tex_cache_hit_rate          Unified Cache Hit Rate              0.00%    0.00%    0.00%
1            global_hit_rate             Global Hit Rate                     0.00%    0.00%    0.00%
1            dram_read_transactions      Device Memory Read Transactions    645084   645084   645084
1            l2_read_transactions        L2 Read Transactions              1535211  1535211  1535211
1            l2_tex_read_transactions    L2 Transactions (Texture Reads)   1000000  1000000  1000000
With ‘-Xptxas -dlcm=ca’, I would have expected an L1 hit rate of (1 - 1/32) = 96.875%, i.e., one miss per 128-byte line of 32 four-byte words, based on this statement in the programming guide, which refers to CC 2.x but is referenced from the sections on subsequent CCs as well:
“A cache line is 128 bytes and maps to a 128 byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions.”
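The arithmetic behind that expectation, written out (assuming one line fill per miss and one counted request per 4-byte load):

```python
# Expected hit rate for the sequential 4-byte reads in the kernel above,
# as a function of the cache line / transaction size.
NUM_ELEMENTS = 1_000_000
BYTES_PER_LOAD = 4  # sizeof(int)

def expected_hit_rate(line_bytes):
    loads_per_line = line_bytes // BYTES_PER_LOAD
    misses = NUM_ELEMENTS // loads_per_line  # one miss per line touched
    return 1 - misses / NUM_ELEMENTS

print(expected_hit_rate(128))  # 0.96875 -> what 128-byte lines would give
print(expected_hit_rate(32))   # 0.875   -> what the profiler actually reports
```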
However, the results indicate that although cache hits moved to L1, there is still a cache miss for every 8th element:
Invocations  Metric Name                 Metric Description                    Min      Max      Avg
1            l2_tex_read_hit_rate        L2 Hit Rate (Texture Reads)         0.00%    0.00%    0.00%
1            tex_cache_hit_rate          Unified Cache Hit Rate             87.50%   87.50%   87.50%
1            global_hit_rate             Global Hit Rate                    87.50%   87.50%   87.50%
1            dram_read_transactions      Device Memory Read Transactions    603701   603701   603701
1            l2_read_transactions        L2 Read Transactions               608022   608022   608022
1            l2_tex_read_transactions    L2 Transactions (Texture Reads)    125000   125000   125000
The above results were generated on a GTX 950; a GTX 850 Ti and a GTX 1050 showed the same cache hit rates. Am I misinterpreting the results, conducting the wrong experiment, or is the memory transaction size actually always 32 bytes, regardless of the cache configuration?
Thanks for any hints,
PA