Memory transaction size

Hello,

I am trying to understand the effects of L1 and L2 caching in CUDA, and I am having trouble matching my profiling results with the statements in the programming guide. Specifically, I cannot seem to produce results consistent with the 128-byte memory transactions that should be used when L1 caching of global memory accesses is enabled (via ‘-Xptxas -dlcm=ca’).

For profiling, I use the following (obviously inefficient) code:

#define NUM_ELEMENTS (1000000)

__device__ int array[NUM_ELEMENTS];
__device__ int sum = 0;

// A single thread sums the array sequentially: one 4-byte global load per element.
__global__ void kernel_forward()
{
  int s = 0;
  for(int i = 0; i < NUM_ELEMENTS; i++)
    s += array[i];
  sum = s;
}

int main(int argc, char **argv)
{
  kernel_forward<<<1, 1>>>();
  cudaDeviceSynchronize();

  return 0;
}
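
For reference, I build and profile it roughly as follows (the file name and -arch value are just what I use for the GTX 950 mentioned further down; the metrics are the ones shown in the output below):

nvcc -arch=sm_52 -Xptxas -dlcm=ca cache_test.cu -o cache_test
nvprof --metrics l2_tex_read_hit_rate,tex_cache_hit_rate,global_hit_rate,dram_read_transactions,l2_read_transactions,l2_tex_read_transactions ./cache_test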

The results with L2-only caching (‘-Xptxas -dlcm=cg’ or no such argument) are as one might expect, i.e., a cache miss for every 8th 4-byte word:

1                      l2_tex_read_hit_rate                   L2 Hit Rate (Texture Reads)      87.50%      87.50%      87.50%
1                        tex_cache_hit_rate                        Unified Cache Hit Rate       0.00%       0.00%       0.00%
1                           global_hit_rate                               Global Hit Rate       0.00%       0.00%       0.00%
1                    dram_read_transactions               Device Memory Read Transactions      645084      645084      645084
1                      l2_read_transactions                          L2 Read Transactions     1535211     1535211     1535211
1                  l2_tex_read_transactions               L2 Transactions (Texture Reads)     1000000     1000000     1000000

With ‘-Xptxas -dlcm=ca’, I would have expected an L1 hit rate of (1 - 1/32) = 96.875%, based on this statement in the programming guide (written for compute capability 2.x, but referenced in the descriptions of subsequent compute capabilities as well):

“A cache line is 128 bytes and maps to a 128 byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions.”
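
To spell out the arithmetic behind that expectation: a sequential scan of 4-byte ints should miss once per cache fill, so a 32-byte fill gives a 7/8 = 87.5% hit rate and a 128-byte fill gives a 31/32 = 96.875% hit rate. A trivial host-side sketch of that calculation (nothing CUDA-specific, just my back-of-the-envelope check):

#include <stdio.h>

/* Expected hit rate for a sequential scan of 4-byte ints, assuming one
   cache fill of 'fill_bytes' per miss and no reuse across fills. */
static double expected_hit_rate(int fill_bytes)
{
  int ints_per_fill = fill_bytes / 4;
  return 1.0 - 1.0 / ints_per_fill;
}

int main(void)
{
  printf("32-byte fills:  %.3f%%\n", 100.0 * expected_hit_rate(32));   /* 87.500 */
  printf("128-byte fills: %.3f%%\n", 100.0 * expected_hit_rate(128));  /* 96.875 */
  return 0;
}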

However, the results indicate that although cache hits moved to L1, there is still a cache miss for every 8th element:

1                      l2_tex_read_hit_rate                   L2 Hit Rate (Texture Reads)       0.00%       0.00%       0.00%
1                        tex_cache_hit_rate                        Unified Cache Hit Rate      87.50%      87.50%      87.50%
1                           global_hit_rate                               Global Hit Rate      87.50%      87.50%      87.50%
1                    dram_read_transactions               Device Memory Read Transactions      603701      603701      603701
1                      l2_read_transactions                          L2 Read Transactions      608022      608022      608022
1                  l2_tex_read_transactions               L2 Transactions (Texture Reads)      125000      125000      125000

The above results were generated on a GTX 950. However, a GTX 850 Ti and a GTX 1050 showed the same cache hit rates. Am I misinterpreting the results or conducting the wrong experiment, or is the memory transaction size actually always 32 bytes, regardless of the cache configuration?

Thanks for any hints,
PA

The actual behavior here varies from one GPU architecture to another, according to my observations.

GPUs of the Maxwell and Pascal families have a unified texture/L1 cache. AFAIK the texture path has always had a 32-byte granularity, effectively (if for no other reason) because it still flows through the L2, which always has a 32-byte granularity.

Prior architectures (Fermi, Kepler) had a separate dedicated L1 (physically carved out of the same storage as shared memory), and for those architectures an L1 miss always triggered 128 bytes of downstream traffic.

But an L1 miss on Maxwell/Pascal may trigger only 32 bytes of downstream traffic, depending on the actual global access pattern.
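
If you want to probe that yourself, a kernel pair along these lines is one way to do it (a rough, untuned sketch; the array size, names, and launch configuration are arbitrary). Compile with ‘-Xptxas -dlcm=ca’ and compare l2_read_transactions for the two kernels: with 32-byte fills, the strided warp should generate roughly one 32-byte sector per load, whereas 128-byte fills would pull four sectors per load.

#define N (1 << 22)

__device__ int buf[N];
__device__ int sink;

// All 32 threads of the warp read adjacent ints each iteration,
// i.e., one fully coalesced 128-byte request per warp per iteration.
__global__ void coalesced()
{
  int s = 0;
  for (int i = threadIdx.x; i < N; i += 32)
    s += buf[i];
  if (s == -1) sink = s;   // keep the loads from being optimized away
}

// Each thread reads ints 128 bytes apart from its neighbors, so every
// thread in the warp touches a different 128-byte line each iteration.
__global__ void strided()
{
  int s = 0;
  for (int i = threadIdx.x * 32; i < N; i += 32 * 32)
    s += buf[i];
  if (s == -1) sink = s;
}

int main()
{
  coalesced<<<1, 32>>>();
  strided<<<1, 32>>>();
  cudaDeviceSynchronize();
  return 0;
}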