Memory transaction size

Hello,

I am trying to understand the effects of L1 and L2 caching in CUDA, and I am having trouble matching my profiling results with the statements in the programming guide. Specifically, I cannot seem to produce results consistent with the 128-byte memory transactions that should be used when L1 caching of global memory accesses is enabled (via ‘-Xptxas -dlcm=ca’).

For profiling, I use the following (obviously inefficient) code:

#define NUM_ELEMENTS (1000000)

__device__ int array[NUM_ELEMENTS];
__device__ int sum = 0;

// A single thread sums the array sequentially: one 4-byte global load per element.
__global__ void kernel_forward()
{
  int s = 0;
  for(int i = 0; i < NUM_ELEMENTS; i++)
    s += array[i];
  sum = s;
}

int main(int argc, char **argv)
{
  kernel_forward<<<1, 1>>>();
  cudaDeviceSynchronize();

  return 0;
}
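
For reference, I build and profile it roughly as follows (the file name and -arch value are just what I use for the GTX 950 mentioned further down; the metrics are the ones shown in the output below):

nvcc -arch=sm_52 -Xptxas -dlcm=ca cache_test.cu -o cache_test
nvprof --metrics l2_tex_read_hit_rate,tex_cache_hit_rate,global_hit_rate,dram_read_transactions,l2_read_transactions,l2_tex_read_transactions ./cache_test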

The results with L2-only caching (‘-Xptxas -dlcm=cg’ or no such argument) are as one might expect, i.e., a cache miss for every 8th 4-byte word:

1                      l2_tex_read_hit_rate                   L2 Hit Rate (Texture Reads)      87.50%      87.50%      87.50%
1                        tex_cache_hit_rate                        Unified Cache Hit Rate       0.00%       0.00%       0.00%
1                           global_hit_rate                               Global Hit Rate       0.00%       0.00%       0.00%
1                    dram_read_transactions               Device Memory Read Transactions      645084      645084      645084
1                      l2_read_transactions                          L2 Read Transactions     1535211     1535211     1535211
1                  l2_tex_read_transactions               L2 Transactions (Texture Reads)     1000000     1000000     1000000

With ‘-Xptxas -dlcm=ca’, I would have expected an L1 hit rate of (1 - 1/32) = 96.875%, based on this statement in the programming guide (written for compute capability 2.x, but referenced in the descriptions of subsequent compute capabilities as well):

“A cache line is 128 bytes and maps to a 128 byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions.”
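
To spell out the arithmetic behind that expectation: a sequential scan of 4-byte ints should miss once per cache fill, so a 32-byte fill gives a 7/8 = 87.5% hit rate and a 128-byte fill gives a 31/32 = 96.875% hit rate. A trivial host-side sketch of that calculation (nothing CUDA-specific, just my back-of-the-envelope check):

#include <stdio.h>

/* Expected hit rate for a sequential scan of 4-byte ints, assuming one
   cache fill of 'fill_bytes' per miss and no reuse across fills. */
static double expected_hit_rate(int fill_bytes)
{
  int ints_per_fill = fill_bytes / 4;
  return 1.0 - 1.0 / ints_per_fill;
}

int main(void)
{
  printf("32-byte fills:  %.3f%%\n", 100.0 * expected_hit_rate(32));   /* 87.500 */
  printf("128-byte fills: %.3f%%\n", 100.0 * expected_hit_rate(128));  /* 96.875 */
  return 0;
}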

However, the results indicate that although cache hits moved to L1, there is still a cache miss for every 8th element:

1                      l2_tex_read_hit_rate                   L2 Hit Rate (Texture Reads)       0.00%       0.00%       0.00%
1                        tex_cache_hit_rate                        Unified Cache Hit Rate      87.50%      87.50%      87.50%
1                           global_hit_rate                               Global Hit Rate      87.50%      87.50%      87.50%
1                    dram_read_transactions               Device Memory Read Transactions      603701      603701      603701
1                      l2_read_transactions                          L2 Read Transactions      608022      608022      608022
1                  l2_tex_read_transactions               L2 Transactions (Texture Reads)      125000      125000      125000

The above results were generated on a GTX 950. However, a GTX 850 Ti and a GTX 1050 showed the same cache hit rates. Am I misinterpreting the results or conducting the wrong experiment, or is the memory transaction size actually always 32 bytes, regardless of the cache configuration?

Thanks for any hints,
PA

The actual behavior here varies from one GPU architecture to another, according to my observations.

GPUs of the Maxwell and Pascal families have a unified texture/L1 cache. AFAIK the texture path has always had a 32-byte granularity, effectively (if for no other reason) because it still flows through the L2, which always has a 32-byte granularity.

Prior architectures (Fermi, Kepler) had a separate dedicated L1 (physically carved out of the same storage as shared memory), and for those architectures an L1 miss always triggered 128 bytes of downstream traffic.

But an L1 miss on Maxwell/Pascal may trigger only 32 bytes of downstream traffic, depending on the actual global access pattern.
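
If you want to probe that yourself, a kernel pair along these lines is one way to do it (a rough, untuned sketch; the array size, names, and launch configuration are arbitrary). Compile with ‘-Xptxas -dlcm=ca’ and compare l2_read_transactions for the two kernels: with 32-byte fills, the strided warp should generate roughly one 32-byte sector per load, whereas 128-byte fills would pull four sectors per load.

#define N (1 << 22)

__device__ int buf[N];
__device__ int sink;

// All 32 threads of the warp read adjacent ints each iteration,
// i.e., one fully coalesced 128-byte request per warp per iteration.
__global__ void coalesced()
{
  int s = 0;
  for (int i = threadIdx.x; i < N; i += 32)
    s += buf[i];
  if (s == -1) sink = s;   // keep the loads from being optimized away
}

// Each thread reads ints 128 bytes apart from its neighbors, so every
// thread in the warp touches a different 128-byte line each iteration.
__global__ void strided()
{
  int s = 0;
  for (int i = threadIdx.x * 32; i < N; i += 32 * 32)
    s += buf[i];
  if (s == -1) sink = s;
}

int main()
{
  coalesced<<<1, 32>>>();
  strided<<<1, 32>>>();
  cudaDeviceSynchronize();
  return 0;
}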