Pascal L1 cache

From the Programming Guide:

"Kepler serviced loads at a granularity of 128B when L1 caching of global loads was enabled and 32B otherwise. On Pascal the data access unit is 32B regardless of whether global loads are cached in L1. So it is no longer necessary to turn off L1 caching in order to reduce wasted global memory transactions associated with uncoalesced accesses. "

Maxwell had a 128-byte L1 cache line and a 32-byte L2 line, right?

Does this mean that on Pascal the L1 gives only a latency advantage for data access, while bandwidth remains the same as L2?

L1/TEX and L2 have 128B cache lines. Cache lines consist of 4 32B sectors. The tag lookup is at 128B granularity. A miss does not imply that all sectors in the cache line will be filled.

Thanks Greg. So does the cache service requests at the granularity of a sector or of a whole cache line? I mean, can both caches return 128 bytes for a single transaction?

On Maxwell I was getting higher bandwidth from L1 compared to L2. Does this change on Pascal?

Greg, does this mean that

  1. one cache line, 128 bytes long, may contain data from only one 128-byte memory line
  2. it’s split into 4 sectors, and each sector may or may not be filled
  3. memory transactions are 32 bytes long, and only the 32-byte sectors that were actually requested are read from memory
  4. write transactions are also 32-byte granular, so the entire 128-byte line isn’t updated when only a single byte changed
  5. this holds for all Pascal SMs - 6.0, 6.1, and 6.2?

Adding to that – if I do have a coherent access pattern, is there any kind of pre-fetching of the additional sectors of a cache line? Or does each cache line act as a 32-byte “direct mapped” cache?

The 128B L1/TEX and L2 cache line is composed of 4 32B sectors. In Maxwell/Pascal L1/TEX there is no sector promotion so only the 32B sectors that are addressed by threads are fetched. The details of the L2 cache are not documented and may change between architectures.

  1. correct
  2. correct
  3. correct
  4. Kepler through Pascal L1/TEX is write-through. There are byte masks: the write granularity is 32B and each write carries a 32-bit byte mask (see the sketch after this list).
  5. yes
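
To make point 4 concrete, here is a sketch (my own, not from Greg's post) of a scattered store and the traffic the write-through behaviour implies:

    // Hypothetical scattered store: each thread writes one 4-byte float with
    // a 128-byte stride, so every store lands in a different 128B line.
    // With a write-through L1/TEX (Kepler through Pascal), each store should
    // be forwarded to L2 as a 32B write transaction whose 32-bit byte mask
    // has only 4 bits set; the untouched bytes of the line are not written.
    __global__ void scattered_store(float* out, float value)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i * 32] = value;   // 32 floats = 128-byte stride
    }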

Greg, thank you so much!

Snarky, GPUs (unlike CPUs) don’t perform hardware (automatic) prefetching; they only read the data that you explicitly request.

I suppose when the programming model has zillions of runnable threads and is designed to be tolerant of high latency, it makes sense to not speculate on pre-fetching.

That means that the effective cache line size for L1 is 32B, then, right? While there may only be tag bits for 128B cache lines, the cache hit/miss is determined based on the 32B line, which means that each 128B line is really a direct-mapped cache of 4 LUs?

I thought Maxwell L1 can fetch 128 bytes if a warp requests it. For example a float read by a warp is SIMD32 * 4 = 128 bytes and L1 can service an LDG.CI request with one transaction.

Ah! So, if the request is bigger than 32B, the cache line will fetch as much as is needed for the request?
That makes sense.

Well, this was my understanding on Maxwell. But referring to my original question… the programming guide says Pascal L1 services loads in 32-byte units, so the same request would result in 4 transactions…

@Greg, can you please confirm whether the L1 changes between Maxwell and Pascal?

GM10x, GM20x, and GP10x have very similar TEX/L1 designs. Starting with GM20x TEX/L1 caching of global memory loads can be enabled on non-constant data. See http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-5-x.

For all Maxwell - Pascal the L1/TEX cache line size is 128B consisting of 4 32B sectors. On a load miss only the 32B sectors in the cache line that were accessed are loaded from L2.

The TEX/L1 cache will issue however many 32B requests are required to satisfy all threads. Additional sectors in the cache line that are not accessed will not be pre-fetched. The CUDA profilers have enough TEX and L2 counters to write a quick test showing this behavior.
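
For example, a quick test along these lines (my own sketch; gld_transactions and l2_tex_read_transactions are the nvprof metrics I would start with) should show that untouched sectors are not fetched:

    // Hypothetical quick test: each thread touches sectors 0 and 2 of "its"
    // 128-byte line (two 4-byte loads, 64 bytes apart). With no sector
    // promotion the L1 should request 2 x 32B per line from L2; if the full
    // line were fetched it would be 4 x 32B per line.
    __global__ void two_of_four_sectors(const float* __restrict__ in, float* out)
    {
        int i    = blockIdx.x * blockDim.x + threadIdx.x;
        int base = i * 32;                 // one 128-byte line per thread
        out[i] = in[base] + in[base + 16]; // +16 floats = +64 bytes (sector 2)
    }
    // e.g. profile with:
    //   nvprof --metrics gld_transactions,l2_tex_read_transactions ./a.out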

@Greg, does the L1/L2 sectoring still apply in the new Volta architecture as it does on Pascal? i.e. L1 and L2 have 128B cache lines, cache lines consist of 4 32B sectors, and the tag lookup is at 128B granularity?

The Volta L1 data cache has 128 byte cache lines divided into 4 sectors. For local and global accesses the tag stage can compare all 32 threads at a time. The tag stage can look up 4 tags per cycle resolving a maximum of 16 sectors (4 tags x 4 sectors). On miss the cache will only fetch the unique 32 byte sectors that missed. The full cache line is not automatically fetched from L2.

The Maxwell/Pascal L1 data cache had similar tag-stage performance, but local and global instructions were broken into multiple requests prior to the tag lookup (see the sketch after this list):

  • <=32-bit 8 threads/request
  • 64-bit 4 threads/request
  • 128-bit 2 threads/request
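
To make the request splitting concrete, here is a small sketch (my own reading of the numbers above, not measured code) with the implied per-warp request counts:

    // Illustration of requests per warp implied by Greg's numbers (assumed,
    // not measured here).
    __global__ void request_split_demo(const float*  __restrict__ a,
                                       const float4* __restrict__ b,
                                       float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // 32-bit load: Maxwell/Pascal L1/TEX splits the warp into 4 requests
        // of 8 threads each before the tag lookup; Volta can compare all 32
        // threads in one tag-stage pass.
        float  x = a[i];

        // 128-bit load: Maxwell/Pascal splits the warp into 16 requests of
        // 2 threads each.
        float4 v = b[i];

        out[i] = x + v.x + v.y + v.z + v.w;
    }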

Thanks Greg. What about the L2 cache in Volta? Is it still a 128B cache line with 32B sectors? I ran a small test on Volta using nvprof to test the L2 cache, and I found that the access granularity of the L2 is 64B, not 32B as in Pascal! So, is the Volta L2 a 128B cache line with 2 64B sectors, or is it a non-sectored 64B cache line?

Volta L2 is a similar design to Kepler through Pascal: a 128-byte cache line with 4 x 32-byte sectors. Can you explain how you came to the conclusion that L2 has 64-byte sectors?

I wrote a quick kernel where each thread performs 25,600 4-byte reads. Threads stride by 64 bytes (I also tried 128 bytes), so each thread always accesses a unique 64-byte (or 128-byte) range on each load.

In both cases the total L2 read bytes was threads * 25,600 * 32 bytes, as expected for 32-byte sectors.
If the L2 sector size were 64 bytes, then the L2 read bytes should have been threads * 25,600 * 64 bytes.
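
For reference, a minimal sketch of that kind of kernel (my reconstruction; the exact indexing, launch configuration, and buffer size are assumptions, not Greg's actual code):

    // Each thread performs 25,600 4-byte reads, each to a fresh 64-byte-
    // aligned offset (stride of 16 floats = 64 bytes), so every load should
    // request exactly one 32B sector from L2.
    __global__ void l2_sector_test(const float* __restrict__ in, float* out,
                                   size_t n_floats)
    {
        size_t tid     = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        size_t threads = gridDim.x * (size_t)blockDim.x;
        float acc = 0.0f;
        for (int k = 0; k < 25600; ++k) {
            size_t idx = ((size_t)k * threads + tid) * 16;  // 64-byte stride
            acc += in[idx % n_floats];  // wrap to stay in bounds (assumption)
        }
        out[tid] = acc;  // keep the loads from being optimized away
    }

With 32-byte sectors the expected L2 read traffic is threads * 25,600 * 32 bytes, matching the numbers above.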

If you have a reproducer, please post the driver version, CUDA toolkit version, and a minimal reproducer; nvprof may have a bug. I used the latest version of Nsight Compute to collect the counters.

Hi Greg, yes, correct: if you run 327680 4-byte reads with a thread stride of 64 bytes, you will get 327680 L2 reads, so “l2_tex_read_transactions” from nvprof = 327680. However, please check “dram_read_transactions”: you will find this number is approximately 655360, i.e. 2x more read transactions than at L2. This means that the core-to-L2 access granularity is 32 bytes, but the L2-to-DRAM access granularity is 64 bytes. This implies that the L2 has either 64-byte sectors or a 64-byte line. I remember this was not the case on the P100; on the P100, l2_tex_read_transactions = dram_read_transactions.

I am using the Quadro V100 (QV100), with CUDA 9.1 and driver version 430.14. I used the nvprof command line to collect the statistics, not Nsight Compute.

The simple kernel I used was like this:

    __global__ void kernel(float* A, float* B)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        B[idx * 64] = A[idx * 64];
    }

I collected the statistics using nvprof:
nvprof --concurrent-kernels off --print-gpu-trace -u us --metrics all --demangling off --csv --log-file data.csv ./a.out

Sorry, there is a typo in the kernel code above: it should be idx * 16, not idx * 64, so we have a 64-byte stride (16 * 4-byte floats).

Also, I used CUDA 10.1 and the results are the same.
Note that I am using the Quadro V100 card, not the Tesla V100.

Hi Greg, is there any explanation for this odd behavior on Volta? I tested this on the Pascal P100 and the Volta TITAN V as well, and it seems the 64B DRAM transaction is also there on the TITAN V, but it did not exist on Pascal.
It is very important to know how the L2 cache works for strided and irregular accesses on Volta in order to optimize and tune irregular workloads to the underlying hardware.

I can confirm with a quick test your observation that an L2 load miss to 1 sector (in a 4 x 32-byte sector cache line) for data in device memory causes a 2-sector (64-byte) request to the HBM2 memory controller. The test showed the data to be either the lower 2 32-byte sectors or the upper 2 32-byte sectors. I have not run this test on other GPUs where the memory type used may also benefit from 64-byte requests (e.g. GDDR5X on GP10x).

The GV100 L2 maintains the same 128 byte cache line with 4 x 32 byte sectors as previous NVIDIA GPU architectures. My guess is that the change to 64 bytes on L2 miss to device memory is due to the HBM2 interface. I have not had time to test pinned system memory or peer memory accesses via L2 but I believe these would maintain the 32 byte request to the final destination.