Pascal L1 cache

From the Programming Guide:

"Kepler serviced loads at a granularity of 128B when L1 caching of global loads was enabled and 32B otherwise. On Pascal the data access unit is 32B regardless of whether global loads are cached in L1. So it is no longer necessary to turn off L1 caching in order to reduce wasted global memory transactions associated with uncoalesced accesses. "

Maxwell had a 128-byte L1 cache line and a 32-byte L2 line, right?

Does this mean that on Pascal the L1 gives only a latency advantage for data access, while bandwidth remains the same as L2?

L1/TEX and L2 have 128B cache lines. Cache lines consist of 4 32B sectors. The tag lookup is at 128B granularity. A miss does not imply that all sectors in the cache line will be filled.

Thanks Greg. So does the cache service requests at the granularity of a sector or of a whole cache line? I mean, can both caches return 128 bytes for a single transaction?

On Maxwell I was getting higher bandwidth from L1 compared to L2. Does this change on Pascal?

Greg, does this mean that

  1. one cache line, 128 bytes long, may contain data from only one 128-byte memory line
  2. it’s split into 4 sectors, and each sector may or may not be filled
  3. memory transactions are 32 bytes long, and only the 32-byte sectors that were actually requested are read from memory
  4. write transactions are also 32-byte granular, so the entire 128-byte line isn’t updated when only a single byte changed
  5. this holds for all Pascal SMs - 6.0, 6.1, and 6.2?

Adding to that – if I do have a coherent access pattern, is there any kind of pre-fetching of the additional sectors of a cache line? Or does each cache line act as a 32-byte “direct mapped” cache?

The 128B L1/TEX and L2 cache line is composed of 4 32B sectors. In Maxwell/Pascal L1/TEX there is no sector promotion so only the 32B sectors that are addressed by threads are fetched. The details of the L2 cache are not documented and may change between architectures.

  1. correct
  2. correct
  3. correct
  4. Kepler through Pascal L1/TEX is write-through. There are byte masks: the write granularity is 32B and each write carries a 32-bit byte mask (see the sketch after this list).
  5. yes
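
To make point 4 concrete, here is a sketch (my own, not from Greg's post) of a scattered store and the traffic the write-through behaviour implies:

    // Hypothetical scattered store: each thread writes one 4-byte float with
    // a 128-byte stride, so every store lands in a different 128B line.
    // With a write-through L1/TEX (Kepler through Pascal), each store should
    // be forwarded to L2 as a 32B write transaction whose 32-bit byte mask
    // has only 4 bits set; the untouched bytes of the line are not written.
    __global__ void scattered_store(float* out, float value)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i * 32] = value;   // 32 floats = 128-byte stride
    }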

Greg, thank you so much!

Snarky, GPUs (unlike CPUs) don’t perform hardware (automatic) prefetching; they only read the data that you explicitly request.

I suppose when the programming model has zillions of runnable threads and is designed to be tolerant of high latency, it makes sense to not speculate on pre-fetching.

That means that the effective cache line size for L1 is 32B, then, right? While there may only be tag bits for 128B cache lines, the cache hit/miss is determined based on the 32B line, which means that each 128B line is really a direct-mapped cache of 4 LUs?

I thought Maxwell L1 can fetch 128 bytes if a warp requests it. For example a float read by a warp is SIMD32 * 4 = 128 bytes and L1 can service an LDG.CI request with one transaction.

Ah! So, if the request is bigger than 32B, the cache line will fetch as much as is needed for the request?
That makes sense.

Well, this was my understanding on Maxwell. But referring to my original question… the programming guide says Pascal L1 services loads in 32-byte units, so the same request would result in 4 transactions…

@Greg, can you please confirm whether the L1 changes between Maxwell and Pascal?

GM10x, GM20x, and GP10x have very similar TEX/L1 designs. Starting with GM20x TEX/L1 caching of global memory loads can be enabled on non-constant data. See http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-5-x.

For all Maxwell - Pascal the L1/TEX cache line size is 128B consisting of 4 32B sectors. On a load miss only the 32B sectors in the cache line that were accessed are loaded from L2.

The TEX/L1 cache will issue however many 32B requests are required to satisfy all threads. Additional sectors in the cache line that are not accessed will not be pre-fetched. The CUDA profilers have enough TEX and L2 counters to write a quick test showing this behavior.
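
For example, a quick test along these lines (my own sketch; gld_transactions and l2_tex_read_transactions are the nvprof metrics I would start with) should show that untouched sectors are not fetched:

    // Hypothetical quick test: each thread touches sectors 0 and 2 of "its"
    // 128-byte line (two 4-byte loads, 64 bytes apart). With no sector
    // promotion the L1 should request 2 x 32B per line from L2; if the full
    // line were fetched it would be 4 x 32B per line.
    __global__ void two_of_four_sectors(const float* __restrict__ in, float* out)
    {
        int i    = blockIdx.x * blockDim.x + threadIdx.x;
        int base = i * 32;                 // one 128-byte line per thread
        out[i] = in[base] + in[base + 16]; // +16 floats = +64 bytes (sector 2)
    }
    // e.g. profile with:
    //   nvprof --metrics gld_transactions,l2_tex_read_transactions ./a.out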

@Greg, does the L1/L2 sectoring still apply in the new Volta architecture as it does on Pascal? i.e. L1 and L2 have 128B cache lines, cache lines consist of 4 32B sectors, and the tag lookup is at 128B granularity?

The Volta L1 data cache has 128 byte cache lines divided into 4 sectors. For local and global accesses the tag stage can compare all 32 threads at a time. The tag stage can look up 4 tags per cycle resolving a maximum of 16 sectors (4 tags x 4 sectors). On miss the cache will only fetch the unique 32 byte sectors that missed. The full cache line is not automatically fetched from L2.

The Maxwell/Pascal L1 data cache had similar tag-stage performance, but local and global instructions were broken into multiple requests prior to the tag lookup (see the sketch after this list):

  • <=32-bit 8 threads/request
  • 64-bit 4 threads/request
  • 128-bit 2 threads/request
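
To make the request splitting concrete, here is a small sketch (my own reading of the numbers above, not measured code) with the implied per-warp request counts:

    // Illustration of requests per warp implied by Greg's numbers (assumed,
    // not measured here).
    __global__ void request_split_demo(const float*  __restrict__ a,
                                       const float4* __restrict__ b,
                                       float* out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // 32-bit load: Maxwell/Pascal L1/TEX splits the warp into 4 requests
        // of 8 threads each before the tag lookup; Volta can compare all 32
        // threads in one tag-stage pass.
        float  x = a[i];

        // 128-bit load: Maxwell/Pascal splits the warp into 16 requests of
        // 2 threads each.
        float4 v = b[i];

        out[i] = x + v.x + v.y + v.z + v.w;
    }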

Thanks Greg. What about the L2 cache in Volta? Is it still a 128B cache line with 32B sectors? I ran a small test on Volta using nvprof to test the L2 cache, and I found that the access granularity of the L2 is 64B, not 32B as in Pascal! So, is the Volta L2 a 128B cache line with 2 64B sectors, or is it a non-sectored 64B cache line?

Volta L2 is a similar design to Kepler through Pascal: a 128-byte cache line with 4 x 32-byte sectors. Can you explain how you came to the conclusion that L2 has 64-byte sectors?

I wrote a quick kernel where each thread performs 25,600 4-byte reads. Threads stride by 64 bytes (I also tried 128 bytes), so each thread always accesses a unique 64-byte (or 128-byte) range on each load.

In both cases the total L2 read bytes was threads * 25,600 * 32 bytes, as expected for 32-byte sectors.
If the L2 sector size were 64 bytes, then the L2 read bytes should have been threads * 25,600 * 64 bytes.
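
For reference, a minimal sketch of that kind of kernel (my reconstruction; the exact indexing, launch configuration, and buffer size are assumptions, not Greg's actual code):

    // Each thread performs 25,600 4-byte reads, each to a fresh 64-byte-
    // aligned offset (stride of 16 floats = 64 bytes), so every load should
    // request exactly one 32B sector from L2.
    __global__ void l2_sector_test(const float* __restrict__ in, float* out,
                                   size_t n_floats)
    {
        size_t tid     = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        size_t threads = gridDim.x * (size_t)blockDim.x;
        float acc = 0.0f;
        for (int k = 0; k < 25600; ++k) {
            size_t idx = ((size_t)k * threads + tid) * 16;  // 64-byte stride
            acc += in[idx % n_floats];  // wrap to stay in bounds (assumption)
        }
        out[tid] = acc;  // keep the loads from being optimized away
    }

With 32-byte sectors the expected L2 read traffic is threads * 25,600 * 32 bytes, matching the numbers above.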

If you have a reproducer, please post the driver version, CUDA toolkit version, and a minimal reproducer; nvprof may have a bug. I used the latest version of Nsight Compute to collect the counters.

Hi Greg, yes, correct: if you run 327680 4-byte reads with a thread stride of 64 bytes, you will get 327680 L2 reads, so “l2_tex_read_transactions” from nvprof = 327680. However, please check “dram_read_transactions”: you will find this number is approximately 655360, i.e. 2x more read transactions than at L2. This means that the core-to-L2 access granularity is 32 bytes, but the L2-to-DRAM access granularity is 64 bytes. This implies that the L2 has either 64-byte sectors or a 64-byte line. I remember this was not the case on the P100; on the P100, l2_tex_read_transactions = dram_read_transactions.

I am using the Quadro V100 (QV100), with CUDA 9.1 and driver version 430.14. I used the nvprof command line to collect the statistics, not Nsight Compute.

The simple kernel I used was like this:

    __global__ void kernel(float* A, float* B)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        B[idx * 64] = A[idx * 64];
    }

I collected the statistics using nvprof:
nvprof --concurrent-kernels off --print-gpu-trace -u us --metrics all --demangling off --csv --log-file data.csv ./a.out

Sorry, there is a typo in the kernel code above: it should be idx * 16, not idx * 64, so we have a 64-byte stride (16 * 4-byte floats).

Also, I used CUDA 10.1 and the results are the same.
Note that I am using the Quadro V100 card, not the Tesla V100.

Hi Greg, is there any explanation for this odd behavior on Volta? I tested this on the Pascal P100 and the Volta TITAN V as well, and it seems the 64B DRAM transaction is also there on the TITAN V, but it did not exist on Pascal.
It is very important to know how the L2 cache works for strided and irregular accesses on Volta in order to optimize and tune irregular workloads to the underlying hardware.

I can confirm with a quick test your observation that an L2 load miss to 1 sector (in a 4 x 32-byte sector cache line) for data in device memory causes a 2-sector (64-byte) request to the HBM2 memory controller. The test showed the data to be either the lower 2 32-byte sectors or the upper 2 32-byte sectors. I have not run this test on other GPUs where the memory type used may also benefit from 64-byte requests (e.g. GDDR5X on GP10x).

The GV100 L2 maintains the same 128 byte cache line with 4 x 32 byte sectors as previous NVIDIA GPU architectures. My guess is that the change to 64 bytes on L2 miss to device memory is due to the HBM2 interface. I have not had time to test pinned system memory or peer memory accesses via L2 but I believe these would maintain the 32 byte request to the final destination.