I read the following article, which is about improving the overall performance of the GPU for applications with irregular access patterns by redesigning the L1 cache: https://dl.acm.org/doi/fullHtml/10.1145/3322127
In summary, Elastic-Cache/Plus works like this:
The cache line is divided into n logical chunks so that data from non-contiguous memory locations can be stored in each chunk. The cache architecture supports both fine- and coarse-grained cache line management, so either 32 bytes or 128 bytes can be accessed. The metadata associated with the chunks of a cache set is stored in shared memory across different banks to enable parallel access. The request queue of the Load/Store unit is divided into four separate queues, each issuing 32-byte requests in parallel; all four queues together are used to issue a single 128-byte request.
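To make that concrete, here is a rough data-structure sketch of how I understand the per-line bookkeeping. This is purely my own illustration; the names, field widths, and layout are assumptions, not taken from the paper:

```
// Conceptual sketch of one "elastic" L1 cache line, as I understand the paper.
// All names and field sizes are illustrative assumptions, not the paper's design.
#include <cstdint>

constexpr int LINE_BYTES   = 128;
constexpr int SECTOR_BYTES = 32;
constexpr int NUM_SECTORS  = LINE_BYTES / SECTOR_BYTES;   // the n logical chunks

struct ChunkMeta {
    uint64_t tag;      // each chunk can hold data from a different, non-contiguous address
    bool     valid;
    bool     dirty;
};

struct ElasticCacheLine {
    // In the paper this metadata lives in shared memory, spread across banks
    // so that the entries can be looked up in parallel.
    ChunkMeta meta[NUM_SECTORS];
    uint8_t   data[LINE_BYTES];   // 4 x 32-byte chunks of payload
};
```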
Such an L1 cache architecture should improve IPC (instructions per cycle) and reduce the cache miss rate, thereby improving the overall performance of applications with irregular access patterns.
My question is: did NVIDIA implement a similar technology, or will NVIDIA bring out such a technology in the future?
I won’t be able to answer the question in detail or provide forward-looking statements. (I’d be very surprised if you found a forward-looking authoritative statement about that anywhere on these forums, by anyone at NVIDIA, at any time.)
However, the L1 cache architecture of GPUs did go through a design transition, which is discoverable programmatically and somewhat documented, in the Maxwell and Pascal timeframe. Prior to that, the L1 cache line was fixed at 128 bytes, and an L1 miss would trigger 128 bytes of “requests”. At least by the Pascal generation, it is observable that the L1 cache line has 32-byte components. Depending on the nature of the L1 miss, it’s possible that an L1 miss might only trigger 32 bytes of “requests” to the next level of the memory hierarchy.
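There is no API I’m aware of to query this directly; it shows up in microbenchmarks and profiler counters. Just to illustrate the kind of access pattern where the 32-byte granularity matters, here is a contrived gather kernel (the kernel and array names are my own, not from any NVIDIA sample): each group of 8 threads reads one contiguous 32-byte chunk, but the chunks themselves are scattered. With 32-byte fill granularity, a miss only needs to bring in the chunk actually touched; with a monolithic 128-byte fill, each miss would also drag in 96 bytes that may never be used.

```
// Contrived gather kernel: each warp touches four scattered 32-byte chunks.
__global__ void gather32(const float* __restrict__ src,
                         const int*   __restrict__ groupIdx,  // scattered group indices (assumption)
                         float*       __restrict__ dst,
                         int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // 8 consecutive threads share one group -> one contiguous 32-byte read,
        // but the groups themselves are scattered across memory.
        int base = groupIdx[i / 8] * 8;
        dst[i] = src[base + (i % 8)];
    }
}
```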
I acknowledge this probably doesn’t address your questions.
A casual read of the paper suggests to me that, although it was written in 2019, it takes a very dated viewpoint:
the size of L1 cache line is no smaller than that of warp-wide accesses.
We use GPGPU-Sim (version 3.2.2) [1, 17] for evaluation. We configure the simulator to model a GPU similar to NVIDIA’s GTX480:
Yes, that might be true. I have another question: in the paper it is written that shared memory is mostly unused. Is that true? Are there applications that heavily rely on shared memory?
Furthermore, unlike L1 cache, the shared memory of GPUs is not often used in many applications, which essentially depends on programmers.
The vast majority of high-performance parallel reductions on the GPU will use shared memory. The reduction operation is prevalent throughout computing. Other important functions, such as scan operations, also generally use shared memory. It has lately become possible to replace some of this shared memory usage with warp-shuffle operations, but that is by no means a panacea, nor does it account for the legacy codebase(s).
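For reference, this is the basic shape of the shared-memory block reduction I’m referring to; it’s a minimal sketch, not a tuned implementation like the ones in CUB, and it assumes a power-of-two block size:

```
// Minimal block-sum reduction using shared memory (illustrative sketch only).
// Assumes blockDim.x is a power of two.
__global__ void blockSum(const float* __restrict__ in, float* __restrict__ out, int n)
{
    extern __shared__ float sdata[];            // one float per thread
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? in[i] : 0.0f;        // stage the data in shared memory
    __syncthreads();

    // Tree reduction: threads communicate through shared memory at each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = sdata[0];   // one partial sum per block
}
```

Launched as e.g. blockSum<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(float)>>>(in, out, n). The last few levels of the tree are exactly the part that can be replaced by warp-shuffle operations on newer architectures.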
It’s true that shared memory usage (other than implicit usage that comes about e.g. through use of a library) requires explicit instructions from the programmer (unlike the L1). If an application fits into the category of embarrassingly parallel, then it’s quite possible that the application has no need for shared memory. However, people want to solve many more than just these kinds of problems using parallel architectures. In those cases, one of the key benefits of shared memory (inter-thread communication) is often important.
Shared memory is also often used simply as a user-managed cache. Starting with the Volta architecture, NVIDIA has designed substantially larger L1 caches, so the benefits of this type of usage might become less obvious.
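The textbook example of that user-managed-cache usage is a tiled matrix multiply, where a tile is staged into shared memory once and reused by every thread in the block. A minimal sketch (the TILE size, kernel name, and the assumption that N is a multiple of TILE are mine):

```
// Shared memory as a user-managed cache: each tile of A and B is loaded once
// and reused TILE times by the block. Illustrative sketch; assumes N % TILE == 0
// and a (TILE x TILE) thread block.
#define TILE 16

__global__ void tiledMatMul(const float* A, const float* B, float* C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                          // tile now "cached" in shared memory

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = acc;
}
```

Whether explicit tiling like this still wins over simply relying on the larger L1 of Volta-and-later parts is workload-dependent, which is what I mean by the benefit becoming less obvious.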