I read the following article, which is about improving the overall performance of the GPU for applications with irregular access patterns by redesigning the L1 cache: https://dl.acm.org/doi/fullHtml/10.1145/3322127
In summary, Elastic-Cache/Plus works like this:
The cache line is divided into n logical chunks so that data from non-contiguous memory locations can be stored in each chunk. The cache architecture supports both fine- and coarse-grained cache line management, so either 32 bytes or 128 bytes can be accessed. The metadata associated with the chunks of a cache set is stored in shared memory across different banks to enable parallel access. The request queue of the Load/Store unit is divided into four separate queues, each issuing 32-byte requests in parallel; all four queues together are used to issue a single 128-byte request.
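To make that concrete, here is a rough data-structure sketch of how I understand the per-line bookkeeping. This is purely my own illustration; the names, field widths, and layout are assumptions, not taken from the paper:

```
// Conceptual sketch of one "elastic" L1 cache line, as I understand the paper.
// All names and field sizes are illustrative assumptions, not the paper's design.
#include <cstdint>

constexpr int LINE_BYTES   = 128;
constexpr int SECTOR_BYTES = 32;
constexpr int NUM_SECTORS  = LINE_BYTES / SECTOR_BYTES;   // the n logical chunks

struct ChunkMeta {
    uint64_t tag;      // each chunk can hold data from a different, non-contiguous address
    bool     valid;
    bool     dirty;
};

struct ElasticCacheLine {
    // In the paper this metadata lives in shared memory, spread across banks
    // so that the entries can be looked up in parallel.
    ChunkMeta meta[NUM_SECTORS];
    uint8_t   data[LINE_BYTES];   // 4 x 32-byte chunks of payload
};
```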
Such an L1 cache architecture should improve IPC (instructions per cycle) and reduce the cache miss rate, thereby improving the overall performance of applications with irregular access patterns.
My question is: did NVIDIA implement a similar technology, or will NVIDIA bring out such a technology in the future?
I won’t be able to answer the question in detail or provide forward-looking statements. (I’d be very surprised if you found a forward-looking authoritative statement about that anywhere on these forums, by anyone at NVIDIA, at any time.)
However, the L1 cache architecture of GPUs did go through a design transition, which is discoverable programmatically and somewhat documented, in the Maxwell and Pascal timeframe. Prior to that, the L1 cache line was fixed at 128 bytes, and an L1 miss would trigger 128 bytes of “requests”. At least by the Pascal generation, it is observable that the L1 cache line has 32-byte components. Depending on the nature of the L1 miss, it’s possible that an L1 miss might only trigger 32 bytes of “requests” to the next level of the memory hierarchy.
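There is no API I’m aware of to query this directly; it shows up in microbenchmarks and profiler counters. Just to illustrate the kind of access pattern where the 32-byte granularity matters, here is a contrived gather kernel (the kernel and array names are my own, not from any NVIDIA sample): each group of 8 threads reads one contiguous 32-byte chunk, but the chunks themselves are scattered. With 32-byte fill granularity, a miss only needs to bring in the chunk actually touched; with a monolithic 128-byte fill, each miss would also drag in 96 bytes that may never be used.

```
// Contrived gather kernel: each warp touches four scattered 32-byte chunks.
__global__ void gather32(const float* __restrict__ src,
                         const int*   __restrict__ groupIdx,  // scattered group indices (assumption)
                         float*       __restrict__ dst,
                         int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // 8 consecutive threads share one group -> one contiguous 32-byte read,
        // but the groups themselves are scattered across memory.
        int base = groupIdx[i / 8] * 8;
        dst[i] = src[base + (i % 8)];
    }
}
```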
I acknowledge this probably doesn’t address your questions.
A casual read of the paper suggests to me that, although it was written in 2019, it takes a very dated viewpoint:
the size of L1 cache line is no smaller than that of warp-wide accesses.
We use GPGPU-Sim (version 3.2.2) [1, 17] for evaluation. We configure the simulator to model a GPU similar to NVIDIA’s GTX480:
Yes, that might be true. I have another question: in the paper it is written that shared memory is mostly unused. Is that true? Are there applications that heavily rely on shared memory?
Furthermore, unlike L1 cache, the shared memory of GPUs is not often used in many applications, which essentially depends on programmers.
The vast majority of high-performance parallel reductions on the GPU will use shared memory. The reduction operation is prevalent throughout computing. Other important functions, such as scan operations, also generally use shared memory. It has lately become possible to replace some of this shared memory usage with warp-shuffle operations, but that is by no means a panacea, nor does it account for the legacy codebase(s).
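For reference, this is the basic shape of the shared-memory block reduction I’m referring to; it’s a minimal sketch, not a tuned implementation like the ones in CUB, and it assumes a power-of-two block size:

```
// Minimal block-sum reduction using shared memory (illustrative sketch only).
// Assumes blockDim.x is a power of two.
__global__ void blockSum(const float* __restrict__ in, float* __restrict__ out, int n)
{
    extern __shared__ float sdata[];            // one float per thread
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? in[i] : 0.0f;        // stage the data in shared memory
    __syncthreads();

    // Tree reduction: threads communicate through shared memory at each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = sdata[0];   // one partial sum per block
}
```

Launched as e.g. blockSum<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(float)>>>(in, out, n). The last few levels of the tree are exactly the part that can be replaced by warp-shuffle operations on newer architectures.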
It’s true that shared memory usage (other than implicit usage that comes about e.g. through use of a library) requires explicit instructions from the programmer (unlike the L1). If an application fits into the category of embarrassingly parallel, then it’s quite possible that the application has no need for shared memory. However, people want to solve many more than just these kinds of problems using parallel architectures. In those cases, one of the key benefits of shared memory (inter-thread communication) is often important.
Shared memory is also often used simply as a user-managed cache. Starting with the Volta architecture, NVIDIA has designed substantially larger L1 caches, so the benefits of this type of usage might become less obvious.
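The textbook example of that user-managed-cache usage is a tiled matrix multiply, where a tile is staged into shared memory once and reused by every thread in the block. A minimal sketch (the TILE size, kernel name, and the assumption that N is a multiple of TILE are mine):

```
// Shared memory as a user-managed cache: each tile of A and B is loaded once
// and reused TILE times by the block. Illustrative sketch; assumes N % TILE == 0
// and a (TILE x TILE) thread block.
#define TILE 16

__global__ void tiledMatMul(const float* A, const float* B, float* C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                          // tile now "cached" in shared memory

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = acc;
}
```

Whether explicit tiling like this still wins over simply relying on the larger L1 of Volta-and-later parts is workload-dependent, which is what I mean by the benefit becoming less obvious.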