Using shared memory does not necessarily appear as an L1 hit; you have to look at the tables below the graph in Nsight Compute to see the amount of shared memory stores and loads.
If you handle the data caching manually with shared memory, no L1 cache is needed.
With shared memory you can better control when the storage of data starts and when it ends. Also, each thread can access a different bank of shared memory, so the coalescing requirements are relaxed.
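For illustration, here is a minimal sketch of such manual caching (the kernel name `blur3`, the 1-D 3-point stencil, and the tile size are my own choices for the example): each block stages a tile plus a halo into shared memory, synchronizes, and then every thread reads its neighbors from the tile.

```cpp
// Minimal sketch: manual caching of a 1-D 3-point stencil in shared memory.
// Assumes the kernel is launched with TILE threads per block.
#define TILE 256

__global__ void blur3(const float* __restrict__ in, float* out, int n)
{
    __shared__ float tile[TILE + 2];            // +2 for the halo elements

    int gid = blockIdx.x * TILE + threadIdx.x;  // global index
    int lid = threadIdx.x + 1;                  // local index, offset past the left halo

    // You decide exactly when data enters the on-chip storage ...
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0]        = (gid > 0)     ? in[gid - 1] : 0.0f;
    if (threadIdx.x == TILE - 1)
        tile[TILE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();

    // ... and afterwards each thread can read any bank without
    // worrying about coalescing.
    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```

The load phase before `__syncthreads()` is exactly that explicit control over when data is stored; it stays resident until the block finishes, regardless of what the caches do.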
L1 is simpler to use. It is the better choice if you cannot predict which data will be reused, or if predicting it is complicated, e.g. because the timing of the different warps needing access to the data is not predetermined, because your overall working set is larger than the shared memory size, because your program speed is not bound by data reuse anyway, or because your problem and algorithm do not need repeated accesses at all (as with element-wise operations).
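For contrast, a sketch of the same stencil written the "L1 way" (again a hypothetical kernel, not from the question): just issue the overlapping global loads and let the cache catch the reuse between neighboring threads.

```cpp
// Same 3-point blur relying on L1: adjacent threads load overlapping
// addresses, so most of the three reads per thread should hit in cache.
__global__ void blur3_l1(const float* __restrict__ in, float* out, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid <= 0 || gid >= n - 1) return;   // skip the borders for brevity
    out[gid] = (in[gid - 1] + in[gid] + in[gid + 1]) / 3.0f;
}
```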
Many kernels could be written to use either L1 or shared memory (or to use each for a different kind of data).
You can also write kernels that use neither, because they keep the data within warps or threads.
Warps can use shuffle instructions to exchange data (shuffle uses the data-shuffling portion of the shared memory hardware, but not the memory itself), and threads can hold data in registers. If you unroll loops, you can even keep indexed arrays in registers. For example, I have done complex FFT calculations with an array of 64 complex numbers held in the registers of a single thread (which needs 128 registers), so you can calculate 32 FFTs per warp simultaneously with no need for shared memory.
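Two short sketches of those techniques (function and kernel names are mine): a warp-wide sum exchanged purely via `__shfl_down_sync`, and a small per-thread array that can stay in registers because the fully unrolled loops make every index a compile-time constant.

```cpp
// Warp-level reduction: values move between lanes via shuffle,
// no shared memory storage involved. Assumes all 32 lanes are active.
__device__ float warp_sum(float v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;   // lane 0 now holds the sum over the whole warp
}

// Per-thread array held in registers: with #pragma unroll every index
// is known at compile time, so the compiler can avoid spilling to
// local memory (as long as the register budget allows it).
// Assumes the data length is a multiple of 8 * blockDim.x * gridDim.x.
__global__ void scale8(float* data)
{
    float r[8];
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 8;

    #pragma unroll
    for (int i = 0; i < 8; ++i) r[i] = data[base + i];

    #pragma unroll
    for (int i = 0; i < 8; ++i) r[i] *= 2.0f;  // stand-in for real per-element work

    #pragma unroll
    for (int i = 0; i < 8; ++i) data[base + i] = r[i];
}
```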