Reuse of L1/shared memory during execution of consecutive wavefronts

Hello.

From the Nsight Compute report analysis, I have read how many wavefronts are required to complete execution of a kernel.
My question is: How is memory reused during consecutive wavefront execution? Is it re-written/flushed, or reused?

Thank you for your attention.

A wavefront is the result of executing a single instruction, such as a shared load operation. It manifests while the instruction is processed and its action is applied: for a load operation, for example, the wavefront copies data from memory to registers. There is no concept of "re-written", "flushed", or "reused" at that level. Data is simply copied from point A (an address in memory, such as shared memory) to point B (registers).

The Nsight Compute Launch Statistics section has the metric "Waves Per SM". The number of waves per SM is calculated as

Waves Per SM = Thread Blocks in Grid / (Number of SMs × Thread Blocks per SM)
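If it helps to see where those terms come from, below is a minimal sketch (the kernel name `myKernel` and the grid/block sizes are invented for illustration) that computes the same ratio at runtime, using `cudaOccupancyMaxActiveBlocksPerMultiprocessor` for the "Thread Blocks per SM" term and `cudaDeviceGetAttribute` for the SM count:

```cpp
// waves_per_sm.cu -- illustrative sketch; "myKernel" and the launch sizes are made up.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder kernel */ }

int main() {
    int device = 0;
    cudaSetDevice(device);

    int numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

    const int blockSize      = 256;    // threads per block (example value)
    const int blocksInGrid   = 10000;  // thread blocks in grid (example value)
    const size_t dynamicSmem = 0;      // dynamic shared memory per block

    int blocksPerSM = 0;               // max co-resident thread blocks per SM
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, dynamicSmem);

    // Waves Per SM = Thread Blocks in Grid / (Number of SMs * Thread Blocks per SM)
    double wavesPerSM = (double)blocksInGrid / ((double)numSMs * blocksPerSM);
    printf("SMs=%d blocksPerSM=%d wavesPerSM=%.2f\n", numSMs, blocksPerSM, wavesPerSM);
    return 0;
}
```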

The term "wavefront" in the explanation linked by Robert refers to a command packet moving through the L1TEX pipeline.

These terms can be confusing, as other GPU vendors use "waves" and "wavefronts" with different meanings (AMD's wavefront, for example, is closer to an NVIDIA warp than to a thread block/CTA).

At launch of a CTA the contents of shared memory are uninitialized (undefined). The method of shared memory address virtualization and allocation is not documented. Shared memory is definitely reused; there is no additional memory hierarchy to which shared memory would be flushed. If the kernel author wants to reuse data that was in shared memory between virtual blocks, the two options are:

  1. The producer CTA calculates the values in shared memory and stores them to global memory; the consumer CTA loads them from global memory back into shared memory. This approach is not recommended within the same grid launch.
  2. An optimally sized launch with a grid-stride loop or work queue can be used to maintain values in shared memory across multiple "virtual" blocks of work (as opposed to assigning work by blockIdx); see the sketch after this list.
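As a rough illustration of option 2, here is a minimal sketch (`TILE`, `processTile`, and the launch configuration are made up for the example) of a persistent grid-stride kernel that keeps a shared-memory scratchpad alive across many "virtual" blocks of work:

```cpp
// Sketch only: each physical CTA loops over many "virtual" blocks of work,
// so values kept in shared memory survive from one virtual block to the next
// within the same kernel launch.
#include <cuda_runtime.h>

#define TILE 256

__device__ void processTile(float* tile, int tileIdx) {
    // placeholder for per-tile work that reads/updates the shared scratchpad
    tile[threadIdx.x] += tileIdx;
}

__global__ void persistentKernel(int numTiles) {
    __shared__ float scratch[TILE];   // lives for the CTA's whole lifetime
    scratch[threadIdx.x] = 0.0f;      // initialize once per physical CTA
    __syncthreads();

    // Grid-stride loop: blockIdx.x no longer identifies a single work item;
    // each CTA walks through many virtual blocks, reusing 'scratch'.
    for (int tile = blockIdx.x; tile < numTiles; tile += gridDim.x) {
        processTile(scratch, tile);
        __syncthreads();              // keep the scratchpad consistent
    }
}

int main() {
    // Launch roughly one wave of CTAs (sizing is illustrative).
    persistentKernel<<<132, TILE>>>(100000);
    cudaDeviceSynchronize();
    return 0;
}
```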

The CUDA programming model does not provide a method to pass data through shared memory between CTAs that run on the same SM but are not co-resident (different waves). Thread Block Clusters can be used to (a) guarantee co-residency and (b) allow CTAs in the same cluster to read/write each other's shared memory; a sketch follows below.
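For completeness, here is a minimal sketch of the cluster approach, assuming a compute capability 9.0+ (Hopper) GPU and CUDA 12.x built with `-arch=sm_90`; the kernel and buffer names are invented. Each CTA writes its rank into its own shared memory, then reads its peer's shared memory through distributed shared memory:

```cpp
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// A cluster of 2 CTAs; both are guaranteed to be co-resident on the same GPC.
__global__ void __cluster_dims__(2, 1, 1) clusterKernel(int* out) {
    __shared__ int smem[1];
    cg::cluster_group cluster = cg::this_cluster();

    smem[0] = (int)cluster.block_rank();  // each CTA writes its own rank
    cluster.sync();                       // make all cluster shared memory visible

    // Read the shared memory of the *other* CTA in this 2-block cluster.
    unsigned int peer = cluster.block_rank() ^ 1;
    int* peerSmem = cluster.map_shared_rank(smem, peer);
    if (threadIdx.x == 0)
        out[blockIdx.x] = peerSmem[0];

    cluster.sync();   // don't exit while a peer may still read our shared memory
}

int main() {
    int* out;
    cudaMallocManaged(&out, 2 * sizeof(int));
    clusterKernel<<<2, 128>>>(out);       // grid of 2 CTAs = one cluster
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```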

On a GPU context switch, shared memory is cleared.