Reuse of L1/shared memory during execution of consecutive wavefronts

Hello.

From the Nsight Compute report analysis, I have read how many wavefronts are required to complete execution of a kernel.
My question is: How is memory reused during consecutive wavefront execution? Is it re-written/flushed, or reused?

Thank you for your attention.

A wavefront is the result of executing a single instruction, such as a shared load operation. It manifests while the instruction is processed and its action is applied: for a load operation, for example, the wavefront copies data from memory to registers. There is no concept of "re-written", "flushed", or "reused" at that level. Data is simply copied from point A (an address in memory, such as shared memory) to point B (registers).

The Nsight Compute Launch Statistics section has the metric "Waves Per SM". The number of waves per SM is calculated as

Waves Per SM = Thread Blocks in Grid / (Number of SMs × Thread Blocks per SM)
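If it helps to see where those terms come from, below is a minimal sketch (the kernel name `myKernel` and the grid/block sizes are invented for illustration) that computes the same ratio at runtime, using `cudaOccupancyMaxActiveBlocksPerMultiprocessor` for the "Thread Blocks per SM" term and `cudaDeviceGetAttribute` for the SM count:

```cpp
// waves_per_sm.cu -- illustrative sketch; "myKernel" and the launch sizes are made up.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder kernel */ }

int main() {
    int device = 0;
    cudaSetDevice(device);

    int numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

    const int blockSize      = 256;    // threads per block (example value)
    const int blocksInGrid   = 10000;  // thread blocks in grid (example value)
    const size_t dynamicSmem = 0;      // dynamic shared memory per block

    int blocksPerSM = 0;               // max co-resident thread blocks per SM
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, dynamicSmem);

    // Waves Per SM = Thread Blocks in Grid / (Number of SMs * Thread Blocks per SM)
    double wavesPerSM = (double)blocksInGrid / ((double)numSMs * blocksPerSM);
    printf("SMs=%d blocksPerSM=%d wavesPerSM=%.2f\n", numSMs, blocksPerSM, wavesPerSM);
    return 0;
}
```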

The term "wavefront" in the explanation linked by Robert refers to a command packet moving through the L1TEX pipeline.

These terms can be confusing, as other GPU vendors use "waves" and "wavefronts" with different meanings (AMD's wavefront, for example, is closer to an NVIDIA warp than to a thread block/CTA).

At launch of a CTA the contents of shared memory are uninitialized (undefined). The method of shared memory address virtualization and allocation is not documented. Shared memory is definitely reused; there is no additional memory hierarchy to which shared memory would be flushed. If the kernel author wants to reuse data that was in shared memory between virtual blocks, the two options are:

  1. The producer CTA calculates the values in shared memory and stores them to global memory; the consumer CTA loads them from global memory back into shared memory. This approach is not recommended within the same grid launch.
  2. An optimally sized launch with a grid-stride loop or work queue can be used to maintain values in shared memory across multiple "virtual" blocks of work (as opposed to assigning work by blockIdx); see the sketch after this list.
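As a rough illustration of option 2, here is a minimal sketch (`TILE`, `processTile`, and the launch configuration are made up for the example) of a persistent grid-stride kernel that keeps a shared-memory scratchpad alive across many "virtual" blocks of work:

```cpp
// Sketch only: each physical CTA loops over many "virtual" blocks of work,
// so values kept in shared memory survive from one virtual block to the next
// within the same kernel launch.
#include <cuda_runtime.h>

#define TILE 256

__device__ void processTile(float* tile, int tileIdx) {
    // placeholder for per-tile work that reads/updates the shared scratchpad
    tile[threadIdx.x] += tileIdx;
}

__global__ void persistentKernel(int numTiles) {
    __shared__ float scratch[TILE];   // lives for the CTA's whole lifetime
    scratch[threadIdx.x] = 0.0f;      // initialize once per physical CTA
    __syncthreads();

    // Grid-stride loop: blockIdx.x no longer identifies a single work item;
    // each CTA walks through many virtual blocks, reusing 'scratch'.
    for (int tile = blockIdx.x; tile < numTiles; tile += gridDim.x) {
        processTile(scratch, tile);
        __syncthreads();              // keep the scratchpad consistent
    }
}

int main() {
    // Launch roughly one wave of CTAs (sizing is illustrative).
    persistentKernel<<<132, TILE>>>(100000);
    cudaDeviceSynchronize();
    return 0;
}
```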

The CUDA programming model does not provide a method to pass data through shared memory between CTAs that run on the same SM but are not co-resident (different waves). Thread Block Clusters can be used to (a) guarantee co-residency and (b) allow CTAs in the same cluster to read/write each other's shared memory; a sketch follows below.
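For completeness, here is a minimal sketch of the cluster approach, assuming a compute capability 9.0+ (Hopper) GPU and CUDA 12.x built with `-arch=sm_90`; the kernel and buffer names are invented. Each CTA writes its rank into its own shared memory, then reads its peer's shared memory through distributed shared memory:

```cpp
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// A cluster of 2 CTAs; both are guaranteed to be co-resident on the same GPC.
__global__ void __cluster_dims__(2, 1, 1) clusterKernel(int* out) {
    __shared__ int smem[1];
    cg::cluster_group cluster = cg::this_cluster();

    smem[0] = (int)cluster.block_rank();  // each CTA writes its own rank
    cluster.sync();                       // make all cluster shared memory visible

    // Read the shared memory of the *other* CTA in this 2-block cluster.
    unsigned int peer = cluster.block_rank() ^ 1;
    int* peerSmem = cluster.map_shared_rank(smem, peer);
    if (threadIdx.x == 0)
        out[blockIdx.x] = peerSmem[0];

    cluster.sync();   // don't exit while a peer may still read our shared memory
}

int main() {
    int* out;
    cudaMallocManaged(&out, 2 * sizeof(int));
    clusterKernel<<<2, 128>>>(out);       // grid of 2 CTAs = one cluster
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```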

On a GPU context switch, shared memory is cleared.