What's the difference between 'wavefronts' and 'sectors/Req'?

I think I understand the sectors/Req metric. Each request corresponds to a single instruction that issues a memory operation,
and the number of 32B sectors accessed per request depends on how well the addresses requested by a warp are coalesced.

However, what is the ‘wavefront’ metric in the memory table of nsight-cu-cli?
What is the exact meaning of it and what’s the difference from the sectors/Req metric?
They seem to be related to each other.

You can check the Metrics Decoder section in the Kernel Profiling Guide: https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-decoder
It explains all three quantities (and more).

I already read it, but still don’t understand.
It says the wavefront is

Number of unique “work packages” generated at the end of the processing stage for requests. All work items of a wavefront are processed in parallel, while work items of different wavefronts are serialized and processed on different cycles.

What are ‘work packages’, ‘processing stage’, and ‘items’? What about ‘at the end’ of the processing stage? Is ‘work package’ the number of sector accesses per request? I searched for these terms in the very same document but could not find answers. This abstraction is confusing, so I would be glad if these could be explained in more detail.

A simplified model for the processing in L1TEX for Volta chips and newer architectures can be described as follows: When an SM executes a global/local/shared memory instruction for a warp, a single request is sent to L1TEX. This request communicates the information for all participating threads of this warp (up to 32 threads). For local and global memory, based on the access pattern and the participating threads, the request needs to access a number of cache lines, and sectors within these cache lines. Internally, the L1TEX unit has multiple processing stages operating in a pipeline.
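To make the request-to-sector relationship concrete, here is a minimal Python sketch (a host-side model, not NVIDIA's implementation) that counts the distinct 32-byte sectors and cache lines touched by one warp-wide request. The function name `sectors_and_lines` and the example address lists are made up for illustration, and the 128-byte cache line (4 sectors per line) is an assumption of this sketch.

```python
SECTOR_BYTES = 32   # sector size mentioned in this thread
LINE_BYTES = 128    # assumed cache line size (4 sectors per line)

def sectors_and_lines(addresses, access_bytes=4):
    """Count distinct sectors and cache lines touched by one request.

    addresses: byte addresses, one per participating thread (up to 32).
    access_bytes: number of bytes each thread reads or writes.
    """
    sectors = set()
    for addr in addresses:
        first = addr // SECTOR_BYTES
        last = (addr + access_bytes - 1) // SECTOR_BYTES
        # An access that straddles a sector boundary touches both sectors.
        sectors.update(range(first, last + 1))
    lines = {s * SECTOR_BYTES // LINE_BYTES for s in sectors}
    return len(sectors), len(lines)

# 32 threads reading consecutive 4-byte words: fully coalesced.
coalesced = [4 * t for t in range(32)]
# 32 threads reading 4-byte words that are 128 bytes apart.
strided = [128 * t for t in range(32)]

print(sectors_and_lines(coalesced))  # (4, 1)
print(sectors_and_lines(strided))    # (32, 32)
```

Both address lists represent a single request, yet the coalesced one needs 4 sectors in 1 cache line while the strided one needs 32 sectors spread over 32 cache lines. That per-request sector count is what sectors/Req summarizes.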

A wavefront is the maximum unit of work that can pass through that pipeline stage per cycle. If not all cache lines or sectors can be accessed in a single wavefront, multiple wavefronts are created and sent for processing one by one, i.e. in a serialized manner. Limitations of the work within a wavefront may include the need for a consistent memory space, a maximum number of cache lines that can be accessed, as well as various other reasons. Each wavefront then flows through the L1TEX pipeline and fetches the sectors handled in that wavefront. In this model, the relationships between the three key values are requests:sectors 1:N, wavefronts:sectors 1:N, and requests:wavefronts 1:N.

In the documentation we describe a wavefront as a (work) package that can be processed at once, i.e. there is a notion of processing a wavefront per cycle in L1TEX. Wavefronts therefore represent the number of cycles required to process the requests, while the number of sectors per request is a property of the access pattern of the memory instruction for all participating threads. For example, it is possible to have a memory instruction that requires 4 sectors per request in 1 wavefront. However, you can also have a memory instruction having 4 sectors per request, but requiring 2 or more wavefronts.
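The last example (same sectors per request, different wavefront counts) can be illustrated with a toy Python model. This is only a sketch under loud assumptions: the real rules L1TEX uses to split a request into wavefronts are not spelled out in this thread, so the limit `max_lines_per_wavefront=2`, the 128-byte cache line size, and the function name `toy_wavefronts` are all hypothetical and chosen purely for illustration.

```python
import math

SECTOR_BYTES = 32   # sector size mentioned in this thread
LINE_BYTES = 128    # assumed cache line size

def toy_wavefronts(addresses, max_lines_per_wavefront=2):
    """Toy model: return (sectors, wavefronts) for one request.

    The per-wavefront cache line limit is a HYPOTHETICAL value, not a
    documented hardware parameter. Each access is assumed to stay
    within a single sector.
    """
    sectors = {addr // SECTOR_BYTES for addr in addresses}
    lines = {addr // LINE_BYTES for addr in addresses}
    # Work that does not fit in one wavefront is serialized into more.
    wavefronts = math.ceil(len(lines) / max_lines_per_wavefront)
    return len(sectors), wavefronts

same_line = [32 * t for t in range(4)]   # 4 sectors inside 1 cache line
spread = [128 * t for t in range(4)]     # 4 sectors in 4 cache lines

print(toy_wavefronts(same_line))  # (4, 1)
print(toy_wavefronts(spread))     # (4, 2)
```

Both requests touch 4 sectors, so sectors/Req is identical, but under the assumed limit the second pattern needs an extra wavefront, i.e. an extra processing cycle in this model.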

Thank you for the detailed explanation!

On Thu, Jan 7, 2021 at 4:48 PM, felix_dt via NVIDIA Developer Forums <nvidia@discoursemail.com> wrote: