Question about l1tex__data_pipe_lsu_wavefronts.avg

Hi! I am profiling a kernel that consists mostly of global loads and only a few stores.

l1tex__t_output_wavefronts_pipe_lsu_mem_global_op_ld.sum.pct_of_peak_sustained_elapsed is 27.06% (in the Memory Workload Analysis section), which I think is related to l1tex__t_output_wavefronts_pipe_lsu_mem_global_op_ld.sum divided by cycles. That ratio is 154,655,210,723 / 4,464,653,337 ≈ 34.6, which I interpreted as roughly 34%.
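For reference, my rough understanding of how the percentage is derived (I may be missing a normalization factor here, e.g. the per-unit peak rate and the number of L1TEX units, which would explain why my hand calculation doesn't match exactly):

pct_of_peak_sustained_elapsed ≈ 100 × wavefronts.sum / (num_L1TEX_units × peak_wavefronts_per_unit_per_cycle × gpc__cycles_elapsed)

where num_L1TEX_units and peak_wavefronts_per_unit_per_cycle are device-specific values, not numbers I read from the report.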

But L1: Data Pipe LSU Wavefronts (in the Speed Of Light section) is 99.72% (l1tex__data_pipe_lsu_wavefronts.avg.pct_of_peak_sustained_elapsed). Why do the two wavefront counts differ so much (99% vs. 27%, l1tex__t_output_wavefronts_pipe_lsu_mem_global_op_ld vs. l1tex__data_pipe_lsu_wavefronts)? Most of my operations are global loads.

In my observation, if the global loads are perfectly coalesced, the two wavefront numbers are similar; otherwise they differ a lot. Are there different kinds of wavefronts in L1/TEX? Thanks a lot for the help!
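For context, the kinds of access patterns I compared look roughly like this (a simplified sketch; the kernel names and the stride are made up for illustration):

```cuda
// Hypothetical illustration: both kernels move the same amount of data,
// but the strided version splits each warp's request into many more
// sectors/wavefronts in L1TEX.

__global__ void coalesced_load(const float* __restrict__ in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads -> consecutive addresses
    if (i < n) out[i] = in[i];
}

__global__ void strided_load(const float* __restrict__ in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * stride) % n;                        // with a large stride, each thread in a warp
    if (i < n) out[i] = in[j];                       // touches a different cache line
}
```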

What exact metric is “cycles” here in your comparison? There are different types of cycles, particularly “active” and “elapsed”, and you can’t compare the two in such a measurement.

Thanks for the reply! The cycle count is gpc__cycles_elapsed.max, which I believe is the elapsed cycle count shown at the top of the Details page in Nsight Compute.

I just wonder about the difference between l1tex__data_pipe_lsu_wavefronts.avg.pct_of_peak_sustained_elapsed and l1tex__t_output_wavefronts_pipe_lsu_mem_global_op_ld.sum.pct_of_peak_sustained_elapsed. They both seem to measure L1 wavefronts, but their numbers differ a lot (one 27%, the other 99%). Also, how should I optimize them?

Thanks for the help!

I am checking with the team on this.

Any update on this? Thanks for the help!

I think both the metric descriptions and this schematic may help in understanding the differences:

l1tex__data_pipe_lsu_wavefronts.avg.pct_of_peak_sustained_elapsed measures the number of local/global/shared + surface write wavefronts processed by the Data-Stage, as a percentage of peak.

l1tex__t_output_wavefronts_pipe_lsu_mem_global_op_ld.sum measures the number of wavefronts sent from the T-Stage to the Data-Stage for global loads; the .pct_of_peak_sustained_elapsed variant reports this as a percentage of peak. As you can see in the Memory Tables for L1/TEX, these are the wavefronts for Global Loads as well as Global Loads to Shared Store (i.e., LDGSTS instructions). LDGSTS instructions are used for loading from global memory to shared memory without requiring register accesses in between. However, based on the memory table, your code doesn’t seem to use LDGSTS.
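For completeness, here is a minimal sketch of the kind of code that typically generates LDGSTS on sm_80 and newer, using the cooperative groups memcpy_async API (the kernel itself is hypothetical, and whether the compiler actually emits LDGSTS depends on target architecture, alignment, and copy size):

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Sketch only: copies one 256-element tile from global to shared memory
// asynchronously (assumes blockDim.x == 256). On sm_80+ this typically
// lowers to LDGSTS, so the data moves into shared memory without a round
// trip through the register file.
__global__ void async_copy_tile(const float* __restrict__ gmem, float* out) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    cg::memcpy_async(block, tile, gmem + blockIdx.x * 256, sizeof(float) * 256);
    cg::wait(block);  // wait for the async copies issued by this block to complete

    out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x] * 2.0f;
}
```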

Without knowing more about the remaining rules output of your report, I would expect that Sectors/Req could be problematic, potentially due to a bad access pattern. Are there any suggestions generated by ncu to that effect?

The average ratio of sectors to requests for the L1 cache. For the same number of active threads in a warp, smaller numbers imply a more efficient memory access pattern. For warps with 32 active threads, the optimal ratios per access size are: 32-bit: 4, 64-bit: 8, 128-bit: 16. Smaller ratios indicate some degree of uniformity or overlapped loads within a cache line. Higher numbers can imply uncoalesced memory accesses and will result in increased memory traffic.
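To make the 128-bit case concrete (a hypothetical kernel, not taken from your report): in a fully coalesced warp where every thread loads one float4, the warp reads 32 × 16 B = 512 B, i.e. 16 sectors for a single request, which matches the stated optimum:

```cuda
// Hypothetical example: each thread loads 16 bytes (one float4).
// A fully coalesced warp then touches 32 * 16 B = 512 B = 16 sectors,
// matching the optimal 128-bit ratio of 16 sectors per request.
__global__ void vec4_load(const float4* __restrict__ in, float4* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}
```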

As stated by @felix_dt, l1tex__data_pipe_lsu_wavefronts counts many additional operations. In the case of a global load miss, there are additional data-stage wavefronts to write the L2 return data into the data RAMs, and then a final read of the RAMs to return the data to the register file.

Thanks for the help!
