Question about l1tex__data_pipe_lsu_wavefronts.avg

Hi! I am profiling a kernel that consists mostly of global loads and only a few stores.

l1tex__t_output_wavefronts_pipe_lsu_mem_global_op_ld.sum.pct_of_peak_sustained_elapsed is 27.06% (in the Memory Workload Analysis section), which I think is related to l1tex__t_output_wavefronts_pipe_lsu_mem_global_op_ld.sum divided by cycles. That ratio is 154,655,210,723 / 4,464,653,337 ≈ 34.6, which I interpreted as roughly 34%.
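For reference, my rough understanding of how the percentage is derived (I may be missing a normalization factor here, e.g. the per-unit peak rate and the number of L1TEX units, which would explain why my hand calculation doesn't match exactly):

pct_of_peak_sustained_elapsed ≈ 100 × wavefronts.sum / (num_L1TEX_units × peak_wavefronts_per_unit_per_cycle × gpc__cycles_elapsed)

where num_L1TEX_units and peak_wavefronts_per_unit_per_cycle are device-specific values, not numbers I read from the report.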

But L1: Data Pipe LSU Wavefronts (in the Speed Of Light section) is 99.72% (l1tex__data_pipe_lsu_wavefronts.avg.pct_of_peak_sustained_elapsed). Why do the two wavefront counts differ so much (99% vs. 27%, l1tex__t_output_wavefronts_pipe_lsu_mem_global_op_ld vs. l1tex__data_pipe_lsu_wavefronts)? Most of my operations are global loads.

In my observation, if the global loads are perfectly coalesced, the two wavefront numbers are similar; otherwise they differ a lot. Are there different kinds of wavefronts in L1/TEX? Thanks a lot for the help!
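For context, the kinds of access patterns I compared look roughly like this (a simplified sketch; the kernel names and the stride are made up for illustration):

```cuda
// Hypothetical illustration: both kernels move the same amount of data,
// but the strided version splits each warp's request into many more
// sectors/wavefronts in L1TEX.

__global__ void coalesced_load(const float* __restrict__ in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads -> consecutive addresses
    if (i < n) out[i] = in[i];
}

__global__ void strided_load(const float* __restrict__ in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * stride) % n;                        // with a large stride, each thread in a warp
    if (i < n) out[i] = in[j];                       // touches a different cache line
}
```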

What exact metric is “cycles” here in your comparison? There are different types of cycles, particularly “active” and “elapsed”, and you can’t compare the two in such a measurement.

Thanks for the reply! The cycle count is gpc__cycles_elapsed.max, which I believe is the elapsed cycle count shown at the top of the Details page in Nsight Compute.

I just wonder about the difference between l1tex__data_pipe_lsu_wavefronts.avg.pct_of_peak_sustained_elapsed and l1tex__t_output_wavefronts_pipe_lsu_mem_global_op_ld.sum.pct_of_peak_sustained_elapsed. They both seem to measure L1 wavefronts, but their numbers differ a lot (one 27%, the other 99%). Also, how should I optimize them?

Thanks for the help!

I am checking with the team on this.

Any update on this? Thanks for the help!

I think both the metric descriptions and this schematic may help in understanding the differences:

l1tex__data_pipe_lsu_wavefronts.avg.pct_of_peak_sustained_elapsed measures the number of local/global/shared + surface write wavefronts processed by the Data-Stage, as a percentage of peak.

l1tex__t_output_wavefronts_pipe_lsu_mem_global_op_ld.sum measures the number of wavefronts sent from the T-Stage to the Data-Stage for global loads; the .pct_of_peak_sustained_elapsed variant reports this as a percentage of peak. As you can see in the Memory Tables for L1/TEX, these are the wavefronts for Global Loads as well as Global Loads to Shared Store (i.e., LDGSTS instructions). LDGSTS instructions are used for loading from global memory to shared memory without requiring register accesses in between. However, based on the memory table, your code doesn’t seem to use LDGSTS.
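For completeness, here is a minimal sketch of the kind of code that typically generates LDGSTS on sm_80 and newer, using the cooperative groups memcpy_async API (the kernel itself is hypothetical, and whether the compiler actually emits LDGSTS depends on target architecture, alignment, and copy size):

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Sketch only: copies one 256-element tile from global to shared memory
// asynchronously (assumes blockDim.x == 256). On sm_80+ this typically
// lowers to LDGSTS, so the data moves into shared memory without a round
// trip through the register file.
__global__ void async_copy_tile(const float* __restrict__ gmem, float* out) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    cg::memcpy_async(block, tile, gmem + blockIdx.x * 256, sizeof(float) * 256);
    cg::wait(block);  // wait for the async copies issued by this block to complete

    out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x] * 2.0f;
}
```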

Without knowing more about the remaining rules output of your report, I would expect that Sectors/Req could be problematic, potentially due to a bad access pattern. Are there any suggestions generated by ncu to that effect?

The average ratio of sectors to requests for the L1 cache. For the same number of active threads in a warp, smaller numbers imply a more efficient memory access pattern. For warps with 32 active threads, the optimal ratios per access size are: 32-bit: 4, 64-bit: 8, 128-bit: 16. Smaller ratios indicate some degree of uniformity or overlapped loads within a cache line. Higher numbers can imply uncoalesced memory accesses and will result in increased memory traffic.
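To make the 128-bit case concrete (a hypothetical kernel, not taken from your report): in a fully coalesced warp where every thread loads one float4, the warp reads 32 × 16 B = 512 B, i.e. 16 sectors for a single request, which matches the stated optimum:

```cuda
// Hypothetical example: each thread loads 16 bytes (one float4).
// A fully coalesced warp then touches 32 * 16 B = 512 B = 16 sectors,
// matching the optimal 128-bit ratio of 16 sectors per request.
__global__ void vec4_load(const float4* __restrict__ in, float4* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}
```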

As stated by @felix_dt, l1tex__data_pipe_lsu_wavefronts counts many additional operations. In the case of a global load miss, there are additional data-stage wavefronts to write the L2 return data into the data RAMs, and then a final read of the RAMs to return the data to the register file.

Thanks for the help!
