I’m unsure if I should ask this here or on the Nsight forum, but I’m taking a guess that this is related to OptiX.
I’ve been trying to profile the path tracer application in the SDK, mainly looking at the cache hit rates.
However there seems to be a discrepancy between the ‘global’ hit rate numbers for L1/L2, and the numbers reported in the memory chart report.
I highlighted the two with red and green rectangles in the screenshot below:
As can be seen, the globally reported hit rates (L1: 85.05%, L2: 62.43%) are vastly different from what the “Total” rows in the chart say (L1: 47.57%, L2: 59.57%).
Can anyone guide me as to what reason might cause this issue?
I haven’t seen this large a gap between the two numbers in non-OptiX workloads, so my guess is that this may have something to do with memory accesses coming from the RT cores: e.g. one of the numbers accounts for cache hits from accesses issued by the RT cores, while the other only reflects accesses from the SMs.
Thank you in advance!
It is not expected that these values differ by this much. Minor differences are possible as those are actually different metrics. In the latest version of the tool, you can see the computation for the Memory Workload Analysis tables as a tooltip and compare that against the tooltip shown in the header table for that section.
To understand better why this might happen for you, could you provide some additional info:
- Which exact version of Nsight Compute are you using?
- Which GPU is this on?
- Did you disable cache control or clock control while profiling, which could reduce stability across replay passes?
- Was this collected using kernel or application replay?
As a note, you can use a newer version of Nsight Compute even with applications using older CUDA toolkits in case you want to try 2021.2.
Thanks for the response.
As for the info, I’m using:
- Nsight Compute 2021.1
- Tesla T4, on an AWS EC2 instance (EDIT: g4dn.2xlarge)
- both cache and clock control were disabled, although I locked the SM/memory clock rates using nvidia-smi (roughly as sketched below)
- I didn’t use replay, although I’m not 100% sure what that means: I just did a “normal” non-interactive profile.
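For reference, the clock locking was done roughly along these lines (a sketch; the clock values are placeholders that have to be taken from the supported-clocks query, and locking the memory clock requires a driver/GPU combination that supports it):

    # list the clock rates the T4 supports
    nvidia-smi -q -d SUPPORTED_CLOCKS
    # pin the SM and memory clocks to fixed values (placeholders, in MHz)
    sudo nvidia-smi --lock-gpu-clocks=<sm_clock>,<sm_clock>
    sudo nvidia-smi --lock-memory-clocks=<mem_clock>,<mem_clock>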
I will try the latest version again. One question though: will the tooltip feature in the newer version still work if I load the ncu result file that was produced with 2021.1?
Thanks.
Thanks for the info. Cache control being disabled could definitely cause this, as these are multi-pass metrics, i.e. the kernel is executed multiple times and each time, some counters are collected. This means that counters generating one of the rates could be collected in one pass while the second metric in another. Without cache control, the performance behavior could significantly differ between the passes.
didn’t use replay, although not 100% sure what that means: I just did a “normal” non-interactive profiling.
If you didn’t configure anything, you used the default Kernel Replay, for which the individual kernel is re-run multiple times. Cache control can be important (depending on the requested metrics) in this mode. The alternative is Application Replay, where the whole app is re-run several times and counters are collected in each of these passes. For this mode, cache control can normally be disabled, assuming that the app produces stable initial cache contents itself.
You can change the replay mode using the respective settings in the Profile activity in the UI or using --replay-mode kernel/application on the command line.
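For example (a minimal sketch; the report name and application binary are placeholders):

    # default: kernel replay, each kernel is re-run in-process once per pass
    ncu --replay-mode kernel -o report ./my_app
    # application replay: the whole app is re-run once per pass
    ncu --replay-mode application -o report ./my_app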
will the tooltip feature in the newer version still work if I load up the ncu result file that was produced using the 2021.1 version?
As long as the metrics for the entries in the Memory Workload Analysis tables haven’t changed in between the two versions, it will work. For the header tables, it always works, as those metrics are defined by the .section files during collection, not by the UI. For your architecture and versions, there have been no changes for these metrics, so it will work.
Cache control being disabled could definitely cause this, as these are multi-pass metrics, i.e. the kernel is executed multiple times and each time, some counters are collected. This means that counters generating one of the rates could be collected in one pass while the second metric in another. Without cache control, the performance behavior could significantly differ between the passes.
Ah, I was not aware of the replay mechanism before, and now things make more sense. Since my focus is on accurate cache metrics and flushing the caches between every kernel launch would likely perturb the behavior I want to measure, I think Application Replay mode is what I want to go with.
Still, I just took a look at the replay options and am a little confused with what the “replay match” and the “replay buffer” options mean. Could you give some explanation to this?
Unless you have very limited disk space, you don’t need to change the replay buffer setting. By default it uses a temporary file to stream the intermediate results to/from disk.
Replay match decides how kernel instances are matched across replay passes. The documentation on application replay that I linked earlier has a description of it, including a schematic.
Across application replay passes, NVIDIA Nsight Compute matches metric data for the individual, selected kernel launches.
The matching strategy can be selected using the --app-replay-match option.
For matching, only kernels within the same process and running on the same device are considered.
By default, the grid strategy is used, which matches launches according to their kernel name and grid size.
When multiple launches have the same attributes (e.g. name and grid size), they are matched in execution order.
If your app is fairly deterministic, you won’t need to change this option either and the grid strategy will work.
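Spelled out on the command line, the defaults discussed above look roughly like this (a sketch; the application binary is a placeholder, and the buffer option may be spelled slightly differently depending on your version):

    # match kernel instances across passes by kernel name and grid size (the default),
    # streaming intermediate results through a temporary file on disk (also the default)
    ncu --replay-mode application --app-replay-match grid --app-replay-buffer file -o report ./my_app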
I see. Thanks for the guide and I’ll report back if the same problem happens with the new settings.
I reran the experiment with 2021.2.0 and Application Replay mode, but the discrepancy is still there. The tooltip does show me the formula for the hit rates; it looks like the number in the chart is taken directly from a performance counter (l1tex__t_sector_hit_rate.pct), whereas the number in the table sums the lookup counts to the different memory units (e.g. lsu_mem_local_..._lookup_hit.sum) to get the hit/miss numbers and does a simple hit/(hit+miss) calculation. However, this does not tell me exactly where the discrepancy is coming from.
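For reference, the underlying numbers can be pulled out of the report on the command line and compared by hand, roughly like this (a sketch; report.ncu-rep and ./my_app are placeholders):

    # dump every collected counter per kernel to CSV for manual comparison
    ncu --import report.ncu-rep --page raw --csv > raw.csv
    # or re-collect just the directly-counted hit rate as a sanity check
    ncu --metrics l1tex__t_sector_hit_rate.pct ./my_app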
Which one of the two do you think is more reliable? My guess is the .pct number, since it seems to come directly from a performance counter; is this true?
Also, do you think this is somehow related to the fact that this application utilizes the RT cores?
I checked with the team that creates our metrics library, and the primary difference between the two numbers appears to be that l1tex__t_sector_hit_rate.pct includes traffic coming from units other than the ones aggregated “manually” in the table.
With respect to the difference for L2, this is still under investigation. We still believe that this is caused by the fact that counters are collected over multiple passes (even in application replay).