I’m unsure if I should ask this here or on the Nsight forum, but I’m taking a guess that this is related to OptiX.
I’ve been trying to profile the path tracer application in the SDK, mainly looking at the cache hit rates.
However there seems to be a discrepancy between the ‘global’ hit rate numbers for L1/L2, and the numbers reported in the memory chart report.
I highlighted the two with red and green rectangles in the screenshot below:
As can be seen, the globally reported hit rates (L1: 85.05%, L2: 62.43%) are vastly different from what the “Total” rows in the chart say (L1: 47.57%, L2: 59.57%).
Can anyone guide me as to what reason might cause this issue?
I haven’t seen this large a gap between the two numbers in non-OptiX workloads, so my guess is that this may have something to do with memory accesses coming from the RT cores: e.g. one of the numbers accounts for cache hits from accesses issued by the RT cores, while the other only reflects accesses from the SMs.
Thank you in advance!
It is not expected that these values differ by this much. Minor differences are possible as those are actually different metrics. In the latest version of the tool, you can see the computation for the Memory Workload Analysis tables as a tooltip and compare that against the tooltip shown in the header table for that section.
To understand better why this might happen for you, could you provide some additional info:
- Which exact version of Nsight Compute are you using?
- Which GPU is this on?
- Did you disable cache control or clock control while profiling, which could reduce stability across replay passes?
- Was this collected using kernel or application replay?
As a note, you can use a newer version of Nsight Compute even with applications using older CUDA toolkits in case you want to try 2021.2.
Thanks for the response.
As for the info, I’m using:
- Nsight Compute 2021.1
- Tesla T4, on an AWS EC2 instance (EDIT: g4dn.2xlarge)
- both cache and clock control were disabled, although I locked the SM/memory clock rates using nvidia-smi (roughly as sketched below)
- I didn’t use replay, although I’m not 100% sure what that means: I just did a “normal” non-interactive profile.
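For reference, the clock locking was done roughly along these lines (a sketch; the clock values are placeholders that have to be taken from the supported-clocks query, and locking the memory clock requires a driver/GPU combination that supports it):

    # list the clock rates the T4 supports
    nvidia-smi -q -d SUPPORTED_CLOCKS
    # pin the SM and memory clocks to fixed values (placeholders, in MHz)
    sudo nvidia-smi --lock-gpu-clocks=<sm_clock>,<sm_clock>
    sudo nvidia-smi --lock-memory-clocks=<mem_clock>,<mem_clock>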
I will try the latest version again. One question though: will the tooltip feature in the newer version still work if I load the ncu result file that was produced with 2021.1?
Thanks.
Thanks for the info. Cache control being disabled could definitely cause this, as these are multi-pass metrics, i.e. the kernel is executed multiple times and each time, some counters are collected. This means that counters generating one of the rates could be collected in one pass while the second metric in another. Without cache control, the performance behavior could significantly differ between the passes.
didn’t use replay, although not 100% sure what that means: I just did a “normal” non-interactive profiling.
If you didn’t configure anything, you used the default Kernel Replay, for which the individual kernel is re-run multiple times. Cache control can be important (depending on the requested metrics) in this mode. The alternative is Application Replay, where the whole app is re-run several times and counters are collected in each of these passes. For this mode, cache control can normally be disabled, assuming that the app produces stable initial cache contents itself.
You can change the replay mode using the respective settings in the Profile activity in the UI or using --replay-mode kernel/application on the command line.
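For example (a minimal sketch; the report name and application binary are placeholders):

    # default: kernel replay, each kernel is re-run in-process once per pass
    ncu --replay-mode kernel -o report ./my_app
    # application replay: the whole app is re-run once per pass
    ncu --replay-mode application -o report ./my_app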
will the tooltip feature in the newer version still work if I load up the ncu result file that was produced using the 2021.1 version?
As long as the metrics for the entries in the Memory Workload Analysis tables haven’t changed in between the two versions, it will work. For the header tables, it always works, as those metrics are defined by the .section files during collection, not by the UI. For your architecture and versions, there have been no changes for these metrics, so it will work.
Cache control being disabled could definitely cause this, as these are multi-pass metrics, i.e. the kernel is executed multiple times and each time, some counters are collected. This means that counters generating one of the rates could be collected in one pass while the second metric in another. Without cache control, the performance behavior could significantly differ between the passes.
Ah, I was not aware of the replay mechanism before, and now things make more sense. Since my focus is on accurate cache metrics and flushing the caches between every kernel launch would likely perturb the behavior I want to measure, I think Application Replay mode is what I want to go with.
Still, I just took a look at the replay options and am a little confused with what the “replay match” and the “replay buffer” options mean. Could you give some explanation to this?
Unless you have very limited disk space, you don’t need to change the replay buffer setting. By default it uses a temporary file to stream the intermediate results to/from disk.
Replay match decides how kernel instances are matched across replay passes. The documentation on application replay that I linked earlier has a description of it, including a schematic.
Across application replay passes, NVIDIA Nsight Compute matches metric data for the individual, selected kernel launches.
The matching strategy can be selected using the --app-replay-match option.
For matching, only kernels within the same process and running on the same device are considered.
By default, the grid strategy is used, which matches launches according to their kernel name and grid size.
When multiple launches have the same attributes (e.g. name and grid size), they are matched in execution order.
If your app is fairly deterministic, you won’t need to change this option either and the grid strategy will work.
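Spelled out on the command line, the defaults discussed above look roughly like this (a sketch; the application binary is a placeholder, and the buffer option may be spelled slightly differently depending on your version):

    # match kernel instances across passes by kernel name and grid size (the default),
    # streaming intermediate results through a temporary file on disk (also the default)
    ncu --replay-mode application --app-replay-match grid --app-replay-buffer file -o report ./my_app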
I see. Thanks for the guide and I’ll report back if the same problem happens with the new settings.
I reran the experiment with 2021.2.0 and Application Replay mode, but the discrepancy is still there. The tooltip does show me the formula for the hit rates; it looks like the number in the chart is taken directly from a performance counter (l1tex__t_sector_hit_rate.pct), whereas the number in the table sums the lookup counts to the different memory units (e.g. lsu_mem_local_..._lookup_hit.sum) to get the hit/miss numbers and does a simple hit/(hit+miss) calculation. However, this does not tell me exactly where the discrepancy is coming from.
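For reference, the underlying numbers can be pulled out of the report on the command line and compared by hand, roughly like this (a sketch; report.ncu-rep and ./my_app are placeholders):

    # dump every collected counter per kernel to CSV for manual comparison
    ncu --import report.ncu-rep --page raw --csv > raw.csv
    # or re-collect just the directly-counted hit rate as a sanity check
    ncu --metrics l1tex__t_sector_hit_rate.pct ./my_app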
Which one of the two do you think is more reliable? My guess is the .pct number, since it seems to come directly from a performance counter; is this true?
Also, do you think this is somehow related to the fact that this application utilizes the RT cores?
I checked with the team that creates our metrics library, and the primary difference between the two numbers appears to be that l1tex__t_sector_hit_rate.pct includes traffic coming from units other than the ones aggregated “manually” in the table.
With respect to the difference for L2, this is still under investigation. We still believe that this is caused by the fact that counters are collected over multiple passes (even in application replay).