Is serialization unavoidable while profiling L2 cache miss rates for concurrent kernels with Nsight Compute?

Hardware: GTX 1650 Ti (Turing, CC 7.5)
OS: Windows

I’m profiling L2 cache contention between two concurrent kernels launched on separate streams (so they can share the same context, since I am not using NVIDIA MPS). I want to measure the difference in L2 miss rates between the victim running alone and the victim running alongside an enemy kernel that performs pointer chasing through L2.

I have two experimental scenarios:

  1. Baseline: the victim kernel runs alone (I measure its baseline L2 miss rate)
  2. Contention: the victim runs concurrently with the enemy (here I expect a higher miss rate)
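For context, the contention scenario looks roughly like this. This is a minimal sketch, not my actual code: the kernel bodies, names (`victim_kernel`, `enemy_kernel`), and sizes are simplified placeholders.

```cuda
// Minimal sketch of the contention scenario; kernel bodies are placeholders.
#include <cuda_runtime.h>

__global__ void victim_kernel(float *data, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;                // stand-in for the real workload
}

__global__ void enemy_kernel(size_t *chain, size_t steps) {
    size_t idx = 0;
    for (size_t s = 0; s < steps; ++s)
        idx = chain[idx];                      // pointer chase to thrash L2
    if (idx == (size_t)-1) chain[0] = idx;     // keep the loop from being optimized out
}

int main() {
    const size_t N = 1 << 24;
    float  *data;  cudaMalloc(&data,  N * sizeof(float));
    size_t *chain; cudaMalloc(&chain, N * sizeof(size_t));
    cudaMemset(chain, 0, N * sizeof(size_t));  // trivial chain, just for the sketch

    cudaStream_t sVictim, sEnemy;
    cudaStreamCreate(&sVictim);
    cudaStreamCreate(&sEnemy);

    // Launch the enemy first so it is already resident when the victim starts.
    enemy_kernel<<<1, 1, 0, sEnemy>>>(chain, 1 << 20);
    victim_kernel<<<(unsigned)((N + 255) / 256), 256, 0, sVictim>>>(data, N);

    cudaDeviceSynchronize();
    cudaFree(data); cudaFree(chain);
    return 0;
}
```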

So the expected behavior is that the victim experiences more L2 cache misses in the concurrent scenario, because the enemy kernel continuously evicts its cache lines from L2.

I am observing execution-time degradation, and I am confident it comes from L2 eviction rather than SM contention, because I assign distinct SMs to the enemy and the victim. My problem is with Nsight Compute.

My question: is it feasible to use NCU to profile the victim kernel’s L2 miss metrics (lts__t_sectors_lookup_miss, etc.) while the enemy runs truly concurrently on a separate stream?
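Concretely, the command I am trying to use looks like the following (the kernel and application names are placeholders for my actual ones):

```shell
# Profile only the victim kernel while the enemy runs on another stream.
# lts__t_sectors_lookup_miss / lts__t_sectors_lookup_hit are the L2 (LTS)
# lookup counters I am interested in.
ncu --kernel-name victim_kernel \
    --metrics lts__t_sectors_lookup_miss.sum,lts__t_sectors_lookup_hit.sum \
    ./contention_app
```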

My results have been unstable (for a long time they showed the expected increase in misses under contention, but now they show the opposite pattern). I’m unsure whether this is due to:

  • NCU serializing the kernels during profiling
  • cache state not being properly reset between runs, even though I flush L2
  • or simply an incorrect profiling methodology for concurrent execution on my part

Any guidance on the correct way to profile L2 cache interference between concurrent kernels would be greatly appreciated.

I would recommend profiling each kernel in isolation using Kernel Replay, and then defining a range around the two combined grid launches and running with Range Replay. This will let you see the impact when they run concurrently. There is no method to filter the PM counters when running concurrently, so you cannot correlate hits/misses/requests to the target grid vs. the enemy grid.
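One way to define the range is with the CUDA profiler start/stop API. A sketch, assuming your victim/enemy launches and streams from the question (the launch parameters are elided placeholders):

```cuda
// Sketch: wrap both grid launches in a profiler-defined range so NCU can
// profile them together with --replay-mode range.
#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

void run_contention_range(cudaStream_t sVictim, cudaStream_t sEnemy) {
    cudaProfilerStart();                  // range begins here
    // enemy_kernel<<<..., 0, sEnemy>>>(...);    // enemy on its own stream
    // victim_kernel<<<..., 0, sVictim>>>(...);  // victim on its own stream
    cudaDeviceSynchronize();              // both grids must complete inside the range
    cudaProfilerStop();                   // range ends here
}
```

Then collect with `ncu --replay-mode range ./contention_app`. The counters will cover the whole range, i.e. both grids combined.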

Alternatively, you can try collecting the serialized kernels with Application Replay and --cache-control=none to ensure NCU does not change the final cache state. This will not give you insight into the concurrent impact, though.
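For that alternative, the invocation would look something like this (kernel and application names are placeholders):

```shell
# Serialized collection via Application Replay; --cache-control=none tells
# NCU not to flush or otherwise alter cache state between passes.
ncu --replay-mode application --cache-control=none \
    --kernel-name victim_kernel \
    --metrics lts__t_sectors_lookup_miss.sum \
    ./contention_app
```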

Actually, I am specifying the --kernel-name of the target kernel (not the pointer-chaser kernel) in the command. Even when profiling only that one kernel, can the metrics of the two kernels still be mixed, and the contention not captured?