Is serialization unavoidable while profiling L2 cache miss rates for concurrent kernels with Nsight Compute?

Hardware: GTX 1650 Ti (Turing, CC 7.5)
OS: Windows

I’m profiling L2 cache contention between two concurrent kernels launched on separate streams (so they can share the same context, since I am not using NVIDIA MPS). I want to measure the difference in L2 miss rates between the victim running alone and the victim running alongside an enemy kernel that performs pointer chasing through L2.

I have two experimental scenarios:

  1. Baseline: the victim kernel runs alone (I measure its baseline L2 miss rate)
  2. Contention: the victim runs concurrently with the enemy (here I expect a higher miss rate)
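For context, the contention scenario looks roughly like this. This is a minimal sketch, not my actual code: the kernel bodies, names (`victim_kernel`, `enemy_kernel`), and sizes are simplified placeholders.

```cuda
// Minimal sketch of the contention scenario; kernel bodies are placeholders.
#include <cuda_runtime.h>

__global__ void victim_kernel(float *data, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;                // stand-in for the real workload
}

__global__ void enemy_kernel(size_t *chain, size_t steps) {
    size_t idx = 0;
    for (size_t s = 0; s < steps; ++s)
        idx = chain[idx];                      // pointer chase to thrash L2
    if (idx == (size_t)-1) chain[0] = idx;     // keep the loop from being optimized out
}

int main() {
    const size_t N = 1 << 24;
    float  *data;  cudaMalloc(&data,  N * sizeof(float));
    size_t *chain; cudaMalloc(&chain, N * sizeof(size_t));
    cudaMemset(chain, 0, N * sizeof(size_t));  // trivial chain, just for the sketch

    cudaStream_t sVictim, sEnemy;
    cudaStreamCreate(&sVictim);
    cudaStreamCreate(&sEnemy);

    // Launch the enemy first so it is already resident when the victim starts.
    enemy_kernel<<<1, 1, 0, sEnemy>>>(chain, 1 << 20);
    victim_kernel<<<(unsigned)((N + 255) / 256), 256, 0, sVictim>>>(data, N);

    cudaDeviceSynchronize();
    cudaFree(data); cudaFree(chain);
    return 0;
}
```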

So the expected behavior is that the victim experiences more L2 cache misses in the concurrent scenario, because the enemy kernel continuously evicts its cache lines from L2.

I am observing execution-time degradation, and I am confident it comes from L2 eviction rather than SM contention, because I assign distinct SMs to the enemy and the victim. My problem is with Nsight Compute.

My question: is it feasible to use NCU to profile the victim kernel’s L2 miss metrics (lts__t_sectors_lookup_miss, etc.) while the enemy runs truly concurrently on a separate stream?
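Concretely, the command I am trying to use looks like the following (the kernel and application names are placeholders for my actual ones):

```shell
# Profile only the victim kernel while the enemy runs on another stream.
# lts__t_sectors_lookup_miss / lts__t_sectors_lookup_hit are the L2 (LTS)
# lookup counters I am interested in.
ncu --kernel-name victim_kernel \
    --metrics lts__t_sectors_lookup_miss.sum,lts__t_sectors_lookup_hit.sum \
    ./contention_app
```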

My results have been unstable (for a long time they showed the expected increase in misses under contention, but now they show the opposite pattern). I’m unsure whether this is due to:

  • NCU serializing the kernels during profiling
  • cache state not being properly reset between runs, even though I flush L2
  • or simply an incorrect profiling methodology for concurrent execution on my part

Any guidance on the correct way to profile L2 cache interference between concurrent kernels would be greatly appreciated.

I would recommend profiling each kernel in isolation using Kernel Replay, and then defining a range around the two combined grid launches and running with Range Replay. This will let you see the impact when they run concurrently. There is no method to filter the PM counters when running concurrently, so you cannot correlate hits/misses/requests to the target grid vs. the enemy grid.
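One way to define the range is with the CUDA profiler start/stop API. A sketch, assuming your victim/enemy launches and streams from the question (the launch parameters are elided placeholders):

```cuda
// Sketch: wrap both grid launches in a profiler-defined range so NCU can
// profile them together with --replay-mode range.
#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

void run_contention_range(cudaStream_t sVictim, cudaStream_t sEnemy) {
    cudaProfilerStart();                  // range begins here
    // enemy_kernel<<<..., 0, sEnemy>>>(...);    // enemy on its own stream
    // victim_kernel<<<..., 0, sVictim>>>(...);  // victim on its own stream
    cudaDeviceSynchronize();              // both grids must complete inside the range
    cudaProfilerStop();                   // range ends here
}
```

Then collect with `ncu --replay-mode range ./contention_app`. The counters will cover the whole range, i.e. both grids combined.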

Alternatively, you can try collecting the serialized kernels with Application Replay and --cache-control=none to ensure NCU does not change the final cache state. This will not give you insight into the concurrent impact, though.
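For that alternative, the invocation would look something like this (kernel and application names are placeholders):

```shell
# Serialized collection via Application Replay; --cache-control=none tells
# NCU not to flush or otherwise alter cache state between passes.
ncu --replay-mode application --cache-control=none \
    --kernel-name victim_kernel \
    --metrics lts__t_sectors_lookup_miss.sum \
    ./contention_app
```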

Actually, I am specifying the --kernel-name of the target kernel (not the pointer-chaser kernel) in the command. Even when profiling only that one kernel, can the metrics of the two kernels still be mixed, and the contention not captured?