Profiling one application having two concurrent kernels

Hello,

I am trying to profile a kernel inside an application that consists of two kernels. I tried the Nsight Compute range option, placing cudaProfilerStart before the kernel call and cudaProfilerStop just after it. I am getting an execution time of 230us.
However, when using nsys, I am getting 127us. Why is there a difference? In addition, am I able to check the cache hit/miss rates? I am setting the set option to full in Nsight Compute while using the range option, but I am not able to get this information.
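
For reference, the setup described above might look roughly like the following sketch. The kernel names (kernelA, kernelB), their bodies, the problem size, and the synchronization inside the range are all assumptions made for illustration, not the actual application code. The command lines in the comments are possible invocations and would need adapting to the real binary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>  // cudaProfilerStart / cudaProfilerStop

// Hypothetical stand-ins for the two kernels in the application.
__global__ void kernelA(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

__global__ void kernelB(float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] * 0.5f - 1.0f;
}

// Possible invocations (binary name ./app is an assumption):
//   ncu --set full --replay-mode range ./app
//   nsys profile --capture-range=cudaProfilerApi ./app
int main() {
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *x = nullptr, *y = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    // One stream per kernel so the launches are allowed to overlap.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // kernelB is launched outside the range; whether it actually overlaps
    // with kernelA depends on the GPU resources available at run time.
    kernelB<<<blocks, threads, 0, s2>>>(y, n);

    // Range around the kernel of interest. Synchronizing inside the range
    // is an assumption of this sketch, so that kernelA has finished
    // before the range is closed.
    cudaProfilerStart();
    kernelA<<<blocks, threads, 0, s1>>>(x, n);
    cudaStreamSynchronize(s1);
    cudaProfilerStop();

    cudaDeviceSynchronize();
    cudaError_t err = cudaGetLastError();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return err == cudaSuccess ? 0 : 1;
}
```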

Thank you

Is there a reason you are using ranges? Nsight Compute will automatically detect the kernel call and profile it without you needing to add any ranges. If you are trying to limit what is profiled, there are various filters and knobs in the connection dialog under Non-Interactive Profile > Filter tab to control that.

With respect to Nsight Systems vs. Compute - there are some details on how they differ in this thread: Cycles in nsight-compute and nsight-systems

You should be able to see cache hit rates in the Memory Workload Analysis section with the full or detailed metric set. If you don’t, please share a report and I can take a look.

Hello,

Yes, I tried kernel replay, and it gives the same results as if each kernel were executed alone. I read in the Nsight Compute documentation that kernel replay executes the kernels serially, whereas without profiling they should execute concurrently, since they are launched on different streams.

That’s what I have understood.

Thanks again

The timing you’re seeing in Nsight Compute is for the entire range, as opposed to any individual kernel. Is that also what you’re comparing in Nsight Systems, i.e. using the profile start/stop APIs or is Nsight Systems profiling the entire workload? Can you share screenshots of where you’re seeing 230us and 127us? That may clarify whether these are comparable.
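
To illustrate why a range duration and a single-kernel duration are not directly comparable, here is a minimal sketch that times both with CUDA events. This is only a stand-in: neither Nsight Compute nor Nsight Systems reports event timings, and the kernels (kernelA, kernelB), sizes, and stream layout are hypothetical. The point is that the range spans everything between its start and end markers, while the per-kernel time covers one launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the two kernels in the application.
__global__ void kernelA(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

__global__ void kernelB(float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] * 0.5f - 1.0f;
}

int main() {
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *x = nullptr, *y = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    cudaEvent_t rangeStart, rangeStop, aStart, aStop, bDone;
    cudaEventCreate(&rangeStart);
    cudaEventCreate(&rangeStop);
    cudaEventCreate(&aStart);
    cudaEventCreate(&aStop);
    cudaEventCreate(&bDone);

    cudaDeviceSynchronize();  // make sure setup work has finished

    // Open the "range" on s1 and make s2 wait for it, so that kernelB
    // cannot start before the range begins.
    cudaEventRecord(rangeStart, s1);
    cudaStreamWaitEvent(s2, rangeStart, 0);

    // Time kernelA on its own.
    cudaEventRecord(aStart, s1);
    kernelA<<<blocks, threads, 0, s1>>>(x, n);
    cudaEventRecord(aStop, s1);

    // kernelB runs on the other stream and may overlap with kernelA.
    kernelB<<<blocks, threads, 0, s2>>>(y, n);
    cudaEventRecord(bDone, s2);

    // Close the range only after both kernels have finished.
    cudaStreamWaitEvent(s1, bDone, 0);
    cudaEventRecord(rangeStop, s1);
    cudaEventSynchronize(rangeStop);

    float aMs = 0.0f, rangeMs = 0.0f;
    cudaEventElapsedTime(&aMs, aStart, aStop);
    cudaEventElapsedTime(&rangeMs, rangeStart, rangeStop);
    printf("kernelA alone: %.1f us\n", aMs * 1000.0f);
    printf("whole range:   %.1f us\n", rangeMs * 1000.0f);

    cudaEventDestroy(rangeStart);
    cudaEventDestroy(rangeStop);
    cudaEventDestroy(aStart);
    cudaEventDestroy(aStop);
    cudaEventDestroy(bDone);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Comparing the two printed values on your own hardware should show whether the gap between a range measurement and a per-kernel measurement is of the same order as the 230us vs. 127us difference.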

For cache hit/miss, you should still see that in the Memory Workload Analysis section even if you use a range (see below). What do you see in this section when you run your range profile in Nsight Compute?