Getting layer-specific kernel metrics in a DL application


I have a Deep Learning model written in CUDA C++.

The layers in the model are called within a for loop.

I am looking for a way to get the layer-specific runtime and kernel metrics for each layer, using either CUPTI or any other approach that would make this possible.

For example, the code snippet could look like this:

for (int idx = 0; idx < numLayers; idx++) {
    DLModel.predict(idx);
}
In the above example, the function DLModel.predict(idx) is as follows:

void predict(double* in, double* out) {

    // cudaMalloc calls

    // cudaMemcpy calls

    // kernel launch for this layer
    // ...
}

In the above example, you can see that a kernel is launched at each layer. What I want is the runtime of this kernel at each call, along with the metrics relevant to that call.

Could someone recommend a way this can be accomplished?

Should I use Nsight Compute or Nsight Systems for this?

Also, if I add NVTX instrumentation, with an NVTX range surrounding the predict() function shown above, would it give me the exact runtime of the kernels? I ask because I have a lot of CPU and GPU activity intermingled inside the predict() function.

Would really appreciate help on this.

Looking forward to this :)


Nsight Systems will give you the accurate runtime of all kernels and API calls. If you annotate your code with NVTX, it will allow you to distinguish the kernels by NVTX context.
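To make the annotation concrete, a minimal sketch could look like the following (assuming the NVTX C API from nvToolsExt; the per-layer loop and DLModel.predict() are taken from your snippet above, and the range names are made up):

```cuda
#include <nvToolsExt.h>  // NVTX C API; link with -lnvToolsExt
#include <string>

// Sketch: wrap each layer's predict() call in an NVTX range so that
// Nsight Systems can attribute the launched kernels to that layer.
for (int idx = 0; idx < numLayers; idx++) {
    std::string rangeName = "layer_" + std::to_string(idx);
    nvtxRangePushA(rangeName.c_str());  // open range on this CPU thread
    DLModel.predict(idx);               // kernels launched here get tagged
    nvtxRangePop();                     // close range
}
```

Nsight Systems will then show one range per layer in the timeline, with the kernels launched inside it attributed to that range.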

For collecting performance metrics, you would use Nsight Compute. Similarly to Nsight Systems, it allows you to filter/view the data by NVTX context, meaning you can collect metrics for e.g. all kernels, a certain number of kernels, all kernels within a specific NVTX range, etc. More details can be found in the Nsight Compute documentation, e.g. the section on filtering by NVTX context.

Nsight Compute is also able to collect the kernel runtime, but it takes some actions to ensure the most deterministic metric collection possible. This includes flushing all caches and locking the clocks to their base values by default. In addition, it serializes all kernel launches. This can change the measured kernel runtime compared to the real application execution, or compared to collecting this information with Nsight Systems.


Thank you for the reply.

This information is quite helpful, but my main question was how to account for the CPU functions that are called within the predict() function I showed above.

Nsight Compute or Nsight Systems, even with an NVTX range around the predict() function, would give me the metrics/runtime of the CUDA kernels and other GPU activities, but would not account for the CPU functions. So I am asking whether there is a way to get the total runtime of the predict() function in the most accurate sense: starting from the call to the first kernel, including the CPU functions used in between multiple kernel calls, until the end.

Thanks for the help :)


NVTX ranges are tracked and timed on the CPU. This means that if you create an NVTX range around the predict() function, its duration will measure the runtime of the predict() function fairly well. Since the overhead for collecting the NVTX range is more or less constant, the precision depends on the runtime of the predict() function (i.e. the longer the function runs, the more precisely the NVTX range can measure it).

What you would consider the runtime of this function with respect to CPU and GPU work depends on whether all GPU work is synchronous with the CPU. If you synchronize your kernels within this function, so that they are guaranteed to finish before the CPU returns from the function, you can simply measure the CPU runtime of this function and be done. If you don't synchronize your kernels, measuring the runtime of the predict() function doesn't seem to make much sense.
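As a sketch of that synchronous measurement (cudaDeviceSynchronize() and std::chrono are real APIs; the predict() call site is the one from your question):

```cuda
#include <cuda_runtime.h>
#include <chrono>

// Sketch: measure one predict() call including its CPU work, by forcing
// all GPU work to finish before the end timestamp is taken.
auto start = std::chrono::steady_clock::now();

DLModel.predict(idx);     // launches kernels asynchronously
cudaDeviceSynchronize();  // block until all launched GPU work has finished

auto end = std::chrono::steady_clock::now();
double ms = std::chrono::duration<double, std::milli>(end - start).count();
```

The cudaDeviceSynchronize() is what makes the measured interval cover both the CPU functions and the GPU kernels launched inside predict().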


Thanks so much for this.

However, I had a follow up question on this.

When you say that an NVTX range around the predict() function can measure its runtime fairly well, you mean this can be done with Nsight Systems, right?

In my understanding, when I profile to get these metrics and runtime of the predict() function, I would pass the flag to include this range for profiling.

For example,

nsys profile --nvtx-include "myRange" ./myApp

would help me get the runtime of the kernels that fall in this range ("myRange"). But how am I getting the CPU activity here?

As for your other suggestion to synchronize the GPU calls with the CPU ones: that is not acceptable, as synchronizing these calls at each layer would create a lot of overhead.

I am sorry for the trouble, but I am just trying to get the perfect solution.



Just wanted to reach out and ask if there is an update on this?


Hi Lakshay,

When using Nsight Systems, NVTX calls are recorded with timestamps so you can see when you called them on the CPU in the timeline. In addition, the tool figures out for each kernel execution on the GPU which CPU call launched that kernel, and what the deepest-nested NVTX range was on that thread during that CPU call – this NVTX range is shown in a row next to the kernel execution rows for convenience. Nsight Systems is mostly a trace tool to give you a non-invasive recording of when CPU & GPU activity starts and stops – it does not do GPU profiling or any capturing of GPU metrics like Nsight Compute does.
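As an aside, if you ever need per-kernel GPU runtimes programmatically, without running a profiler at all, CUDA events are one option. This is only a sketch; layerKernel and its launch configuration are placeholders, not part of your original code:

```cuda
#include <cuda_runtime.h>

// Sketch: measure the GPU runtime of a single kernel with CUDA events.
// Events are timestamped on the device, so the result reflects GPU time
// only, regardless of CPU activity in between.
cudaEvent_t startEvt, stopEvt;
cudaEventCreate(&startEvt);
cudaEventCreate(&stopEvt);

cudaEventRecord(startEvt);                  // enqueue start timestamp
layerKernel<<<grid, block>>>(d_in, d_out);  // placeholder kernel launch
cudaEventRecord(stopEvt);                   // enqueue stop timestamp

cudaEventSynchronize(stopEvt);              // wait until the kernel is done
float ms = 0.0f;
cudaEventElapsedTime(&ms, startEvt, stopEvt);  // elapsed time in milliseconds

cudaEventDestroy(startEvt);
cudaEventDestroy(stopEvt);
```

Note that only the event synchronization blocks the CPU here, so this is less invasive than a full cudaDeviceSynchronize() after every layer.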

Hope that helps!