Hello
I have a Deep Learning model written in CUDA C++.
The model's layers are called inside a for loop.
I am looking for a way to get the runtime and kernel metrics for each layer individually, using CUPTI or any other tool that makes this possible.
For example, the calling code could look like this:
…
    for (idx = 0; idx < numLayers; idx++) {
        someCpuFunc();
        DLModel.predict(in[idx], out[idx]);
        someCpuFunc();
    }
…
In the above example, the function DLModel.predict(in, out) is as follows:
…
    void predict(double* in, double* out) {
        someCpuFunc();
        // cudaMalloc calls
        // cudaMemcpy calls
        layerKernel<<<G,B,T>>>(args);
        someCpuFunc();
    }
…
As you can see, the kernel is launched once per layer. What I want is the runtime of the kernel for each launch, along with the metrics for that launch.
Could someone recommend a way this can be accomplished?
Should I use Nsight Compute or Nsight Systems for this?
Also, if I add NVTX instrumentation so that an NVTX range surrounds the predict() function shown above, would it give me the exact runtime of the kernels? I ask because CPU and GPU activity are intermingled inside predict().
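To make the NVTX part of the question concrete, here is a minimal sketch of the kind of instrumentation I have in mind. It is not my real code: the kernel body, the sizes, the extra `n` and `layerIdx` parameters, and the `runAllLayers` helper are all placeholders I made up for illustration, and error checking is left out. It puts one range around the whole of predict() and a nested range around just the kernel launch. As far as I understand, an NVTX range only marks host-side time between push and pop, and kernel launches are asynchronous, so I added a synchronize inside the inner range. Even so, that range would still include launch overhead rather than only the kernel's execution time on the GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nvtx3/nvToolsExt.h>  // header-only NVTX v3 from the CUDA toolkit

// Placeholder for layerKernel: doubles each input element.
__global__ void layerKernel(const double* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0 * in[i];
}

// Placeholder for predict(): outer range covers everything in the call,
// inner range covers only the kernel launch and the wait for it to finish.
void predict(const double* hIn, double* hOut, int n, int layerIdx) {
    char name[32];
    snprintf(name, sizeof(name), "predict layer %d", layerIdx);
    nvtxRangePushA(name);  // outer range: allocations, copies, and kernel

    double* dIn = nullptr;
    double* dOut = nullptr;
    size_t bytes = n * sizeof(double);
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);

    nvtxRangePushA("layerKernel");  // inner range: launch + synchronize
    layerKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    // The launch returns immediately, so wait here; otherwise the range
    // may close before the kernel has finished running.
    cudaDeviceSynchronize();
    nvtxRangePop();  // closes "layerKernel"

    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    nvtxRangePop();  // closes "predict layer N"
}

// Placeholder driver mirroring the per-layer loop; returns how many
// layers produced the expected value in the last output element.
int runAllLayers(int numLayers) {
    const int n = 1024;
    static double in[n], out[n];
    int ok = 0;
    for (int idx = 0; idx < numLayers; idx++) {
        for (int i = 0; i < n; i++) in[i] = idx + i;
        predict(in, out, n, idx);
        if (out[n - 1] == 2.0 * in[n - 1]) ok++;
    }
    return ok;
}
```

My plan would be to call something like runAllLayers(numLayers) from main, build with nvcc, and look at the ranges in the Nsight Systems timeline. What I am unsure about is whether these ranges are the right way to get the per-layer kernel time, or whether I should rely on the kernel durations the profiler reports and use the ranges only to tell the layers apart.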
I would really appreciate any help with this.
Looking forward to your replies :)
Thanks
Lakshay