Profiling a PyTorch model using Nsight Compute (ncu)

I want to profile a PyTorch model during DL inference using Nsight Compute and get performance information about each layer or operator. I found that a torch.nn.Conv2d operator was decomposed into three kernels: computeOffsetsKernel, volta_scudnn_128x32_relu_interior_nn_v1, and unrolled_elementwise_kernel. I would like to know what these kernels mean and where I can find more details about them. Also, would a CONV operator always be decomposed into these three kernels when launched on the hardware? I need a clear understanding of these mechanisms because I want detailed performance metrics for each layer or op in the PyTorch model during inference, rather than kernel-grained metrics. Thank you so much!
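
For reference, here is roughly the kind of minimal repro I am profiling. The script and the launch command below are only a sketch of my setup (the conv shapes are illustrative, and the ncu flags may differ slightly between Nsight Compute versions):

```python
# Launched roughly as:
#   ncu -o conv_report --set full python conv_infer.py
# (flag names may vary between Nsight Compute versions)
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    # a few warm-up iterations so cuDNN settles on an algorithm
    for _ in range(3):
        model(x)
    torch.cuda.synchronize()

    y = model(x)  # the Conv2d forward pass I am profiling
    torch.cuda.synchronize()
```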


To be precise, I want to get SM occupancy, memory bandwidth, and other performance metrics at the op level, not the kernel level.
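
At the kernel level I can already collect metrics along these lines; this is just a sketch (the metric and flag names are taken from my Nsight Compute version and may differ in yours). What I am missing is a way to aggregate them per operator:

```python
# Sketch: collect kernel-level occupancy / bandwidth metrics with ncu.
# Metric and flag names may differ between Nsight Compute versions.
import subprocess

metrics = ",".join([
    "sm__warps_active.avg.pct_of_peak_sustained_active",  # achieved occupancy
    "dram__bytes.sum.per_second",                          # DRAM bandwidth
    "gpu__time_duration.sum",                              # kernel duration
])

subprocess.run(
    ["ncu", "--metrics", metrics, "--target-processes", "all",
     "python", "conv_infer.py"],  # conv_infer.py is the repro script above
    check=True,
)
```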

@mstrengert can you find someone to answer this?


In general, Nsight Compute reports performance data at the kernel level, but we recently added a feature called Range Replay that lets you mark up a larger region of code and get performance metrics for that region as a whole. It may be useful to create a range around an operator, though I have to admit I haven't tried that myself.
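
As a rough sketch of what I have in mind (I have not verified this exact combination myself, so please check the Range Replay section of the Nsight Compute documentation for your version), you could open a profiler range around a single operator and replay that range:

```python
# Sketch: define a range around one operator via cudaProfilerStart/Stop so
# Range Replay can report metrics for the whole range rather than per kernel.
# A possible invocation (check the docs for your ncu version):
#   ncu --replay-mode range --profile-from-start off -o op_range python op_range.py
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    for _ in range(3):              # warm-up outside the range
        conv(x)
    torch.cuda.synchronize()

    torch.cuda.profiler.start()     # cudaProfilerStart -> range begins
    conv(x)                         # all kernels of this op fall inside the range
    torch.cuda.synchronize()
    torch.cuda.profiler.stop()      # cudaProfilerStop -> range ends
```

NVTX push/pop ranges (for example torch.cuda.nvtx.range_push/range_pop combined with ncu's NVTX filtering options) should be another way to define the range boundaries, if that fits your code better.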

With respect to "would a CONV operator always be decomposed into these three kernels when launched on the hardware?": that seems like more of a question for the PyTorch team. Nsight Compute can only show what happened dynamically, not what will happen every time. The same goes for what the kernels mean; if I understand the question correctly, that is one for the PyTorch implementers.

Let me know if that helps or if there is any other information I can provide.