I want to profile a PyTorch model during deep learning inference with Nsight Compute and get performance information for each layer or operator. I found that a torch.nn.Conv2d operator was decomposed into three kernels, i.e. computeOffsetsKernel, volta_scudnn_128x32_relu_interior_nn_v1 and unrolled_elementwise_kernel. I would like to know what these kernels do and where I can find details about them. Also, is a convolution operator always decomposed into these three kernels when it is launched on the hardware? I need a clear understanding of these mechanisms because I want detailed performance metrics for each layer or op in a PyTorch model during inference, rather than at kernel granularity. Thank you so much!
To be precise, I want to get SM occupancy, memory bandwidth and other performance metrics at the op level, not the kernel level.
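To make the goal concrete, here is a minimal sketch of the kind of op-level roll-up I have in mind. It assumes each profiled kernel can somehow be tagged with the op that launched it (for example through an NVTX range around the op). The record fields, the op labels and all numbers are hypothetical placeholders, not real Nsight Compute output, and the time-weighted averaging is only one possible way to combine per-kernel values.

```python
from collections import defaultdict

# Hypothetical per-kernel records. Field names and values are made-up
# placeholders for illustration, not measurements or a real export format.
kernels = [
    {"op": "conv1", "kernel": "computeOffsetsKernel",
     "duration_us": 2.0, "occupancy_pct": 20.0, "dram_bytes": 1_000},
    {"op": "conv1", "kernel": "volta_scudnn_128x32_relu_interior_nn_v1",
     "duration_us": 90.0, "occupancy_pct": 50.0, "dram_bytes": 800_000},
    {"op": "conv1", "kernel": "unrolled_elementwise_kernel",
     "duration_us": 8.0, "occupancy_pct": 80.0, "dram_bytes": 199_000},
    {"op": "relu1", "kernel": "vectorized_elementwise_kernel",
     "duration_us": 4.0, "occupancy_pct": 75.0, "dram_bytes": 400_000},
]


def aggregate_by_op(records):
    """Roll kernel-level records up into one summary per op label."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["op"]].append(rec)

    summary = {}
    for op, recs in groups.items():
        total_us = sum(r["duration_us"] for r in recs)
        total_bytes = sum(r["dram_bytes"] for r in recs)
        # Weight each kernel's occupancy by its duration so that short
        # helper kernels do not dominate the op-level figure.
        weighted_occ = sum(r["occupancy_pct"] * r["duration_us"] for r in recs)
        summary[op] = {
            "kernel_count": len(recs),
            "duration_us": total_us,
            "occupancy_pct": weighted_occ / total_us,
            # bytes per microsecond equals MB/s; divide by 1e3 for GB/s.
            "dram_gb_per_s": total_bytes / total_us / 1e3,
        }
    return summary


if __name__ == "__main__":
    for op, stats in aggregate_by_op(kernels).items():
        print(op, stats)
```

This ignores any idle time between the kernels of one op, so the bandwidth figure covers only the time the kernels were actually running.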
@mstrengert, can you find someone to answer this?