Low utilization of Tensor RT cores

Hi @kirpasaccessory,

  1. The “SM Warp Occupancy” field in the Nsight Systems (nsys) screenshots does NOT indicate whether Tensor Cores are used. Based on our investigation, all the convolutions and deconvolutions are running on Tensor Cores. You can verify this by looking for the “h16816gemm” and “tensor16x8x16” keywords in the kernel names.
  2. Currently, the two DepthToSpace layers take ~30% of the end-to-end runtime on an A100 (which is similar to your RTX A6000). This is because the DepthToSpace op requires a large amount of data movement. The recommended way to do upscaling is to use a Resize layer (either nearest-neighbor or bilinear); its performance will be much better.
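
To make the check in point 1 concrete, here is a minimal Python sketch that sorts a list of kernel names by whether they contain the two keywords mentioned above. The helper names (`uses_tensor_cores`, `split_kernels`) and the sample kernel names are made up for illustration; in practice you would paste in the kernel names copied from your own profiler report. The keyword list covers only the two patterns quoted in this reply, so a kernel that does not match is not necessarily running without Tensor Cores.

```python
# Keywords quoted in the reply above. This list is an assumption of
# completeness for this model only; other Tensor Core kernels may be
# named differently.
TENSOR_CORE_KEYWORDS = ("h16816gemm", "tensor16x8x16")


def uses_tensor_cores(kernel_name: str) -> bool:
    """Return True if the kernel name contains one of the known keywords."""
    name = kernel_name.lower()
    return any(keyword in name for keyword in TENSOR_CORE_KEYWORDS)


def split_kernels(kernel_names):
    """Split kernel names into (keyword match, no keyword match)."""
    matched = [n for n in kernel_names if uses_tensor_cores(n)]
    unmatched = [n for n in kernel_names if not uses_tensor_cores(n)]
    return matched, unmatched


# Hypothetical kernel names, for illustration only.
example_kernels = [
    "example_conv_fprop_f16_tilesize128x128x32_tensor16x8x16_kernel",
    "example_h16816gemm_128x128_tn",
    "example_depth_to_space_copy_kernel",
]

matched, unmatched = split_kernels(example_kernels)
print("Matched Tensor Core keywords:")
for name in matched:
    print("  ", name)
print("No keyword match:")
for name in unmatched:
    print("  ", name)
```

With the sample list, the first two names land in the matched group and the DepthToSpace-style copy kernel does not, which mirrors the split described in points 1 and 2.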

Thank you.

Wait a minute, do you mean that the “Tensor Active” row in the profiler doesn’t represent “utilization of tensor cores”? As you can see, the tensor cores are not 100% busy over one inference batch, and I can see plenty of gaps (in time) between periods of activity. To me, that means the tensor cores are not fully busy.