- The “SM Warp Occupancy” field in the Nsys screenshots does NOT indicate whether Tensor Cores are used. Based on our investigation, all the convs and deconvs are already using Tensor Cores. You can verify this by looking for the “h16816gemm” and “tensor16x8x16” keywords in the kernel names.
- Currently, the two DepthToSpace layers take ~30% of the e2e runtime on A100 (which is similar to the user’s RTX A6000). This is because the DepthToSpace op requires a large amount of data movement. The recommended way to do upscaling is to use a Resize layer (either NearestNeighbor or Bilinear), which should perform much better.
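
To illustrate the data-movement point, here is a minimal NumPy sketch (not TensorRT code) of what DepthToSpace computes, assuming NCHW layout and the ONNX `DCR` mode. The helper names `depth_to_space_dcr` and `nearest_resize` are made up for this example. The transpose in the middle relocates every element between the channel and spatial dimensions, whereas a nearest-neighbor resize only replicates each value into its neighbors.

```python
import numpy as np


def depth_to_space_dcr(x: np.ndarray, block: int) -> np.ndarray:
    """Hypothetical reference: (N, C*block*block, H, W) -> (N, C, H*block, W*block).

    Follows the ONNX DepthToSpace definition for mode="DCR".
    """
    n, c, h, w = x.shape
    c_out = c // (block * block)
    # Split the channel dimension into (block, block, c_out).
    t = x.reshape(n, block, block, c_out, h, w)
    # Interleave the two block dimensions with H and W. This permutation
    # touches every element, which is where the data movement comes from.
    t = t.transpose(0, 3, 4, 1, 5, 2)  # (n, c_out, h, block, w, block)
    return t.reshape(n, c_out, h * block, w * block)


def nearest_resize(x: np.ndarray, scale: int) -> np.ndarray:
    """Hypothetical reference: nearest-neighbor upscaling of an NCHW tensor."""
    # Each pixel is simply repeated `scale` times along H and then along W.
    return x.repeat(scale, axis=2).repeat(scale, axis=3)


if __name__ == "__main__":
    x = np.arange(1 * 4 * 2 * 2, dtype=np.float32).reshape(1, 4, 2, 2)
    print(depth_to_space_dcr(x, 2).shape)
    print(nearest_resize(x[:, :1], 2).shape)
```

Note that the two ops are not numerically equivalent: DepthToSpace consumes `C * block * block` input channels, while Resize keeps the channel count unchanged, so switching to Resize changes the network and would normally require adjusting the preceding layer and retraining.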