GPU CUDA cores or Tensor cores

Hi,
I’m using a 3x512x512 input tensor for a convolution layer.

When the kernel is 7x7, it runs at ~2.5ms and nvprof shows this:
GPU activities: 100.00% 26.878ms 10 2.6878ms 2.5754ms 2.9829ms void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1>, float, float, int=3, int=4, int=1, int=7, int=7, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)

When the kernel is 5x5, it runs at ~1.2ms and nvprof shows this:
GPU activities: 100.00% 11.011ms 10 1.1011ms 1.0990ms 1.1058ms trt_volta_scudnn_128x32_relu_small_nn_v1

  • How do I know which implementation runs on the CUDA cores and which on Tensor cores?

  • Why the different implementation paths? Does it matter, or does scudnn mean it runs on the CUDA cores (slower on Xavier) and fusedConvolutionReluKernel on the Tensor cores?

    I understand that 5x5 should run faster. I’m mostly baffled as to how to know I’m using the Tensor cores and not the slower CUDA cores.

thanks
Eyal

Hi,

1. Tensor Cores support two types of operations: HMMA and IMMA.
https://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference-7x

Please profile it with nvprof and check the following metrics:

tensor_precision_fu_utilization : The utilization level of the multiprocessor function units that execute tensor core instructions, on a scale of 0 to 10. (HMMA)
tensor_int_fu_utilization : The utilization level of the multiprocessor function units that execute tensor core int8 instructions, on a scale of 0 to 10. This metric is only available for devices with compute capability 7.2. (IMMA)
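A sketch of how this check could look in practice. The application name `./my_app` and the sample metric output below are illustrative assumptions, not captured from a real run; any utilization level above 0 indicates the kernel issued Tensor Core instructions:

```shell
# Collect the Tensor Core utilization metrics with nvprof (hypothetical app name):
#   nvprof --metrics tensor_precision_fu_utilization,tensor_int_fu_utilization ./my_app
#
# Illustrative sample of the metric lines nvprof might report:
sample_output="tensor_precision_fu_utilization  Low (3)
tensor_int_fu_utilization        Idle (0)"

# If either metric reports a level above 0, the kernel used Tensor Cores.
level=$(printf '%s\n' "$sample_output" | grep -o '([0-9]*)' | tr -d '()' | sort -n | tail -1)
if [ "$level" -gt 0 ]; then
  echo "Tensor Cores in use (utilization level $level)"
else
  echo "CUDA cores only"
fi
```

A level of 0 on both metrics for a kernel means it ran entirely on the regular CUDA cores.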

2. In general, TensorRT or cuDNN will automatically select an optimized implementation for you.
This selection takes the GPU layout and capacity into consideration, so different input sizes may lead to different algorithms.

Thanks.