Hi,
I’m using a 3x512x512 input tensor for a convolution layer.
When the kernel is 7x7 it runs at ~2.5ms and nvprof shows this:
GPU activities: 100.00% 26.878ms 10 2.6878ms 2.5754ms 2.9829ms void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1>, float, float, int=3, int=4, int=1, int=7, int=7, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)
When the kernel is 5x5 it runs at ~1.2ms and nvprof shows this:
GPU activities: 100.00% 11.011ms 10 1.1011ms 1.0990ms 1.1058ms trt_volta_scudnn_128x32_relu_small_nn_v1
- How do I know which implementation runs on the CUDA cores and which on the Tensor Cores?
- Why the different implementation paths? Does it matter, or does "scudnn" mean it runs on the CUDA cores (slower on Xavier) and fusedConvolutionReluKernel on the Tensor Cores?
I understand that 5x5 should run faster - I'm mostly baffled as to how to tell whether I'm using the Tensor Cores or the slower CUDA cores.
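For reference, my understanding is that the Tensor Cores only come into play when FP16 (or INT8) mode is enabled at build time. Below is a minimal sketch of how I believe that flag is set with the TensorRT C++ builder API (network definition omitted, and the exact calls may differ between TensorRT versions - older releases used builder->setFp16Mode(true) instead):

```cpp
#include <NvInfer.h>
#include <iostream>

// Minimal logger that the TensorRT builder API requires.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
};

int main() {
    Logger logger;
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
    nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();

    // Tensor Cores on Volta/Xavier only execute FP16 (or INT8) math, so
    // without this flag the builder selects FP32 kernels on the CUDA cores.
    config->setFlag(nvinfer1::BuilderFlag::kFP16);

    // ... network definition / parsing and buildEngineWithConfig() go here ...

    config->destroy();
    builder->destroy();
    return 0;
}
```

The only point of the sketch is the kFP16 flag; everything else is boilerplate. Even with it set, I still don't know how to confirm from the profiler output which kernels actually land on the Tensor Cores.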
thanks
Eyal