TensorRT 7 conv3d is not running on Tensor Cores

Description

Hello.
I’m trying to optimize a PyTorch NN on a Jetson AGX Xavier 32GB with TensorRT, but I can’t make conv3d run on Tensor Cores. I’ve created a really small and simple NN with only a convolution and an activation, and I expect it to run on Tensor Cores, but it doesn’t. However, if I replace the conv3d with a conv2d, I do see the conv2d running on Tensor Cores (for comparison, a minimal conv2d variant is included after the reproduction steps below). I am running the ONNX model with trtexec and profiling it with nvprof using the -m tensor_precision_fu_utilization metric, and I see no Tensor Core utilization. Am I missing something?

I have another post in the TensorRT section, but I was advised to post it in the Jetson-related forum.
Original post: TensorRT 7 conv3d is not running on Tensor Cores - #8 by maxim.godovicyn

Environment

TensorRT Version : 7.1.3-1+cuda10.2
GPU Type : Volta
Nvidia Driver Version : Jetpack 4.5.1
CUDA Version : 10.2
CUDNN Version : 8.0
Operating System + Version : 4.9.201-tegra (Jetpack 4.5.1)
Python Version (if applicable) : -
TensorFlow Version (if applicable) : -
PyTorch Version (if applicable) : 1.6 (for onnx model creation)
Baremetal or Container (if container which image + tag) :

Relevant Files

Onnx model:
test.onnx (432.6 KB)

Nvprof log file:
nvprof_test.log (9.5 KB)

Steps To Reproduce

Code to create onnx model:

import torch
from torch import nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv3d(64, 64, 3, padding=1)  # 64 -> 64 channels, 3x3x3 kernel
        self.pool = nn.MaxPool3d(2, 2)  # defined but never used in forward()

    def forward(self, x):
        x = self.conv(x)
        x = F.relu(x)
        return x

# NCDHW input: batch 1, 64 channels, 64x64x64 volume
image_dims = (1, 64, 64, 64, 64)

dummy_input = torch.rand(image_dims, device="cuda")

model = Net().to("cuda")

torch.onnx.export(model, dummy_input, "test.onnx", verbose=True, opset_version=11)

Command to run model and create nvprof log:
sudo -E /usr/local/cuda-10.2/bin/nvprof -m tensor_precision_fu_utilization --log-file nvprof_test.log /usr/src/tensorrt/bin/trtexec --onnx=test.onnx --explicitBatch --dumpProfile --fp16 --verbose --workspace=4096
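
For comparison, here is the conv2d variant I mentioned above, as a minimal sketch (the 2D input resolution below is just an illustrative choice on my side); exported and profiled the same way, it does show Tensor Core utilization:

import torch
from torch import nn
import torch.nn.functional as F

class Net2D(nn.Module):
    def __init__(self):
        super(Net2D, self).__init__()
        self.conv = nn.Conv2d(64, 64, 3, padding=1)  # same channel count and kernel size as the 3D model

    def forward(self, x):
        x = self.conv(x)
        x = F.relu(x)
        return x

# NCHW input; the spatial size is illustrative only
dummy_input = torch.rand((1, 64, 256, 256), device="cuda")

model = Net2D().to("cuda")

torch.onnx.export(model, dummy_input, "test_conv2d.onnx", verbose=True, opset_version=11)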

Hi,

Since TensorRT 8.0 is available now, would you mind reproducing this with it first?
Thanks.

Hi,

I’ve tried to reproduce it on Xavier with JetPack 4.6, and I see exactly the same problem: 3D convolutions do not run on Tensor Cores.

nvprof log:
nvprof_test_TRT8.log (35.7 KB)

P.S.: I’ve used the same model.

Hi,

Any update on my issue? Can you confirm that you can reproduce this problem?

Thanks.

Hi,

https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html
Based on the above document, Tensor Cores are used when one of the algorithms below is selected:

CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED

However, TensorRT will automatically choose a fast algorithm based on platform and hardware resources.
It’s not guaranteed that TensorRT will always use Tensor Cores for inference.
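
As a quick cross-check (a minimal sketch, assuming the same layer and input shape as your test.onnx; not an official recommendation), you can also run the convolution in FP16 directly through cuDNN via PyTorch and profile it with the same nvprof metric, to see which kernel cuDNN selects outside of TensorRT:

import torch
from torch import nn

torch.backends.cudnn.benchmark = True  # let cuDNN autotune the convolution algorithm

# Same layer and input shape as test.onnx, but in half precision
conv = nn.Conv3d(64, 64, 3, padding=1).cuda().half()
x = torch.rand((1, 64, 64, 64, 64), device="cuda", dtype=torch.half)

with torch.no_grad():
    for _ in range(10):  # a few iterations so the profiler captures the tuned kernel
        y = conv(x)
torch.cuda.synchronize()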

Thanks.

Hi,

So how can I check whether the algorithms you’ve mentioned are used, or at least enabled?

I understand that it is not guaranteed that TRT will use Tensor Cores, but when I run trtexec I do not see any “884” instructions in the log, so I think TRT is not even trying to use Tensor Cores. Or maybe the Tensor Core instructions for conv3d are in the log and I just don’t know how to identify them. Can you please check the trtexec log for conv3d Tensor Core instructions?
Log file (TRT8 setup):
trtexec_test.log (38.8 KB)
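
In case it helps, this is roughly how I am searching the log, as a small sketch (the “hmma”/“xmma” patterns are only my guess at what Tensor Core kernel names look like; “884” is what I actually looked for):

import re
import sys

# Scan a trtexec --verbose log for kernel names that usually indicate
# Tensor Core kernels on Volta ("884" HMMA tiles, "hmma"/"xmma" kernels).
pattern = re.compile(r"884|hmma|xmma", re.IGNORECASE)

log_path = sys.argv[1] if len(sys.argv) > 1 else "trtexec_test.log"
with open(log_path) as f:
    for line in f:
        if pattern.search(line):
            print(line.rstrip())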

Thanks.

Hi,

Please enable the cuDNN log and share it with us.

$ export CUDNN_LOGINFO_DBG=1
$ export CUDNN_LOGDEST_DBG=output.log
$ /usr/src/tensorrt/bin/trtexec ...

Thanks.

Hi,

No problem:
output.log (3.9 MB)

Thanks.

Hi,

Based on the API log, TensorRT does use CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM as the convolution algorithm.

We are checking this issue internally.
Will share more information with you later.

I! CuDNN (v8201) function cudnnConvolutionForward() called:
i!     handle: type=cudnnHandle_t; streamId=0x55c6ef49c0;
i!     alpha: type=CUDNN_DATA_FLOAT; val=1.000000;
i!     xDesc: type=cudnnTensorDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!         nbDims: type=int; val=5;
i!         dimA: type=int; val=[1,64,64,64,64];
i!         strideA: type=int; val=[16777216,262144,4096,64,1];
i!     xData: location=dev; addr=0x224ada000;
i!     wDesc: type=cudnnFilterDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!         vect: type=int; val=0;
i!         nbDims: type=int; val=5;
i!         dimA: type=int; val=[64,64,3,3,3];
i!         format: type=cudnnTensorFormat_t; val=CUDNN_TENSOR_NCHW (0);
i!     wData: location=dev; addr=0x21487a000;
i!     convDesc: type=cudnnConvolutionDescriptor_t:
i!         mode: type=cudnnConvolutionMode_t; val=CUDNN_CROSS_CORRELATION (1);
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!         mathType: type=cudnnMathType_t; val=CUDNN_DEFAULT_MATH (0);
i!         reorderType: type=int; val=0;
i!         arrayLength: type=int; val=3;
i!         padA: type=int; val=[1,1,1];
i!         strideA: type=int; val=[1,1,1];
i!         dilationA: type=int; val=[1,1,1];
i!         groupCount: type=int; val=1;
i!     algo: type=cudnnConvolutionFwdAlgo_t; val=CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM (1);
i!     workSpace: location=dev; addr=0x22cada000;
i!     workSpaceSizeInBytes: type=size_t; val=72280576;
i!     beta: type=CUDNN_DATA_FLOAT; val=0.000000;
i!     yDesc: type=cudnnTensorDescriptor_t:
i!         dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i!         nbDims: type=int; val=5;
i!         dimA: type=int; val=[1,64,64,64,64];
i!         strideA: type=int; val=[16777216,262144,4096,64,1];
i!     yData: location=dev; addr=0x228ada000;
i! Time: 2021-09-06T23:34:06.382896 (0d+0h+0m+7s since start)
i! Process=12660; Thread=12660; GPU=0; Handle=0x55ac94de20; StreamId=0x55c6ef49c0.

Thanks.

Hi,

It seems this support was added in TensorRT v8.0.

We checked your test.onnx model with JetPack 4.6 and do observe Tensor Core usage.
Please also give it a check:

==31039== NVPROF is profiling process 31039, command: /usr/src/tensorrt/bin/trtexec --onnx=conv3d.onnx --explicitBatch --dumpProfile --fp16 --verbose --workspace=4096
==31039== Profiling application: /usr/src/tensorrt/bin/trtexec --onnx=conv3d.onnx --explicitBatch --dumpProfile --fp16 --verbose --workspace=4096
==31039== Profiling result:
==31039== Metric result:
Invocations                               Metric Name                           Metric Description         Min         Max         Avg
Device "Xavier (0)"
    Kernel: volta_scudnn_128x64_3dconv_fprop_medium_nn_v1
         32           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
    Kernel: void fft3d_c2r_16x16x16<float2, float, __half>(__half*, float2*, int3, int3, int3, int3, int3, float, float, bool, int, __half*, __half*)
       4000           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
    Kernel: void xmma_trt::gemm::kernel<xmma_trt::implicit_gemm::fprop_indexed::Kernel_traits<xmma_trt::Volta_hmma_fp16_traits, xmma_trt::Cta_tile<xmma_trt::Volta<int=0>, int=128, int=256, int=32, int=2, int=4, int=1, int=1>, xmma_trt::implicit_gemm::fprop_indexed::Gmem_tile_a_t<xmma_trt::Volta_hmma_fp16_traits, xmma_trt::Cta_tile<xmma_trt::Volta<int=0>, int=128, int=256, int=32, int=2, int=4, int=1, int=1>, xmma_trt::implicit_gemm::Input_related<int=0, int=0, int=0, bool=0>, int=16, bool=0, xmma_trt::implicit_gemm::fprop_indexed::Gmem_tile_base_a<xmma_trt::Volta_hmma_fp16_traits, xmma_trt::Cta_tile<xmma_trt::Volta<int=0>, int=128, int=256, int=32, int=2, int=4, int=1, int=1>, xmma_trt::implicit_gemm::Input_related<int=0, int=0, int=0, bool=0>, int=16, xmma_trt::Row, int=32, int=128>>, xmma_trt::implicit_gemm::fprop_indexed::Gmem_tile_c_t<xmma_trt::Volta_hmma_fp16_traits, xmma_trt::Cta_tile<xmma_trt::Volta<int=0>, int=128, int=256, int=32, int=2, int=4, int=1, int=1>, int=16, xmma_trt::Fragment_c<xmma_trt::Volta_hmma_fp16_traits, xmma_trt::Cta_tile<xmma_trt::Volta<int=0>, int=128, int=256, int=32, int=2, int=4, int=1, int=1>, bool=0>>, xmma_trt::implicit_gemm::Input_related<int=0, int=0, int=0, bool=0>, int=1>>(xmma_trt::Volta_hmma_fp16_traitsParams)
         16           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    High (9)    High (9)    High (9)
    Kernel: sm70_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize64x32x64_stage1_warpsize2x1x2_g1_tensor8x8x4_t3r3s3_kernel_trt
         16           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization     Mid (5)     Mid (5)     Mid (5)
    Kernel: sm70_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x128x64_stage1_warpsize2x2x1_g1_tensor8x8x4_t3r3s3_kernel_trt
         16           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    High (9)    High (9)    High (9)
    Kernel: sm70_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize256x128x32_stage1_warpsize4x2x1_g1_tensor8x8x4_t3r3s3_kernel_trt
         16           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    High (9)    High (9)    High (9)
    Kernel: sm70_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x256x32_stage1_warpsize2x4x1_g1_tensor8x8x4_t3r3s3_kernel_trt
         16           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    High (9)    High (9)    High (9)
...

Thanks.

Hi,

I see the same, but is it OK that
“Kernel: volta_scudnn_128x64_3dconv_fprop_medium_nn_v1”
shows Idle Tensor-Precision Function Unit Utilization? Does this mean that conv3d is not running on Tensor Cores? If not, could you please clarify what it does mean?

Thanks.