TensorRT 7 conv3d is not running on Tensor Cores

Description

Hello.
I’m trying to optimize a PyTorch network on a Jetson AGX Xavier 32GB with TensorRT, but I can’t get conv3d to run on Tensor Cores. I’ve created a very small network with only a convolution and an activation, and I expected it to run on Tensor Cores, but it doesn’t. I run the ONNX model with trtexec and profile it with nvprof using -m tensor_precision_fu_utilization, and I see no Tensor Core utilization. Am I missing something?

Environment

TensorRT Version: 7.1.3-1+cuda10.2
GPU Type: Volta
Nvidia Driver Version: Jetpack 4.5.1
CUDA Version: 10.2
CUDNN Version: 8.0
Operating System + Version: 4.9.201-tegra (Jetpack 4.5.1)
Python Version (if applicable): -
TensorFlow Version (if applicable):-
PyTorch Version (if applicable): 1.6 (for onnx model creation)
Baremetal or Container (if container which image + tag):

Relevant Files

Onnx model:
test.onnx (432.6 KB)

Nvprof log file:
nvprof_test.log (9.5 KB)

Steps To Reproduce

Code to create onnx model:

import torch
from torch import nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 64 -> 64 channels, 3x3x3 kernel; padding=1 keeps the spatial size
        self.conv = nn.Conv3d(64, 64, 3, padding=1)
        # Defined but not used in forward()
        self.pool = nn.MaxPool3d(2, 2)

    def forward(self, x):
        x = self.conv(x)
        x = F.relu(x)
        return x

# (batch, channels, depth, height, width)
image_dims = (1, 64, 64, 64, 64)

dummy_input = torch.rand(image_dims, device="cuda")

model = Net().to("cuda")

torch.onnx.export(model, dummy_input, "test.onnx", verbose=True, opset_version=11)

Command to run model and create nvprof log:
sudo -E /usr/local/cuda-10.2/bin/nvprof -m tensor_precision_fu_utilization --log-file nvprof_test.log /usr/src/tensorrt/bin/trtexec --onnx=test.onnx --explicitBatch --dumpProfile --fp16 --verbose --workspace=4096
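
To read the resulting log without scanning it by eye, a small script along these lines could summarize the metric per kernel. This is only a sketch: it assumes the usual nvprof metric layout (a "Kernel:" line followed by a metric row whose last three columns are Min, Max and Avg, such as "Idle (0)"), and the excerpt in SAMPLE is hypothetical, with made-up kernel names, not taken from nvprof_test.log.

```python
import re

METRIC = "tensor_precision_fu_utilization"

def tensor_core_levels(log_text):
    """Map each kernel name to the highest Max utilization level found for it."""
    levels = {}
    kernel = None
    for line in log_text.splitlines():
        if "Kernel:" in line:
            # Remember which kernel the following metric rows belong to.
            kernel = line.split("Kernel:", 1)[1].strip()
        elif METRIC in line and kernel is not None:
            # Assumed layout: the row ends with Min, Max, Avg such as "Low (2)".
            nums = [int(n) for n in re.findall(r"\((\d+)\)", line)]
            if len(nums) >= 3:
                levels[kernel] = max(levels.get(kernel, 0), nums[-2])
    return levels

# Hypothetical excerpt; the kernel names are made up for illustration.
SAMPLE = """\
Device "Xavier (0)"
    Kernel: hypothetical_fp16_conv3d_kernel
          4  tensor_precision_fu_utilization  Tensor-Precision Function Unit Utilization  Idle (0)  Idle (0)  Idle (0)
    Kernel: hypothetical_h884_conv2d_kernel
          4  tensor_precision_fu_utilization  Tensor-Precision Function Unit Utilization  Low (2)  Mid (5)  Low (3)
"""

for name, level in tensor_core_levels(SAMPLE).items():
    status = "uses Tensor Cores" if level > 0 else "idle on Tensor Cores"
    print(name, status)
```

For a real run, the contents of nvprof_test.log would be passed in instead of SAMPLE. If the actual log layout differs from the assumption above, the parsing needs adjusting.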

Hi,
The link below might help with your query. Please check it for all layers with 3D support:

Thanks!

Hi,
Thanks for your answer. I’ve checked these links and the cuDNN guidelines for 3D convolutions on Tensor Cores, but I don’t see any limitation that I’ve violated. All of my input and output channels are multiples of 8, my kernel is (3, 3, 3), and my padding is (1, 1, 1). So what’s wrong? Is my board misconfigured, or is this network not supposed to run on Tensor Cores?
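
For reference, the channel check I did by hand can be sketched as below. It covers only the multiple-of-8 condition on channel counts mentioned above. The function name is just an illustration, and passing this check alone clearly does not guarantee that a Tensor Core kernel gets selected.

```python
def misaligned_channels(in_channels, out_channels, multiple=8):
    """Return the names of channel counts that are not a multiple of `multiple`."""
    counts = {"in_channels": in_channels, "out_channels": out_channels}
    return [name for name, count in counts.items() if count % multiple != 0]

# The layer from this thread: Conv3d(64, 64, 3, padding=1)
print(misaligned_channels(64, 64))  # prints []

# A typical first layer with 3 input channels would fail the check
print(misaligned_channels(3, 64))   # prints ['in_channels']
```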

Hi,

It may be difficult to say whether a single layer will choose the MMA format, because that requires an additional kernel to transform the input, which may hurt overall performance.

Thank you.

Hi,

Thanks for your answer. I understand that, but when I do the same thing with conv2d, I see TensorRT at least trying to use Tensor Core instructions (h884***). With 3D convolution I don’t see that. How can I check whether conv3d can be offloaded to Tensor Cores? I believe I’ve checked all the limitations, and I don’t see that I’m breaking any of them.
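
The name check I’m doing amounts to something like the sketch below: filter the profiled kernel names for the "h884" substring. The kernel names here are hypothetical, made up for illustration. This is only a heuristic, since a kernel without that substring is not proven to avoid Tensor Cores, which is why I also look at the tensor_precision_fu_utilization metric.

```python
def tensor_core_kernels(kernel_names, marker="h884"):
    """Return the kernel names that contain the given marker substring."""
    return [name for name in kernel_names if marker in name]

# Hypothetical kernel names, made up for illustration only.
names = [
    "example_h884_conv2d_kernel",
    "example_fp16_conv3d_kernel",
]
print(tensor_core_kernels(names))
```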

Thank you

Hi,

I’ve double-checked: I replaced Conv3d/MaxPool3d with Conv2d/MaxPool2d in my model creation code, and I see Conv2d executing on Tensor Cores.

import torch
from torch import nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(64, 64, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.conv(x)
        x = F.relu(x)
        return x

# image_dims = (1, 64, 64, 64, 64)  # previous 5D input for Conv3d
image_dims = (1, 64, 64, 64)  # (batch, channels, height, width)

dummy_input = torch.rand(image_dims, device="cuda")

model = Net().to("cuda")

torch.onnx.export(model, dummy_input, "test.onnx", verbose=True, opset_version=11)

Can you please share a link or some documentation on cuDNN 8.0 for Jetson platforms? As I understand it, it differs from the desktop release, and it’s not clear what is supported and what is not.

Hi,

Sorry for the delayed response. I hope the following similar post will give you more details.

For cuDNN, please refer to https://developer.nvidia.com/cudnn

If you still need further assistance with Jetson, we recommend posting your concern on the Jetson forum to get better help.

Thank you.

Hi,

Thank you for your answer. Unfortunately, the post you suggested did not clarify things. I understand that grouped conv3d is not supported and that TRT has to split it and run a kernel for each convolution, but why are those kernels not running on Tensor Cores?

I’ve created a new topic in the Jetson Xavier section. Thank you!