Description
TensorRT does not appear to select Tensor Core kernels for Conv3D. I tried both running an ONNX model containing a single Conv3D and constructing the network definition directly with the TensorRT API.
As a result, inference is slower than in PyTorch.
I believe I followed all recommendations for 3D convolutions: all relevant dimensions (batch, channels, spatial extents) are multiples of 8.
The Conv2D equivalent of the same network does get a Tensor Core-enabled kernel (see the sketch after the reproduction script).
I tested this on an RTX 2080 Ti and on a T4.
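For reference, this is roughly the PyTorch baseline and ONNX export path I compared against. The timing loop and the file name single_conv3d.onnx are illustrative, not the exact benchmark I ran:
import time

import torch

# Single 1x1x1 Conv3D; all dimensions are multiples of 8, matching the
# TensorRT network in "Steps To Reproduce" below.
model = torch.nn.Conv3d(64, 64, kernel_size=1).cuda().half().eval()
x = torch.ones(1, 64, 64, 64, 64, dtype=torch.float16, device="cuda")

with torch.no_grad():
    for _ in range(10):  # warm-up so the timed run excludes autotuning
        model(x)
    torch.cuda.synchronize()
    tstart = time.time()
    model(x)
    torch.cuda.synchronize()
    print(f"PyTorch run time: {time.time() - tstart}")

# ONNX path: export the same single-Conv3D model (in FP32) for TensorRT.
torch.onnx.export(model.float(), x.float(), "single_conv3d.onnx")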
Environment
TensorRT Version: 7.1.3.4
GPU Type: RTX 2080 Ti / T4
Nvidia Driver Version: 440.33.01
CUDA Version: 10.2
CUDNN Version: 8.0.1
Operating System + Version: CentOS 7.7.1908
Python Version (if applicable): 3.6.8
Baremetal or Container (if container which image + tag): Baremetal
Relevant Files
See below.
Steps To Reproduce
Run the following script with python3.
import time

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)

if __name__ == "__main__":
    EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(
        EXPLICIT_BATCH
    ) as network:
        input_tensor = network.add_input(
            name="input_image", dtype=trt.float32, shape=(1, 64, 64, 64, 64)
        )
        # Add a convolution layer.
        conv1_w = np.ones((64, 64, 1, 1, 1), dtype=np.float32)
        conv1_b = np.ones(64, dtype=np.float32)
        conv1 = network.add_convolution_nd(
            input=input_tensor,
            num_output_maps=64,
            kernel_shape=(1, 1, 1),
            kernel=conv1_w,
            bias=conv1_b,
        )
        conv1.get_output(0).name = "output_featuremap"
        network.mark_output(conv1.get_output(0))

        # Build the engine. The workspace size determines how much memory the
        # builder may use when searching for tactics and should generally be
        # set as high as possible.
        builder.max_workspace_size = 8500000000
        builder.fp16_mode = True
        # builder.strict_type_constraints = True
        builder.min_find_iterations = 10
        builder.average_find_iterations = 10
        with builder.build_cuda_engine(network) as engine:
            with open("to_mount/toy/toy.engine", "wb") as f:
                f.write(engine.serialize())

            # Inference.
            # Create page-locked host buffers (i.e. they won't be swapped to
            # disk) and matching device buffers for every binding, using the
            # dtype TensorRT actually expects for that binding.
            h_bindings = []
            d_bindings = []
            for i in range(engine.num_bindings):
                h_bindings.append(
                    cuda.pagelocked_empty(
                        trt.volume(engine.get_binding_shape(i)),
                        dtype=trt.nptype(engine.get_binding_dtype(i)),
                    )
                )
                # Allocate device memory for inputs and outputs.
                d_bindings.append(cuda.mem_alloc(h_bindings[-1].nbytes))
            # Create a stream in which to copy inputs/outputs and run inference.
            stream = cuda.Stream()
            with engine.create_execution_context() as context:
                print("Starting inference")
                # Transfer input data to the GPU.
                for i in range(engine.num_bindings):
                    cuda.memcpy_htod_async(d_bindings[i], h_bindings[i], stream)
                # Run inference; execute_async_v2 is the explicit-batch API.
                tstart = time.time()
                context.execute_async_v2(
                    bindings=[int(b) for b in d_bindings],
                    stream_handle=stream.handle,
                )
                # Synchronize the stream.
                stream.synchronize()
                print(f"Total Run Time: {time.time() - tstart}")
                # Transfer predictions back from the GPU.
                for i in range(engine.num_bindings):
                    cuda.memcpy_dtoh_async(h_bindings[i], d_bindings[i], stream)
                # Synchronize the stream.
                stream.synchronize()
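For comparison, this is a sketch of the Conv2D-equivalent network mentioned in the description, which does pick a Tensor Core kernel. Only the input and convolution definitions change; the rest of the script above stays the same:
        # Conv2D equivalent: 4D input and a 2D kernel, same sizes otherwise.
        input_tensor = network.add_input(
            name="input_image", dtype=trt.float32, shape=(1, 64, 64, 64)
        )
        conv1_w = np.ones((64, 64, 1, 1), dtype=np.float32)
        conv1_b = np.ones(64, dtype=np.float32)
        conv1 = network.add_convolution_nd(
            input=input_tensor,
            num_output_maps=64,
            kernel_shape=(1, 1),
            kernel=conv1_w,
            bias=conv1_b,
        )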
Profiling with nv-nsight-cu-cli yields the following. Note that the convolution is executed by implicit_convolveNd_sgemm, an FP32 SGEMM kernel; a Tensor Core kernel would typically carry h884/h1688/hmma in its name:
==PROF== Connected to process 27360 (/usr/bin/python3.6)
Starting inference
==PROF== Profiling "tensor_elementwise_kernel" - 1: 0%...50%...100% - 13 passes
==PROF== Profiling "implicit_convolveNd_sgemm" - 2: 0%...50%...100% - 13 passes
==PROF== Profiling "op_generic_tensor_kernel" - 3: 0%...50%...100% - 13 passes
==PROF== Profiling "nchwTonchw" - 4: 0%...50%...100% - 13 passes
Total Run Time: 9.15361213684082
==PROF== Disconnected from process 27360