[TensorRT] Cuda Error in findFastestTactic: 4 (unspecified launch failure), Cuda Error in free: 4 (u...

I am converting the r3d18 model from torchvision to a TensorRT model with torch2trt.

I am using:
JetPack 4.3, TensorRT 6.0, PyTorch 1.3, torchvision 0.4.2

Since torch2trt’s converters do not include BatchNorm3d, AdaptiveAvgPool3d, etc., I added these files to the converters directory and rebuilt torch2trt. I also tested the Conv3d, BatchNorm3d, and AdaptiveAvgPool3d converters individually and compared the outputs before and after TensorRT conversion.

However, converting the torchvision r3d18 model with torch2trt printed the errors below. (I had already resolved the unsupported-module conversion errors.)

[TensorRT] ERROR: …/builder/cudnnBuilderUtils.cpp (354) - Cuda Error in findFastestTactic: 4 (unspecified launch failure)
[TensorRT] ERROR: …/rtSafe/safeRuntime.cpp (32) - Cuda Error in free: 4 (unspecified launch failure)
terminate called after throwing an instance of ‘nvinfer1::CudaError’
what(): std::exception

What should I do?
I need help.

Hi,

CUDA error 4 is cudaErrorLaunchFailure:

An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. The device cannot be used until cudaThreadExit() is called. All existing device memory allocations are invalid and must be reconstructed if the program is to continue using CUDA.

May I know how you updated the converters for the unsupported layers?
Also, have you allocated any swap space in your environment?
Please note that swap space cannot be accessed by the GPU.
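
To answer the swap question quickly, one option is to read the `SwapTotal` and `SwapFree` fields from `/proc/meminfo`. The following is only a minimal sketch; the helper names `parse_meminfo`, `swap_in_use`, and `read_swap` are made up for this illustration and are not part of any NVIDIA tool.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def swap_in_use(info):
    """Return (total, used) swap; both are 0 if no swap is configured."""
    total = info.get("SwapTotal", 0)
    free = info.get("SwapFree", 0)
    return total, total - free


def read_swap(path="/proc/meminfo"):
    """Hypothetical convenience wrapper: read the file and report swap."""
    with open(path) as f:
        return swap_in_use(parse_meminfo(f.read()))
```

Calling `read_swap()` on the Jetson should return `(0, 0)` when no swap is configured. The values are in kB as reported by the kernel.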

Thanks.

Hi,

First of all, I did not allocate any swap space.

The converters are written as follows.

Conv3d.py

from torch2trt.module_test import add_module_test
from torch2trt.torch2trt import *

@tensorrt_converter("torch.nn.Conv3d.forward")
def convert_Conv3d(ctx):
    # ctx.method_args holds (module, input) for the intercepted forward call
    module = ctx.method_args[0]
    input = ctx.method_args[1]
    input_trt = trt_(ctx.network, input)
    output = ctx.method_return

    # Expand scalar hyperparameters to 3-tuples (D, H, W)
    kernel_size = module.kernel_size
    if not isinstance(kernel_size, tuple):
        kernel_size = (kernel_size,) * 3

    stride = module.stride
    if not isinstance(stride, tuple):
        stride = (stride,) * 3

    padding = module.padding
    if not isinstance(padding, tuple):
        padding = (padding,) * 3

    dilation = module.dilation
    if not isinstance(dilation, tuple):
        dilation = (dilation,) * 3

    kernel = module.weight.detach().cpu().numpy()

    # An empty Weights object stands in for a missing bias
    bias = trt.Weights(torch_dtype_to_trt(module.weight.dtype))
    if module.bias is not None:
        bias = module.bias.detach().cpu().numpy()

    layer = ctx.network.add_convolution_nd(
        input=input_trt,
        num_output_maps=module.out_channels,
        kernel_shape=kernel_size,
        kernel=kernel,
        bias=bias,
    )
    layer.stride_nd = stride
    layer.padding_nd = padding
    layer.dilation_nd = dilation

    if module.groups is not None:
        layer.num_groups = module.groups

    # Attach the TensorRT tensor to the PyTorch output so later converters can find it
    output._trt = layer.get_output(0)

@add_module_test(torch.float32, torch.device("cuda"), [(1, 10, 128, 128, 128)])
def test_Conv3d_basic():
    return torch.nn.Conv3d(10, 5, kernel_size=1, stride=1, padding=0)

@add_module_test(torch.float32, torch.device("cuda"), [(1, 10, 128, 128, 128)])
def test_Conv3d_stride2():
    return torch.nn.Conv3d(10, 5, kernel_size=1, stride=2, padding=0)

@add_module_test(torch.float32, torch.device("cuda"), [(1, 10, 128, 128, 128)])
def test_Conv3d_kernel3():
    return torch.nn.Conv3d(10, 5, kernel_size=3, stride=2, padding=1)

@add_module_test(torch.float32, torch.device("cuda"), [(1, 10, 128, 128, 128)])
def test_Conv3d_dilation2():
    return torch.nn.Conv3d(10, 5, kernel_size=3, stride=1, padding=1, dilation=2)
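
When comparing the PyTorch and TensorRT outputs of these four tests, it can help to know the expected spatial size up front. The sketch below applies the standard convolution output-size formula to one axis of the 128-wide test input; `conv_out_size` and `TEST_CASES` are names invented for this illustration and do not exist in torch2trt.

```python
def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one axis for a convolution (floor mode)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


# Hyperparameters copied from the four Conv3d module tests above
TEST_CASES = {
    "basic": dict(kernel=1, stride=1, padding=0, dilation=1),
    "stride2": dict(kernel=1, stride=2, padding=0, dilation=1),
    "kernel3": dict(kernel=3, stride=2, padding=1, dilation=1),
    "dilation2": dict(kernel=3, stride=1, padding=1, dilation=2),
}

# Expected size of each spatial axis for a 128-wide input
expected = {name: conv_out_size(128, **cfg) for name, cfg in TEST_CASES.items()}
```

If the TensorRT output shape of a test differs from `expected`, the stride, padding, or dilation was probably not applied to the layer.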

BatchNorm3d.py

import numpy as np  # explicit import; np is used below
from torch2trt.torch2trt import *

@tensorrt_converter("torch.nn.BatchNorm3d.forward")
def convert_BatchNorm3d(ctx):
    module = ctx.method_args[0]
    input = ctx.method_args[1]
    input_trt = trt_(ctx.network, input)
    output = ctx.method_return

    # Fold the inference-mode batch norm into a per-channel scale and shift
    scale = module.weight.detach().cpu().numpy() / np.sqrt(module.running_var.detach().cpu().numpy() + module.eps)
    bias = module.bias.detach().cpu().numpy() - module.running_mean.detach().cpu().numpy() * scale
    power = np.ones_like(scale)

    # The last argument is the channel axis
    layer = ctx.network.add_scale_nd(input_trt, trt.ScaleMode.CHANNEL, bias, scale, power, 0)
    output._trt = layer.get_output(0)
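
The scale and shift computed in this converter can be checked without TensorRT. The sketch below uses plain Python on per-channel scalars and verifies that `x * scale + shift` equals the inference-mode batch norm formula. `fold_batchnorm` and `batchnorm_eval` are illustrative names, and the sketch assumes the module is in eval mode with affine parameters present.

```python
import math


def fold_batchnorm(weight, bias, running_mean, running_var, eps):
    """Fold per-channel batch norm statistics into (scale, shift) lists."""
    scale = [w / math.sqrt(v + eps) for w, v in zip(weight, running_var)]
    shift = [b - m * s for b, m, s in zip(bias, running_mean, scale)]
    return scale, shift


def batchnorm_eval(x, weight, bias, mean, var, eps):
    """Reference inference-mode batch norm for one value of one channel."""
    return (x - mean) / math.sqrt(var + eps) * weight + bias


# Two channels with arbitrary example statistics
weight, bias = [1.5, 0.5], [0.1, -0.2]
mean, var, eps = [0.3, -1.0], [4.0, 0.25], 1e-5
scale, shift = fold_batchnorm(weight, bias, mean, var, eps)

# Largest difference between the folded form and the reference formula
max_err = max(
    abs(x * scale[c] + shift[c]
        - batchnorm_eval(x, weight[c], bias[c], mean[c], var[c], eps))
    for c in range(2)
    for x in (-2.0, 0.0, 3.7)
)
```

With `power` set to one, the scale layer reduces to this same multiply-and-add per channel.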

AdaptiveAvgPool3d.py

from torch2trt.torch2trt import *
from torch2trt.module_test import add_module_test

@tensorrt_converter("torch.nn.AdaptiveAvgPool3d.forward")
def convert_AdaptiveAvgPool3d(ctx):
    module = ctx.method_args[0]
    input = ctx.method_args[1]
    output = ctx.method_return
    input_trt = trt_(ctx.network, input)

    output_size = module.output_size
    if not isinstance(output_size, tuple):
        output_size = (output_size,) * 3

    # Approximate adaptive pooling with a fixed window: stride = input size // output size
    stride = (
        input_trt.shape[-3] // output_size[-3],
        input_trt.shape[-2] // output_size[-2],
        input_trt.shape[-1] // output_size[-1],
    )
    kernel_size = stride
    layer = ctx.network.add_pooling_nd(
        input=input_trt, type=trt.PoolingType.AVERAGE, window_size=kernel_size)
    layer.stride_nd = stride

    output._trt = layer.get_output(0)

@add_module_test(torch.float32, torch.device("cuda"), [(1, 3, 128, 128, 128)])
def test_AdaptiveAvgPool3d_1x1():
    return torch.nn.AdaptiveAvgPool3d((1, 1, 1))

@add_module_test(torch.float32, torch.device("cuda"), [(1, 3, 128, 128, 128)])
def test_AdaptiveAvgPool3d_2x2():
    return torch.nn.AdaptiveAvgPool3d((2, 2, 2))

@add_module_test(torch.float32, torch.device("cuda"), [(1, 3, 128, 128, 128)])
def test_AdaptiveAvgPool3d_3x3():
    return torch.nn.AdaptiveAvgPool3d((3, 3, 3))
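
One thing worth checking in this converter: a fixed window with kernel = stride = input size // output size covers the same elements as adaptive pooling only when the input size is divisible by the output size. The sketch below compares the windows along one axis. `adaptive_windows` and `fixed_windows` are illustrative helpers, and the adaptive window bounds (floor of the start, ceiling of the end) are stated here as my understanding of PyTorch's behavior.

```python
def adaptive_windows(in_size, out_size):
    """[start, end) windows used by adaptive pooling along one axis."""
    return [
        ((i * in_size) // out_size,                      # floor(i * in / out)
         ((i + 1) * in_size + out_size - 1) // out_size) # ceil((i + 1) * in / out)
        for i in range(out_size)
    ]


def fixed_windows(in_size, out_size):
    """[start, end) windows of a pooling layer with kernel = stride = in // out."""
    stride = in_size // out_size
    kernel = stride
    count = (in_size - kernel) // stride + 1
    return [(i * stride, i * stride + kernel) for i in range(count)]


# Output sizes 1 and 2 divide 128 evenly, so both schemes agree
same_1 = adaptive_windows(128, 1) == fixed_windows(128, 1)
same_2 = adaptive_windows(128, 2) == fixed_windows(128, 2)
# Output size 3 does not divide 128, so the windows differ
same_3 = adaptive_windows(128, 3) == fixed_windows(128, 3)
```

For the 3x3 test the fixed windows are (0, 42), (42, 84), (84, 126), while the adaptive windows are (0, 43), (42, 86), (85, 128), so small numerical differences are to be expected there. This is probably unrelated to the launch failure if the model only pools to an output size of 1.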

Thanks.

Hi,

Your implementation looks good to me, so we need more information to find the cause.

Could you share the full TensorRT log with us?
Also, are you using torch2trt from here:
https://github.com/NVIDIA-AI-IOT/torch2trt/tree/972a7c71ace90de77f36bd1fbd88113268abf5df/torch2trt/converters

Thanks.

Hi,

I used this commit of torch2trt from GitHub:
https://github.com/NVIDIA-AI-IOT/torch2trt/tree/fc41653ec2d555806c555b447764934d08c8aa81

The full TensorRT log is

[TensorRT] ERROR: …/builder/cudnnBuilderUtils.cpp (354) - Cuda Error in findFastestTactic: 4 (unspecified launch failure)
[TensorRT] ERROR: …/rtSafe/safeRuntime.cpp (32) - Cuda Error in free: 4 (unspecified launch failure)
terminate called after throwing an instance of ‘nvinfer1::CudaError’
what(): std::exception
Aborted (core dumped)

That is the whole output.

Is there another way to get a fuller TensorRT log?
If so, please let me know how.
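
One way to get more output is to raise the TensorRT logger severity. The sketch below assumes that the `torch2trt` function accepts a `log_level` keyword argument and that `torchvision.models.video.r3d_18` is the model in question. The input shape `(1, 3, 16, 112, 112)` and the function name `convert_r3d18_verbose` are assumptions for illustration. It needs a CUDA device with TensorRT installed, so the imports are kept inside the function.

```python
def convert_r3d18_verbose():
    """Convert r3d_18 with verbose TensorRT logging (requires GPU + TensorRT)."""
    import tensorrt as trt
    import torch
    from torch2trt import torch2trt
    from torchvision.models.video import r3d_18

    model = r3d_18(pretrained=False).cuda().eval()

    # Assumed clip shape: batch, channels, frames, height, width
    x = torch.randn(1, 3, 16, 112, 112).cuda()

    # VERBOSE makes the builder print per-layer and per-tactic messages
    return torch2trt(model, [x], log_level=trt.Logger.VERBOSE)
```

The last layer mentioned in the verbose log before the crash may indicate which layer's tactic selection triggers the launch failure.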

Thanks.

Hi,

Sorry for the late reply.

A possible cause of error 4 is a kernel reading out of bounds.
Would you mind re-running the sample with cuda-memcheck to get more log information?

Thanks.

Hi,

I re-ran the sample with cuda-memcheck
like this: cuda-memcheck python3 ~~~~

The output is:

========= Internal Memcheck Error: Initialization failed
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 (cuDevicePrimaryCtxRetain + 0x154) [0x1fda6c]
========= Host Frame:/usr/local/cuda-10.0/lib64/libcudart.so.10.0 [0x2a708]

[TensorRT] ERROR: …/builder/cudnnBuilderUtils.cpp (354) - Cuda Error in findFastestTactic: 4 (unspecified launch failure)
[TensorRT] ERROR: …/rtSafe/safeRuntime.cpp (32) - Cuda Error in free: 4 (unspecified launch failure)
terminate called after throwing an instance of ‘nvinfer1::CudaError’
what(): std::exception
========= Error: process didn’t terminate successfully
========= No CUDA-MEMCHECK results found

Thanks.

Hi,

Sorry to keep you waiting.

We would like to reproduce this issue in our environment.
This requires your complete implementation and the detailed steps to reproduce it.

Could you provide these for us?
We are trying to narrow down the issue, but there are many possible causes of a launch failure.

Thanks.

Hi,

Sorry to keep you waiting.

I uploaded the torch2trt test code for my implementation.

Please use the following GitHub repository:

https://github.com/InwoongLee/action_rt_test

Thanks.