CUDA initialization failure when converting TRT model with a different GPU

Description

Hi, I basically have a Dockerfile that successfully builds on a machine with a Tesla V100, but fails on a machine with a Tesla T4. I have uploaded the Dockerfile below. It fails at the line “RUN python3.7 /app/config/trt_convert.py”.
The error is: [TensorRT] Error: CUDA initialization failure with error 35. Please check your CUDA installation: Installation Guide Linux :: CUDA Toolkit Documentation

Why does this issue occur when the same base image is used and the same CUDA, cuDNN and TensorRT versions are installed?
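From the CUDA documentation, error 35 corresponds to cudaErrorInsufficientDriver, i.e. the host driver is older than the CUDA runtime that the library in the image was built against. To compare what each side reports, a check along these lines should work (a minimal sketch using ctypes; the libcudart soname below is a guess and depends on the toolkit inside the image):

```python
# Sketch: compare the CUDA version the host driver supports with the runtime in the image.
import ctypes

driver = ctypes.CDLL("libcuda.so.1")        # provided by the host driver, mounted into the container
runtime = ctypes.CDLL("libcudart.so.10.0")  # soname depends on the CUDA toolkit in the image

drv, rt = ctypes.c_int(0), ctypes.c_int(0)
print("cuInit returned:", driver.cuInit(0))  # non-zero means driver-level init failed
driver.cuDriverGetVersion(ctypes.byref(drv))
runtime.cudaRuntimeGetVersion(ctypes.byref(rt))

# Versions are encoded as 1000*major + 10*minor, e.g. 10000 -> CUDA 10.0
print(f"driver supports CUDA {drv.value}, runtime is CUDA {rt.value}")
```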

Environment

TensorRT Version: 7.0.0
GPU Type: V100 vs T4
CUDA Version: 10.0.130
CUDNN Version: 7.6.5.32
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.7
TensorFlow Version (if applicable): NIL
PyTorch Version (if applicable): NIL
Baremetal or Container (if container which image + tag): Base Ubuntu 18.04 image

Relevant Files

Dockerfile (5.9 KB)

Hi,

We recommend using the pre-built TensorRT containers to avoid setup-related issues; you can customize the image on top of them.

Or please refer to TensorRT/docker at main · NVIDIA/TensorRT · GitHub

Thank you.

Hi,

I attempted to use the image 20.01-py3 from TensorRT | NVIDIA NGC. I changed my first line to nvcr.io/nvidia/tensorrt:20.01-py3, and then successfully converted a darknet file to ONNX and then to TensorRT and generated the expected inference results on my local machine, which runs a Tesla V100. As I understand it, this container uses TensorRT 7.0.0.11 (that’s the version reported when I imported it), so it is identical to my previously built image in that sense.

However, when I switch to an AWS GPU instance with a Tesla T4, using the EXACT same image, it fails again. I tried 2 runs on AWS:

  1. Doing the darknet-to-ONNX conversion, followed by the ONNX-to-TensorRT conversion, both while building the image on AWS.
  2. Doing the darknet-to-ONNX conversion on my local machine, then copying the ONNX file into AWS and doing the ONNX-to-TensorRT conversion there.

For 1: I receive the error -
151 conv 256 1 x 1 / 1 64 x 36 x 256 → 64 x 36 x 256
The command ‘/bin/sh -c python3 /app/config/darknet2onnx/demo_darknet2onnx.py /app/config/mobius-yolov4-csp-exp24.cfg /app/config/mobius_exp24.names /app/config/mobius-yolov4-csp-exp24_best.weights /app/config/0.jpg 1’ returned a non-zero code: 137

For 2: I receive an error identical to the one above - [TensorRT] Error: CUDA initialization failure with error 35. Please check your CUDA installation: Installation Guide Linux :: CUDA Toolkit Documentation

I have these questions:

  1. Should I be converting darknet to ONNX on the same machine on which I convert ONNX to TensorRT? Does this matter at all, and would it resolve the second error above?
  2. Error code 137 seems to be a memory issue from what I read. Why would this happen at all?
  3. Should I upgrade to a higher version of the container? But again, why would that fix anything, since it runs perfectly on my local machine (Tesla V100)? I checked the CUDA compute capability for both the T4 (7.5) and the V100 (7.0); see the quick check sketched after this list. From Wikipedia: “CUDA SDK 10.0 – 10.2 support for compute capability 3.0 – 7.5 (Kepler, Maxwell, Pascal, Volta, Turing). Last version with support for compute capability 3.0 and 3.2 (Kepler in part). 10.2 is the last official release for macOS, as support will not be available for macOS in newer releases.” So shouldn’t CUDA 10.2, which is what I am using in this image, support the T4?
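Here is the quick compute-capability check I mean for question 3, run on each machine (a sketch using the CUDA driver API via ctypes; the attribute IDs 75 and 76 are taken to be the major/minor compute-capability attributes from cuda.h):

```python
# Sketch: query the compute capability of device 0 via the CUDA driver API.
import ctypes

cuda = ctypes.CDLL("libcuda.so.1")
assert cuda.cuInit(0) == 0, "cuInit failed; the CUDA driver is not usable here"

dev, major, minor = ctypes.c_int(), ctypes.c_int(), ctypes.c_int()
cuda.cuDeviceGet(ctypes.byref(dev), 0)
cuda.cuDeviceGetAttribute(ctypes.byref(major), 75, dev)  # CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR
cuda.cuDeviceGetAttribute(ctypes.byref(minor), 76, dev)  # CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR
print(f"compute capability: {major.value}.{minor.value}")  # expect 7.0 on V100, 7.5 on T4
```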

Please advise. This is a very bizarre and frustrating problem, and I have thrown almost everything at it.

I’ve checked the CUDA driver version and found that mine is 450.51.06. The Container Release Notes :: NVIDIA Deep Learning TensorRT Documentation state that “Release 20.07 is based on NVIDIA CUDA 11.0.194, which requires NVIDIA Driver release 450 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx or 440.30.” Does that mean I should not have a driver issue, since my driver is >= 450?

Are you facing this issue with generating the ONNX model?


We don’t need to generate the ONNX model on the same machine, but the TensorRT engine needs to be built on the same machine on which we run inference.
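For context, a serialized engine (plan file) is specific to the GPU and TensorRT version it was built with, which is why it has to be built where it will run. The deployment side looks roughly like this (a sketch; the path is a placeholder):

```python
# Sketch: deserializing a plan file that was built on this same GPU / TensorRT version.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open("/app/config/model.trt", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    # Returns None / logs errors if the plan was built for a different GPU or TensorRT version.
    engine = runtime.deserialize_cuda_engine(f.read())
```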


We couldn’t follow it exactly; could you please share the error log with us? The (2) initialization error you’re facing could be due to a package incompatibility. Please make sure you’re able to run the CUDA samples successfully first: CUDA Installation Guide for Linux


We recommend that you use the latest image and the latest TensorRT version. Version 20.07 is very old; there could be issues that are resolved in later versions.

  1. Are you facing this issue with generating the ONNX model?
  2. On replicating error code 137:
    Yes, I am facing this issue. I rebuilt and tested the conversion again, with the AWS Tesla T4 running driver 450.51.06. I used the TensorRT container 21.02-py3. I receive the exact same error.

From the documentation, it states: “Release 21.02 is based on [NVIDIA CUDA 11.2.0], which requires [NVIDIA Driver] release 460.27.04 or later. However, if you are running on Data Center GPUs (formerly Tesla), for example, T4, you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), 450.51 (or later R450).” This explicitly states that it will work on my hardware and driver.

At the same time, it continues to work properly on my local machine, which has a Tesla V100 with driver 418.165.02, which isn’t even among the stated supported drivers.

As for the exact error log, there isn’t much detail on this error, other than the darknet model being printed out and the command being shown to exit with error code 137. Here is an excerpt of the very long logs:
"
149 conv 256 1 x 1 / 1 64 x 36 x 256 → 64 x 36 x 256
150 route 148
151 conv 256 1 x 1 / 1 64 x 36 x 256 → 64 x 36 x 256
The command ‘/bin/sh -c python3 /app/config/darknet2onnx/demo_darknet2onnx.py /app/config/mobius-yolov4-csp-exp24.cfg /app/config/mobius_exp24.names /app/config/mobius-yolov4-csp-exp24_best.weights /app/config/0.jpg 1’ returned a non-zero code: 137

[Container] 2022/09/02 15:00:53 Command did not exit successfully docker build --no-cache --build-arg no_proxy=$no_proxy --build-arg NO_PROXY=$no_proxy --build-arg http_proxy=$http_proxy --build-arg HTTP_PROXY=$http_proxy --build-arg HTTPS_PROXY=$http_proxy --build-arg https_proxy=$http_proxy --build-arg RUNTIME_BASE=$RUNTIME_BASE --build-arg GPU=True --build-arg CUDNN_HALF=True --build-arg SQS_QUEUE_URL=$SQS_QUEUE_URL -t $REPOSITORY_URI:$STAGE . exit status 137
"
I’m not sure your recommendation to run the CUDA sample first would prove anything, though, as this is a container provided by NVIDIA, and it runs perfectly on the Tesla V100 but breaks specifically on the AWS Tesla T4. So I’m not sure there is any package incompatibility at all. In fact, the V100’s driver is a less explicit match for the release notes than the T4’s.

  1. We recommend that you use the latest image and the latest TensorRT version. Version 20.07 is very old; there could be issues that are resolved in later versions.

Which version do you recommend? The issue is that my code was built on Python 3.7, and transitioning to higher Python versions (which come with your newer image versions) requires additional code edits. Do you have any plausible insight into this issue? I have already tried 6 of your containers to no avail, so what assurance is there that moving to a higher version and changing all my code will yield results?

Sorry, it’s not clear whether you are facing this issue when building the ONNX model or when building the TensorRT engine. Were you able to successfully generate the ONNX model?

We recommended making sure CUDA is running correctly because the above error is more related to CUDA/driver.

You can try the latest TensorRT image, 22.08, which has Python 3.8; Python 3.8 doesn’t have many changes compared to 3.7.

Thank you.

Hi,

Thank you for the response.

I don’t think I have an issue with the ONNX model. In any case, I have a script that takes in the ONNX model and tries to convert it to TensorRT.
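For reference, the script follows the usual TensorRT 7 Python API pattern for building an engine from an ONNX file, roughly like this (a sketch with placeholder paths and workspace size, not my exact trt_convert.py):

```python
# Sketch: ONNX -> TensorRT engine with the TensorRT 7.x Python API.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)  # this is the call that fails with CUDA error 35
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("/app/config/model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30  # 1 GiB; adjust to the GPU

engine = builder.build_engine(network, config)  # must run on the deployment GPU
with open("/app/config/model.trt", "wb") as f:  # placeholder path
    f.write(engine.serialize())
```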

I tested my drivers using the ‘nvidia-smi’ command and I get:


So I have CUDA 11.2 and driver 460.73.01.
It is a Tesla T4 on an AWS instance.

I am using container 21.02-py3:
https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/rel_21-02.html


It says it is running CUDA 11.2.0. It also requires driver release 460.27.04 or later.

So my CUDA version matches and my driver version matches. However, when I run my trt_convert.py script, I get the error:
[TensorRT] ERROR: CUDA initialization failure with error 35. Please check your CUDA installation: CUDA Installation Guide for Linux
Traceback (most recent call last):
File “/app/config/trt_convert.py”, line 5, in
builder = trt.Builder(TRT_LOGGER)
TypeError: pybind11::init(): factory function returned nullptr
The command ‘/bin/sh -c python3 /app/config/trt_convert.py’ returned a non-zero code: 1

So after syncing up all the versions, I am confused as to why the issue is still happening. I looked up others who have hit the same error:

Some suggested solutions:

  1. Inside /etc/docker/daemon.json, add the line "default-runtime": "nvidia"
  2. Exposing the GPU (but when I run nvidia-smi, my Tesla T4 is detected)
  3. Changing the container
  4. The user was not in the docker group; add the user account to the docker group with sudo usermod -aG docker $USER
  5. Call `torch.cuda.current_device()` first
  6. Despite nvidia-smi working properly, could CUDA still be badly installed? How do I know?

Could you suggest an approach to tackle the problem? It seems like 2 is not applicable. I am not sure about 3; do you have known issues with 21.02-py3? 4 seems to have worked for someone else; do you think it is viable for me? I am an AWS user, but note that I have managed to run torch code fine before. Would 5 work? Would 6 work?
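For 5 and 6, the quick in-container check I have in mind is something like this (a sketch, assuming torch is also installed in the image; if torch can see the GPU but trt.Builder still fails, that would at least narrow things down):

```python
# Sketch: sanity-check CUDA from inside the container before running the conversion.
import torch
import tensorrt as trt

print("torch sees CUDA:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)  # currently dies here with CUDA initialization failure (error 35)
print("trt.Builder created OK")
```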

I see that the container also uses CUDA 11.2.0. I am not sure which x my CUDA 11.2.x is, but perhaps it is due to that? Should I attempt 21.03-py3, which uses CUDA 11.2.1? Also, could the drivers be an issue? The notes say a Tesla T4 MAY use 450 drivers, but is my driver here (460) workable? The TensorRT container release notes say it works generically with 460.27.04 or later.