CUDA initialization failure when converting TRT model with a different GPU

Description

Hi, I basically have a Dockerfile that successfully builds on a machine with a Tesla V100, but fails on a machine with a Tesla T4. I have uploaded the Dockerfile below. It fails at the line “RUN python3.7 /app/config/trt_convert.py”.
The error is: [TensorRT] Error: CUDA initialization failure with error 35. Please check your CUDA installation: Installation Guide Linux :: CUDA Toolkit Documentation

Why does this issue occur when the same base image is used and the same CUDA, cuDNN and TensorRT versions are installed?
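From the CUDA documentation, error 35 corresponds to cudaErrorInsufficientDriver, i.e. the host driver is older than the CUDA runtime that the library in the image was built against. To compare what each side reports, a check along these lines should work (a minimal sketch using ctypes; the libcudart soname below is a guess and depends on the toolkit inside the image):

```python
# Sketch: compare the CUDA version the host driver supports with the runtime in the image.
import ctypes

driver = ctypes.CDLL("libcuda.so.1")        # provided by the host driver, mounted into the container
runtime = ctypes.CDLL("libcudart.so.10.0")  # soname depends on the CUDA toolkit in the image

drv, rt = ctypes.c_int(0), ctypes.c_int(0)
print("cuInit returned:", driver.cuInit(0))  # non-zero means driver-level init failed
driver.cuDriverGetVersion(ctypes.byref(drv))
runtime.cudaRuntimeGetVersion(ctypes.byref(rt))

# Versions are encoded as 1000*major + 10*minor, e.g. 10000 -> CUDA 10.0
print(f"driver supports CUDA {drv.value}, runtime is CUDA {rt.value}")
```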

Environment

TensorRT Version: 7.0.0
GPU Type: V100 vs T4
CUDA Version: 10.0.130
CUDNN Version: 7.6.5.32
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.7
TensorFlow Version (if applicable): NIL
PyTorch Version (if applicable): NIL
Baremetal or Container (if container which image + tag): Base Ubuntu 18.04 image

Relevant Files

Dockerfile (5.9 KB)

Hi,

We recommend using the pre-built TensorRT containers to avoid setup-related issues; you can customize the image on top of them.

Or please refer to TensorRT/docker at main · NVIDIA/TensorRT · GitHub

Thank you.

Hi,

I attempted to use the image 20.01-py3 from TensorRT | NVIDIA NGC. I changed my first line to nvcr.io/nvidia/tensorrt:20.01-py3, and then successfully converted a darknet file to ONNX and then to TensorRT and generated the expected inference results on my local machine, which runs a Tesla V100. As I understand it, this container uses TensorRT 7.0.0.11 (that’s the version reported when I imported it), so it is identical to my previously built image in that sense.

However, when I switch to an AWS GPU instance with a Tesla T4, using the EXACT same image, it fails again. I tried 2 runs on AWS:

  1. Doing the darknet-to-ONNX conversion, followed by the ONNX-to-TensorRT conversion, both while building the image on AWS.
  2. Doing the darknet-to-ONNX conversion on my local machine, then copying the ONNX file into AWS and doing the ONNX-to-TensorRT conversion there.

For 1: I receive the error -
151 conv 256 1 x 1 / 1 64 x 36 x 256 → 64 x 36 x 256
The command ‘/bin/sh -c python3 /app/config/darknet2onnx/demo_darknet2onnx.py /app/config/mobius-yolov4-csp-exp24.cfg /app/config/mobius_exp24.names /app/config/mobius-yolov4-csp-exp24_best.weights /app/config/0.jpg 1’ returned a non-zero code: 137

For 2: I receive an error identical to the one above - [TensorRT] Error: CUDA initialization failure with error 35. Please check your CUDA installation: Installation Guide Linux :: CUDA Toolkit Documentation

I have these questions:

  1. Should I be converting darknet to ONNX on the same machine on which I convert ONNX to TensorRT? Does this matter at all, and would it resolve the second error above?
  2. Error code 137 seems to be a memory issue from what I read. Why would this happen at all?
  3. Should I upgrade to a higher version of the container? But again, why would that fix anything, since it runs perfectly on my local machine (Tesla V100)? I checked the CUDA compute capability for both the T4 (7.5) and the V100 (7.0); see the quick check sketched after this list. From Wikipedia: “CUDA SDK 10.0 – 10.2 support for compute capability 3.0 – 7.5 (Kepler, Maxwell, Pascal, Volta, Turing). Last version with support for compute capability 3.0 and 3.2 (Kepler in part). 10.2 is the last official release for macOS, as support will not be available for macOS in newer releases.” So shouldn’t CUDA 10.2, which is what I am using in this image, support the T4?
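Here is the quick compute-capability check I mean for question 3, run on each machine (a sketch using the CUDA driver API via ctypes; the attribute IDs 75 and 76 are taken to be the major/minor compute-capability attributes from cuda.h):

```python
# Sketch: query the compute capability of device 0 via the CUDA driver API.
import ctypes

cuda = ctypes.CDLL("libcuda.so.1")
assert cuda.cuInit(0) == 0, "cuInit failed; the CUDA driver is not usable here"

dev, major, minor = ctypes.c_int(), ctypes.c_int(), ctypes.c_int()
cuda.cuDeviceGet(ctypes.byref(dev), 0)
cuda.cuDeviceGetAttribute(ctypes.byref(major), 75, dev)  # CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR
cuda.cuDeviceGetAttribute(ctypes.byref(minor), 76, dev)  # CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR
print(f"compute capability: {major.value}.{minor.value}")  # expect 7.0 on V100, 7.5 on T4
```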

Please advise. This is a very bizarre and frustrating problem, and I have thrown almost everything at it.

I’ve checked the CUDA driver version and found that mine is 450.51.06. The Container Release Notes :: NVIDIA Deep Learning TensorRT Documentation state that “Release 20.07 is based on NVIDIA CUDA 11.0.194, which requires NVIDIA Driver release 450 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx or 440.30.” Does that mean I should not have a driver issue, since my driver is >= 450?

Are you facing this issue with generating the ONNX model?


We don’t need to generate the ONNX model on the same machine, but the TensorRT engine needs to be built on the same machine on which we run inference.
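For context, a serialized engine (plan file) is specific to the GPU and TensorRT version it was built with, which is why it has to be built where it will run. The deployment side looks roughly like this (a sketch; the path is a placeholder):

```python
# Sketch: deserializing a plan file that was built on this same GPU / TensorRT version.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open("/app/config/model.trt", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    # Returns None / logs errors if the plan was built for a different GPU or TensorRT version.
    engine = runtime.deserialize_cuda_engine(f.read())
```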


We couldn’t follow it exactly; could you please share the error log with us? The (2) initialization error you’re facing could be due to a package incompatibility. Please make sure you’re able to run the CUDA samples successfully first: CUDA Installation Guide for Linux


We recommend that you use the latest image and the latest TensorRT version. Version 20.07 is very old; there could be issues that are resolved in later versions.

  1. Are you facing this issue with generating the ONNX model?
  2. On replicating error code 137:
    Yes, I am facing this issue. I rebuilt and tested the conversion again, with the AWS Tesla T4 running driver 450.51.06. I used the TensorRT container 21.02-py3. I receive the exact same error.

From the documentation, it states: “Release 21.02 is based on [NVIDIA CUDA 11.2.0], which requires [NVIDIA Driver] release 460.27.04 or later. However, if you are running on Data Center GPUs (formerly Tesla), for example, T4, you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), 450.51 (or later R450).” This explicitly states that it will work on my hardware and driver.

At the same time, it continues to work properly on my local machine, which has a Tesla V100 with driver 418.165.02, which isn’t even among the stated supported drivers.

As for the exact error log, there isn’t much detail on this error, other than the darknet model being printed out and the command being shown to exit with error code 137. Here is an excerpt of the very long logs:
"
149 conv 256 1 x 1 / 1 64 x 36 x 256 → 64 x 36 x 256
150 route 148
151 conv 256 1 x 1 / 1 64 x 36 x 256 → 64 x 36 x 256
The command ‘/bin/sh -c python3 /app/config/darknet2onnx/demo_darknet2onnx.py /app/config/mobius-yolov4-csp-exp24.cfg /app/config/mobius_exp24.names /app/config/mobius-yolov4-csp-exp24_best.weights /app/config/0.jpg 1’ returned a non-zero code: 137

[Container] 2022/09/02 15:00:53 Command did not exit successfully docker build --no-cache --build-arg no_proxy=$no_proxy --build-arg NO_PROXY=$no_proxy --build-arg http_proxy=$http_proxy --build-arg HTTP_PROXY=$http_proxy --build-arg HTTPS_PROXY=$http_proxy --build-arg https_proxy=$http_proxy --build-arg RUNTIME_BASE=$RUNTIME_BASE --build-arg GPU=True --build-arg CUDNN_HALF=True --build-arg SQS_QUEUE_URL=$SQS_QUEUE_URL -t $REPOSITORY_URI:$STAGE . exit status 137
"
I’m not sure your recommendation to run the CUDA sample first would prove anything, though, as this is a container provided by NVIDIA, and it runs perfectly on the Tesla V100 but breaks specifically on the AWS Tesla T4. So I’m not sure there is any package incompatibility at all. In fact, the V100’s driver is a less explicit match for the release notes than the T4’s.

  1. We recommend that you use the latest image and the latest TensorRT version. Version 20.07 is very old; there could be issues that are resolved in later versions.

Which version do you recommend? The issue is that my code was built on Python 3.7, and transitioning to higher Python versions (which come with your newer image versions) requires additional code edits. Do you have any plausible insight into this issue? I have already tried 6 of your containers to no avail, so what assurance is there that moving to a higher version and changing all my code will yield results?

Sorry, it’s not clear whether you are facing this issue when building the ONNX model or when building the TensorRT engine. Were you able to successfully generate the ONNX model?

We recommended making sure CUDA is running correctly because the above error is more related to CUDA/driver.

You can try the latest TensorRT image, 22.08, which has Python 3.8; Python 3.8 doesn’t have many changes compared to 3.7.

Thank you.

Hi,

Thank you for the response.

I don’t think I have an issue with the ONNX model. In any case, I have a script that takes in the ONNX model and tries to convert it to TensorRT.
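For reference, the script follows the usual TensorRT 7 Python API pattern for building an engine from an ONNX file, roughly like this (a sketch with placeholder paths and workspace size, not my exact trt_convert.py):

```python
# Sketch: ONNX -> TensorRT engine with the TensorRT 7.x Python API.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)  # this is the call that fails with CUDA error 35
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("/app/config/model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parse failed")

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30  # 1 GiB; adjust to the GPU

engine = builder.build_engine(network, config)  # must run on the deployment GPU
with open("/app/config/model.trt", "wb") as f:  # placeholder path
    f.write(engine.serialize())
```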

I tested my drivers using the ‘nvidia-smi’ command and I get:


So I have CUDA 11.2 and driver 460.73.01.
It is a Tesla T4 on an AWS instance.

I am using container 21.02-py3:
https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/rel_21-02.html


It says it is running CUDA 11.2.0. It also requires driver release 460.27.04 or later.

So my CUDA version matches and my driver version matches. However, when I run my trt_convert.py script, I get the error:
[TensorRT] ERROR: CUDA initialization failure with error 35. Please check your CUDA installation: CUDA Installation Guide for Linux
Traceback (most recent call last):
File “/app/config/trt_convert.py”, line 5, in
builder = trt.Builder(TRT_LOGGER)
TypeError: pybind11::init(): factory function returned nullptr
The command ‘/bin/sh -c python3 /app/config/trt_convert.py’ returned a non-zero code: 1

So after syncing up all the versions, I am confused as to why the issue is still happening. I looked up others who have hit the same error:

Some suggested solutions:

  1. Inside /etc/docker/daemon.json, add the line "default-runtime": "nvidia"
  2. Exposing the GPU (but when I run nvidia-smi, my Tesla T4 is detected)
  3. Changing the container
  4. The user was not in the docker group; add the user account to the docker group with sudo usermod -aG docker $USER
  5. Call `torch.cuda.current_device()` first
  6. Despite nvidia-smi working properly, could CUDA still be badly installed? How do I know?

Could you suggest an approach to tackle the problem? It seems like 2 is not applicable. I am not sure about 3; do you have known issues with 21.02-py3? 4 seems to have worked for someone else; do you think it is viable for me? I am an AWS user, but note that I have managed to run torch code fine before. Would 5 work? Would 6 work?
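For 5 and 6, the quick in-container check I have in mind is something like this (a sketch, assuming torch is also installed in the image; if torch can see the GPU but trt.Builder still fails, that would at least narrow things down):

```python
# Sketch: sanity-check CUDA from inside the container before running the conversion.
import torch
import tensorrt as trt

print("torch sees CUDA:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)  # currently dies here with CUDA initialization failure (error 35)
print("trt.Builder created OK")
```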

I see that the container also uses CUDA 11.2.0. I am not sure which x my CUDA 11.2.x is, but perhaps it is due to that? Should I attempt 21.03-py3, which uses CUDA 11.2.1? Also, could the drivers be an issue? The notes say a Tesla T4 MAY use 450 drivers, but is my driver here (460) workable? The TensorRT container release notes say it works generically with 460.27.04 or later.