[E] Error[1]: [genericReformat.cu::executeMemcpy::1334] Error Code 1: Cuda Runtime (invalid argument)

I am converting my ONNX model to a TensorRT engine on an AGX Xavier.
I successfully did this on JetPack 4.4 with TensorRT 7 using the following command:

./trtexec --onnx=xxx1.onnx --saveEngine=xxx.trt --minShapes=input_1:0:1x224x224x1 --optShapes=input_1:0:2x224x224x1 --maxShapes=input_1:0:2x224x224x1 --workspace=4096 --verbose --fp16

But when I switch to TensorRT 8.2 on JetPack 4.6, the same command fails with the messages below:

[09/07/2022-09:09:28] [I] Engine built in 240.911 sec.
[09/07/2022-09:09:28] [V] [TRT] Using cublas as a tactic source
[09/07/2022-09:09:28] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1384, GPU 17322 (MiB)
[09/07/2022-09:09:28] [V] [TRT] Using cuDNN as a tactic source
[09/07/2022-09:09:28] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1384, GPU 17322 (MiB)
[09/07/2022-09:09:28] [V] [TRT] Total per-runner device persistent memory is 421888
[09/07/2022-09:09:28] [V] [TRT] Total per-runner host persistent memory is 39168
[09/07/2022-09:09:28] [V] [TRT] Allocated activation device memory of size 1608192
[09/07/2022-09:09:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 0, GPU 2 (MiB)
[09/07/2022-09:09:28] [I] Using random values for input input_1
[09/07/2022-09:09:28] [I] Created input binding for input_1 with dimensions 2x224x224x1
[09/07/2022-09:09:28] [I] Using random values for output concat
[09/07/2022-09:09:28] [I] Created output binding for concat with dimensions 2x12
[09/07/2022-09:09:28] [I] Using random values for output concat_before_shuffle
[09/07/2022-09:09:28] [I] Created output binding for concat_before_shuffle with dimensions -1x12x1x1
[09/07/2022-09:09:28] [I] Starting inference
[09/07/2022-09:09:28] [E] Error[1]: [genericReformat.cu::executeMemcpy::1334] Error Code 1: Cuda Runtime (invalid argument)
[09/07/2022-09:09:28] [E] Error occurred during inference
&&&& FAILED TensorRT.trtexec [TensorRT v8201] # 

Here is my ONNX model:
test_nvidia.onnx (708.1 KB)

Hi,

We have tested your model with JetPack 5.0.2 GA on Xavier.
It works correctly there, so this appears to be a known issue that has already been fixed in TensorRT 8.4.

$ /usr/src/tensorrt/bin/trtexec --onnx=test_nvidia.onnx --saveEngine=test_nvidia.trt --minShapes=input_1:0:1x224x224x1 --optShapes=input_1:0:2x224x224x1 --maxShapes=input_1:0:2x224x224x1 --workspace=4096 --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --onnx=test_nvidia.onnx --saveEngine=test_nvidia.trt --minShapes=input_1:0:1x224x224x1 --optShapes=input_1:0:2x224x224x1 --maxShapes=input_1:0:2x224x224x1 --workspace=4096 --fp16
...
[09/08/2022-04:08:18] [I]
[09/08/2022-04:08:18] [I] === Performance summary ===
[09/08/2022-04:08:18] [I] Throughput: 1296.87 qps
[09/08/2022-04:08:18] [I] Latency: min = 0.775146 ms, max = 0.857544 ms, mean = 0.795577 ms, median = 0.7948 ms, percentile(99%) = 0.818726 ms
[09/08/2022-04:08:18] [I] Enqueue Time: min = 0.522705 ms, max = 0.784668 ms, mean = 0.57693 ms, median = 0.571777 ms, percentile(99%) = 0.681946 ms
[09/08/2022-04:08:18] [I] H2D Latency: min = 0.0172119 ms, max = 0.0831299 ms, mean = 0.0232541 ms, median = 0.0222168 ms, percentile(99%) = 0.0429688 ms
[09/08/2022-04:08:18] [I] GPU Compute Time: min = 0.751099 ms, max = 0.795776 ms, mean = 0.77008 ms, median = 0.769897 ms, percentile(99%) = 0.784668 ms
[09/08/2022-04:08:18] [I] D2H Latency: min = 0.00146484 ms, max = 0.0067749 ms, mean = 0.00224387 ms, median = 0.00219727 ms, percentile(99%) = 0.00415039 ms
[09/08/2022-04:08:18] [I] Total Host Walltime: 3.00183 s
[09/08/2022-04:08:18] [I] Total GPU Compute Time: 2.99792 s
[09/08/2022-04:08:18] [I] Explanations of the performance metrics are printed in the verbose logs.
[09/08/2022-04:08:18] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --onnx=test_nvidia.onnx --saveEngine=test_nvidia.trt --minShapes=input_1:0:1x224x224x1 --optShapes=input_1:0:2x224x224x1 --maxShapes=input_1:0:2x224x224x1 --workspace=4096 --fp16

Thanks.

Have you tested on TensorRT 8.2 and reproduced this error? Could you explain in more detail why this happens, and is there any workaround for it on TensorRT 8.2?

Hi,

We can reproduce the same error on TensorRT 8.2.

Do you have dependencies on JetPack 4.6.x?
If not, we recommend upgrading to the latest software release to get the fix.

Thanks.

Any more details? We just upgraded the whole project to JetPack 4.6 and TensorRT 8.2; upgrading again on such short notice is not possible.

Hi,

There are some issues when handling concatenate layers generated from PyTorch frameworks.

Since the concat layer is at the end of your model, you can simply set the tensors right before that layer as the model outputs.
For example, TensorRT 8.2 can run inference on your model after the following updates:

1. Install Graph Surgeon

$ git clone https://github.com/NVIDIA/TensorRT.git
$ cd TensorRT/tools/onnx-graphsurgeon/
$ git checkout 8.2.1
$ make build
$ make install

2. Update model output

import onnx_graphsurgeon as gs
import onnx

# Load the ONNX model into a graph-surgeon graph and index its tensors by name
graph = gs.import_onnx(onnx.load("test_nvidia.onnx"))
tmap = graph.tensors()

# Original single output (after the concat) and the two tensors feeding the concat
out_old  = tmap['Identity:0']
out_new1 = tmap['StatefulPartitionedCall/landmark_model/landmark/Sigmoid:0']
out_new2 = tmap['StatefulPartitionedCall/landmark_model/dense/Tanh:0']

# Give the new outputs explicit shapes (dynamic batch) and reuse the original output's dtype
out_new1.shape = ['unk__342', 8]
out_new2.shape = ['unk__342', 4]
out_new1.dtype = out_old.dtype
out_new2.dtype = out_old.dtype

# Replace the graph output with the two pre-concat tensors and drop the now-unused concat
graph.outputs = [out_new1, out_new2]
graph.cleanup()
onnx.save(gs.export_onnx(graph), "updated_model.onnx")

3. Test

$ /usr/src/tensorrt/bin/trtexec --onnx=updated_model.onnx --saveEngine=test_nvidia.trt --minShapes=input_1:0:1x224x224x1 --optShapes=input_1:0:2x224x224x1 --maxShapes=input_1:0:2x224x224x1 --workspace=4096 --fp16
...
[12/17/2021-00:26:23] [I] === Performance summary ===
[12/17/2021-00:26:23] [I] Throughput: 1148.36 qps
[12/17/2021-00:26:23] [I] Latency: min = 0.82959 ms, max = 1.32324 ms, mean = 0.859928 ms, median = 0.853027 ms, percentile(99%) = 1.11035 ms
[12/17/2021-00:26:23] [I] End-to-End Host Latency: min = 0.835449 ms, max = 1.33472 ms, mean = 0.870302 ms, median = 0.863434 ms, percentile(99%) = 1.11719 ms
[12/17/2021-00:26:23] [I] Enqueue Time: min = 0.557617 ms, max = 1.31372 ms, mean = 0.660193 ms, median = 0.623657 ms, percentile(99%) = 1.13599 ms
[12/17/2021-00:26:23] [I] H2D Latency: min = 0.0119629 ms, max = 0.0354004 ms, mean = 0.0133912 ms, median = 0.0129395 ms, percentile(99%) = 0.0224609 ms
[12/17/2021-00:26:23] [I] GPU Compute Time: min = 0.813965 ms, max = 1.27515 ms, mean = 0.844077 ms, median = 0.838135 ms, percentile(99%) = 1.06982 ms
[12/17/2021-00:26:23] [I] D2H Latency: min = 0.000976562 ms, max = 0.113525 ms, mean = 0.00245846 ms, median = 0.00170898 ms, percentile(99%) = 0.0283203 ms
[12/17/2021-00:26:23] [I] Total Host Walltime: 3.00168 s
[12/17/2021-00:26:23] [I] Total GPU Compute Time: 2.90953 s
[12/17/2021-00:26:23] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/17/2021-00:26:23] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --onnx=updated_model.onnx --saveEngine=test_nvidia.trt --minShapes=input_1:0:1x224x224x1 --optShapes=input_1:0:2x224x224x1 --maxShapes=input_1:0:2x224x224x1 --workspace=4096 --fp16
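
If your application still needs the original 12-value "concat" output, you can recover it by joining the two new outputs on the host. Below is a minimal NumPy sketch; it assumes the original concat placed the 8-value Sigmoid output before the 4-value Tanh output, so please verify the order against the concat node in your original graph.

import numpy as np

# out1: (batch, 8) from the Sigmoid branch, out2: (batch, 4) from the Tanh branch.
# The order is an assumption; swap the arguments if your concat node lists its inputs the other way.
def merge_outputs(out1: np.ndarray, out2: np.ndarray) -> np.ndarray:
    return np.concatenate([out1, out2], axis=1)  # (batch, 12), matching the old 'concat' output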

Thanks.

Actually, my ONNX model was generated from TensorFlow.

After testing, I found that you just removed the last (concatenate) layer from the original network; in other words, you changed the single original output into two outputs. Although this is not what I was expecting, let me try whether it works. Also, do you have any multi-output TensorRT samples you can share with me?

Hi,

You can find a Python sample below:
https://elinux.org/Jetson/L4T/TRT_Customized_Example#OpenCV_with_PLAN_model
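
For reference, here is a minimal sketch of handling an engine with multiple output bindings using the TensorRT Python API and PyCUDA. This is only an illustration (the engine file name, input shape, and binding order are assumptions based on your log), not a copy of the linked sample:

import numpy as np
import pycuda.autoinit  # noqa: F401 (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("test_nvidia.trt", "rb") as f:          # assumed engine file name
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# The batch dimension is dynamic, so fix the input shape first (binding 0 is assumed to be the input)
context.set_binding_shape(0, (2, 224, 224, 1))

# Allocate host/device buffers for every binding (1 input + 2 outputs for the updated model)
host_bufs, dev_bufs, bindings = [], [], []
for i in range(engine.num_bindings):
    shape = tuple(context.get_binding_shape(i))
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = np.zeros(shape, dtype=dtype)
    dev = cuda.mem_alloc(host.nbytes)
    host_bufs.append(host)
    dev_bufs.append(dev)
    bindings.append(int(dev))

# Copy the input to the device, run inference, then copy every output back to the host
host_bufs[0][...] = np.random.rand(*host_bufs[0].shape).astype(host_bufs[0].dtype)
cuda.memcpy_htod(dev_bufs[0], host_bufs[0])
context.execute_v2(bindings)
for i in range(engine.num_bindings):
    if not engine.binding_is_input(i):
        cuda.memcpy_dtoh(host_bufs[i], dev_bufs[i])
        print(engine.get_binding_name(i), host_bufs[i].shape)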

Thanks.

Do you have one in C++?

Hi,

Please check below for a sample that has two output tensors.

https://github.com/NVIDIA/TensorRT/blob/release/8.2/samples/sampleSSD/sampleSSD.cpp#L336

    ...
    const float* detectionOut = static_cast<const float*>(buffers.getHostBuffer("detection_out"));
    const int* keepCount = static_cast<const int*>(buffers.getHostBuffer("keep_count"));
    ...

Thanks.
