[E] Error[1]: [genericReformat.cu::executeMemcpy::1334] Error Code 1: Cuda Runtime (invalid argument)

I am converting my ONNX model to a TensorRT engine on an AGX Xavier.
I successfully did this on JetPack 4.4 with TensorRT 7 using the following command:

./trtexec --onnx=xxx1.onnx --saveEngine=xxx.trt --minShapes=input_1:0:1x224x224x1 --optShapes=input_1:0:2x224x224x1 --maxShapes=input_1:0:2x224x224x1 --workspace=4096 --verbose --fp16

But when I switch to TensorRT 8.2 on JetPack 4.6, the same command fails with the messages below:

[09/07/2022-09:09:28] [I] Engine built in 240.911 sec.
[09/07/2022-09:09:28] [V] [TRT] Using cublas as a tactic source
[09/07/2022-09:09:28] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1384, GPU 17322 (MiB)
[09/07/2022-09:09:28] [V] [TRT] Using cuDNN as a tactic source
[09/07/2022-09:09:28] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1384, GPU 17322 (MiB)
[09/07/2022-09:09:28] [V] [TRT] Total per-runner device persistent memory is 421888
[09/07/2022-09:09:28] [V] [TRT] Total per-runner host persistent memory is 39168
[09/07/2022-09:09:28] [V] [TRT] Allocated activation device memory of size 1608192
[09/07/2022-09:09:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +2, now: CPU 0, GPU 2 (MiB)
[09/07/2022-09:09:28] [I] Using random values for input input_1
[09/07/2022-09:09:28] [I] Created input binding for input_1 with dimensions 2x224x224x1
[09/07/2022-09:09:28] [I] Using random values for output concat
[09/07/2022-09:09:28] [I] Created output binding for concat with dimensions 2x12
[09/07/2022-09:09:28] [I] Using random values for output concat_before_shuffle
[09/07/2022-09:09:28] [I] Created output binding for concat_before_shuffle with dimensions -1x12x1x1
[09/07/2022-09:09:28] [I] Starting inference
[09/07/2022-09:09:28] [E] Error[1]: [genericReformat.cu::executeMemcpy::1334] Error Code 1: Cuda Runtime (invalid argument)
[09/07/2022-09:09:28] [E] Error occurred during inference
&&&& FAILED TensorRT.trtexec [TensorRT v8201] # 

Here is my ONNX model:
test_nvidia.onnx (708.1 KB)

Hi,

We have tested your model with JetPack 5.0.2 GA on Xavier.
It works correctly there, so this appears to be a known issue that has already been fixed in TensorRT 8.4.

$ /usr/src/tensorrt/bin/trtexec --onnx=test_nvidia.onnx --saveEngine=test_nvidia.trt --minShapes=input_1:0:1x224x224x1 --optShapes=input_1:0:2x224x224x1 --maxShapes=input_1:0:2x224x224x1 --workspace=4096 --fp16
&&&& RUNNING TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --onnx=test_nvidia.onnx --saveEngine=test_nvidia.trt --minShapes=input_1:0:1x224x224x1 --optShapes=input_1:0:2x224x224x1 --maxShapes=input_1:0:2x224x224x1 --workspace=4096 --fp16
...
[09/08/2022-04:08:18] [I]
[09/08/2022-04:08:18] [I] === Performance summary ===
[09/08/2022-04:08:18] [I] Throughput: 1296.87 qps
[09/08/2022-04:08:18] [I] Latency: min = 0.775146 ms, max = 0.857544 ms, mean = 0.795577 ms, median = 0.7948 ms, percentile(99%) = 0.818726 ms
[09/08/2022-04:08:18] [I] Enqueue Time: min = 0.522705 ms, max = 0.784668 ms, mean = 0.57693 ms, median = 0.571777 ms, percentile(99%) = 0.681946 ms
[09/08/2022-04:08:18] [I] H2D Latency: min = 0.0172119 ms, max = 0.0831299 ms, mean = 0.0232541 ms, median = 0.0222168 ms, percentile(99%) = 0.0429688 ms
[09/08/2022-04:08:18] [I] GPU Compute Time: min = 0.751099 ms, max = 0.795776 ms, mean = 0.77008 ms, median = 0.769897 ms, percentile(99%) = 0.784668 ms
[09/08/2022-04:08:18] [I] D2H Latency: min = 0.00146484 ms, max = 0.0067749 ms, mean = 0.00224387 ms, median = 0.00219727 ms, percentile(99%) = 0.00415039 ms
[09/08/2022-04:08:18] [I] Total Host Walltime: 3.00183 s
[09/08/2022-04:08:18] [I] Total GPU Compute Time: 2.99792 s
[09/08/2022-04:08:18] [I] Explanations of the performance metrics are printed in the verbose logs.
[09/08/2022-04:08:18] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --onnx=test_nvidia.onnx --saveEngine=test_nvidia.trt --minShapes=input_1:0:1x224x224x1 --optShapes=input_1:0:2x224x224x1 --maxShapes=input_1:0:2x224x224x1 --workspace=4096 --fp16

Thanks.

Have you tested on TensorRT 8.2 and reproduced this error? Could you explain in more detail why this happens, and is there any workaround for it on TensorRT 8.2?

Hi,

We can reproduce the same error on TensorRT 8.2.

Do you have dependencies on JetPack 4.6.x?
If not, we recommend upgrading to the latest software release to get the fix.

Thanks.

Any more details? We just upgraded the whole project to JetPack 4.6 and TensorRT 8.2; upgrading again on such short notice is not possible.

Hi,

There are some issues when handling concatenate layers generated from PyTorch frameworks.

Since the concat layer is at the end of your model, you can simply set the tensors right before that layer as the model outputs.
For example, TensorRT 8.2 can run inference on your model after the following updates:

1. Install Graph Surgeon

$ git clone https://github.com/NVIDIA/TensorRT.git
$ cd TensorRT/tools/onnx-graphsurgeon/
$ git checkout 8.2.1
$ make build
$ make install

2. Update model output

import onnx_graphsurgeon as gs
import onnx

# Load the ONNX model into a graph-surgeon graph and index its tensors by name
graph = gs.import_onnx(onnx.load("test_nvidia.onnx"))
tmap = graph.tensors()

# Original single output (after the concat) and the two tensors feeding the concat
out_old  = tmap['Identity:0']
out_new1 = tmap['StatefulPartitionedCall/landmark_model/landmark/Sigmoid:0']
out_new2 = tmap['StatefulPartitionedCall/landmark_model/dense/Tanh:0']

# Give the new outputs explicit shapes (dynamic batch) and reuse the original output's dtype
out_new1.shape = ['unk__342', 8]
out_new2.shape = ['unk__342', 4]
out_new1.dtype = out_old.dtype
out_new2.dtype = out_old.dtype

# Replace the graph output with the two pre-concat tensors and drop the now-unused concat
graph.outputs = [out_new1, out_new2]
graph.cleanup()
onnx.save(gs.export_onnx(graph), "updated_model.onnx")

3. Test

$ /usr/src/tensorrt/bin/trtexec --onnx=updated_model.onnx --saveEngine=test_nvidia.trt --minShapes=input_1:0:1x224x224x1 --optShapes=input_1:0:2x224x224x1 --maxShapes=input_1:0:2x224x224x1 --workspace=4096 --fp16
...
[12/17/2021-00:26:23] [I] === Performance summary ===
[12/17/2021-00:26:23] [I] Throughput: 1148.36 qps
[12/17/2021-00:26:23] [I] Latency: min = 0.82959 ms, max = 1.32324 ms, mean = 0.859928 ms, median = 0.853027 ms, percentile(99%) = 1.11035 ms
[12/17/2021-00:26:23] [I] End-to-End Host Latency: min = 0.835449 ms, max = 1.33472 ms, mean = 0.870302 ms, median = 0.863434 ms, percentile(99%) = 1.11719 ms
[12/17/2021-00:26:23] [I] Enqueue Time: min = 0.557617 ms, max = 1.31372 ms, mean = 0.660193 ms, median = 0.623657 ms, percentile(99%) = 1.13599 ms
[12/17/2021-00:26:23] [I] H2D Latency: min = 0.0119629 ms, max = 0.0354004 ms, mean = 0.0133912 ms, median = 0.0129395 ms, percentile(99%) = 0.0224609 ms
[12/17/2021-00:26:23] [I] GPU Compute Time: min = 0.813965 ms, max = 1.27515 ms, mean = 0.844077 ms, median = 0.838135 ms, percentile(99%) = 1.06982 ms
[12/17/2021-00:26:23] [I] D2H Latency: min = 0.000976562 ms, max = 0.113525 ms, mean = 0.00245846 ms, median = 0.00170898 ms, percentile(99%) = 0.0283203 ms
[12/17/2021-00:26:23] [I] Total Host Walltime: 3.00168 s
[12/17/2021-00:26:23] [I] Total GPU Compute Time: 2.90953 s
[12/17/2021-00:26:23] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/17/2021-00:26:23] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8201] # /usr/src/tensorrt/bin/trtexec --onnx=updated_model.onnx --saveEngine=test_nvidia.trt --minShapes=input_1:0:1x224x224x1 --optShapes=input_1:0:2x224x224x1 --maxShapes=input_1:0:2x224x224x1 --workspace=4096 --fp16
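
If your application still needs the original 12-value "concat" output, you can recover it by joining the two new outputs on the host. Below is a minimal NumPy sketch; it assumes the original concat placed the 8-value Sigmoid output before the 4-value Tanh output, so please verify the order against the concat node in your original graph.

import numpy as np

# out1: (batch, 8) from the Sigmoid branch, out2: (batch, 4) from the Tanh branch.
# The order is an assumption; swap the arguments if your concat node lists its inputs the other way.
def merge_outputs(out1: np.ndarray, out2: np.ndarray) -> np.ndarray:
    return np.concatenate([out1, out2], axis=1)  # (batch, 12), matching the old 'concat' output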

Thanks.

Actually, my ONNX model was generated from TensorFlow.

After testing, I found that you just removed the last (concatenate) layer from the original network; in other words, you changed the single original output into two outputs. Although this is not what I was expecting, let me try whether it works. Also, do you have any multi-output TensorRT samples you can share with me?

Hi,

You can find a Python sample below:
https://elinux.org/Jetson/L4T/TRT_Customized_Example#OpenCV_with_PLAN_model
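
For reference, here is a minimal sketch of handling an engine with multiple output bindings using the TensorRT Python API and PyCUDA. This is only an illustration (the engine file name, input shape, and binding order are assumptions based on your log), not a copy of the linked sample:

import numpy as np
import pycuda.autoinit  # noqa: F401 (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("test_nvidia.trt", "rb") as f:          # assumed engine file name
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# The batch dimension is dynamic, so fix the input shape first (binding 0 is assumed to be the input)
context.set_binding_shape(0, (2, 224, 224, 1))

# Allocate host/device buffers for every binding (1 input + 2 outputs for the updated model)
host_bufs, dev_bufs, bindings = [], [], []
for i in range(engine.num_bindings):
    shape = tuple(context.get_binding_shape(i))
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = np.zeros(shape, dtype=dtype)
    dev = cuda.mem_alloc(host.nbytes)
    host_bufs.append(host)
    dev_bufs.append(dev)
    bindings.append(int(dev))

# Copy the input to the device, run inference, then copy every output back to the host
host_bufs[0][...] = np.random.rand(*host_bufs[0].shape).astype(host_bufs[0].dtype)
cuda.memcpy_htod(dev_bufs[0], host_bufs[0])
context.execute_v2(bindings)
for i in range(engine.num_bindings):
    if not engine.binding_is_input(i):
        cuda.memcpy_dtoh(host_bufs[i], dev_bufs[i])
        print(engine.get_binding_name(i), host_bufs[i].shape)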

Thanks.

Do you have one in C++?

Hi,

Please check below for a sample that has two output tensors.

https://github.com/NVIDIA/TensorRT/blob/release/8.2/samples/sampleSSD/sampleSSD.cpp#L336

    ...
    const float* detectionOut = static_cast<const float*>(buffers.getHostBuffer("detection_out"));
    const int* keepCount = static_cast<const int*>(buffers.getHostBuffer("keep_count"));
    ...

Thanks.
