Probable bug in the ONNX Range op

Description

When converting an ONNX model with trtexec, I get this error:

[05/05/2023-03:43:46] [E] [TRT] ModelImporter.cpp:746: ERROR: ModelImporter.cpp:186 In function parseGraph:
[6] Invalid Node - Range_93
[shapeContext.cpp::setShapeInterval::427] Error Code 2: Internal Error (Assertion success failed. intervals already set for the shape)

But according to the support matrix, the Range op should be supported: onnx-tensorrt/operators.md at 3b008c466bcb7375aaf5cabf51b289fd34d40c44 · onnx/onnx-tensorrt · GitHub
The error message suggests that the shape interval for the Range op's output is being set twice, which may be a bug in the parser.
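
For anyone who wants to isolate the op, a single-node Range model can be built with the onnx Python helpers. This is a minimal sketch; the scalar values and the range_only.onnx file name are placeholders for illustration, not taken from my model:

import onnx
from onnx import helper, TensorProto

# Scalar initializers feeding the Range node.
start = helper.make_tensor('start', TensorProto.FLOAT, [], [0.0])
limit = helper.make_tensor('limit', TensorProto.FLOAT, [], [8.0])
delta = helper.make_tensor('delta', TensorProto.FLOAT, [], [1.0])

# A graph containing nothing but a Range op (available since opset 11).
node = helper.make_node('Range', ['start', 'limit', 'delta'], ['out'])
out = helper.make_tensor_value_info('out', TensorProto.FLOAT, [8])
graph = helper.make_graph([node], 'range_test', inputs=[], outputs=[out],
                          initializer=[start, limit, delta])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid('', 11)])

onnx.checker.check_model(model)
onnx.save(model, 'range_only.onnx')  # then: trtexec --onnx=range_only.onnx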

Environment

TensorRT Version: 8.5
GPU Type: Tesla T4
Nvidia Driver Version: 460

Relevant Files

I cut a subgraph out of the original ONNX model for your convenience to reproduce:
subgraph.onnx (898 Bytes)
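
(For reference, a subgraph like this can be extracted with onnx.utils.extract_model. A sketch with placeholder names, not my actual extraction script; substitute the tensor names surrounding the failing Range node:)

import onnx.utils

# Cut the span between the named tensors into a standalone model.
# 'input_tensor' and 'output_tensor' are placeholder tensor names.
onnx.utils.extract_model(
    'model.onnx',       # full original model
    'subgraph.onnx',    # extracted subgraph is written here
    input_names=['input_tensor'],
    output_names=['output_tensor'],
)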

Steps To Reproduce

trtexec --onnx=subgraph.onnx

Update: I just tried TensorRT 8.6 EA. The Range problem goes away, although I didn't find a corresponding fix in the TRT 8.6 release notes. However, with TRT 8.6 the next problem is the error below:

[05/05/2023-18:30:11] [TRT] [V] =============== Computing costs for
[05/05/2023-18:30:11] [TRT] [V] *************** Autotuning format combination:  -> Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1), Float(64,1) ***************
[05/05/2023-18:30:11] [TRT] [V] --------------- Timing Runner: {ForeignNode[transformer.layers.0.attention.rotary_emb.inv_freq...Cast_11544]} (Myelin[0x80000023])
[05/05/2023-18:30:15] [TRT] [V] Skipping tactic 0 due to insufficient memory on requested size of 257698037760 detected for tactic 0x0000000000000000.
Try decreasing the workspace size with IBuilderConfig::setMemoryPoolLimit().
[05/05/2023-18:30:15] [TRT] [V] {ForeignNode[transformer.layers.0.attention.rotary_emb.inv_freq...Cast_11544]} (Myelin[0x80000023]) profiling completed in 4.39308 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[05/05/2023-18:30:16] [TRT] [V] Deleting timing cache: 1 entries, served 27 hits since creation.
[05/05/2023-18:30:16] [TRT] [E] 10: Could not find any implementation for node {ForeignNode[transformer.layers.0.attention.rotary_emb.inv_freq...Cast_11544]}.
[05/05/2023-18:30:16] [TRT] [E] 10: [optimizer.cpp::computeCosts::3873] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[transformer.layers.0.attention.rotary_emb.inv_freq...Cast_11544]}.)

The requested 257,698,037,760 bytes come to 240 GiB, far more than the GPU memory on a T4 (16 GB). How can I fix this? This is actually a popular language model.

We could successfully build the engine on our side.
Could you please try specifying the workspace option and make sure enough GPU memory is available?
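
If you are building through the API instead of trtexec, the same limit can be set with set_memory_pool_limit. A minimal sketch with the TensorRT Python API; the file name and the 2000 MiB value just mirror the trtexec command below:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Parse the ONNX model and report parser errors, if any.
parser = trt.OnnxParser(network, logger)
with open('subgraph.onnx', 'rb') as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit(1)

config = builder.create_builder_config()
# Cap the workspace pool at 2000 MiB, the API equivalent of
# trtexec --memPoolSize=workspace:2000 (the unit is MiB).
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2000 << 20)
engine = builder.build_serialized_network(network, config)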

[05/08/2023-07:02:02] [I] === Performance summary ===
[05/08/2023-07:02:02] [I] Throughput: 9975.79 qps
[05/08/2023-07:02:02] [I] Latency: min = 0.0488281 ms, max = 7.22034 ms, mean = 0.0687405 ms, median = 0.0554199 ms, percentile(90%) = 0.0644531 ms, percentile(95%) = 0.0654297 ms, percentile(99%) = 0.0834961 ms
[05/08/2023-07:02:02] [I] Enqueue Time: min = 0.0272217 ms, max = 7.20581 ms, mean = 0.0535737 ms, median = 0.0400391 ms, percentile(90%) = 0.0491333 ms, percentile(95%) = 0.0519409 ms, percentile(99%) = 0.0640869 ms
[05/08/2023-07:02:02] [I] H2D Latency: min = 0.0133057 ms, max = 0.216431 ms, mean = 0.0155637 ms, median = 0.0155029 ms, percentile(90%) = 0.0157471 ms, percentile(95%) = 0.0158691 ms, percentile(99%) = 0.0172119 ms
[05/08/2023-07:02:02] [I] GPU Compute Time: min = 0.0268555 ms, max = 7.19678 ms, mean = 0.0455767 ms, median = 0.0322266 ms, percentile(90%) = 0.0411377 ms, percentile(95%) = 0.0419312 ms, percentile(99%) = 0.0569458 ms
[05/08/2023-07:02:02] [I] D2H Latency: min = 0.00634766 ms, max = 0.294922 ms, mean = 0.00760022 ms, median = 0.00769043 ms, percentile(90%) = 0.00790405 ms, percentile(95%) = 0.00805664 ms, percentile(99%) = 0.00836182 ms
[05/08/2023-07:02:02] [I] Total Host Walltime: 3.00016 s
[05/08/2023-07:02:02] [I] Total GPU Compute Time: 1.36407 s
[05/08/2023-07:02:02] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[05/08/2023-07:02:02] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[05/08/2023-07:02:02] [W] * GPU compute time is unstable, with coefficient of variance = 474.665%.
[05/08/2023-07:02:02] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.

&&&& PASSED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=subgraph.onnx --memPoolSize=workspace:2000 --verbose

Hi, thanks for the reply. I have no doubt the subgraph can be built with TRT 8.6. I raised two questions in this post (my fault, maybe I should have opened a new one). The first is that the Range op fails under TRT 8.5. The second is about Myelin, and it happens on the whole graph, not on this subgraph. Maybe I should open a new post for the second problem and provide the whole ONNX graph. If you would like to look into the first problem, the attached subgraph under TRT 8.5 should be useful.