LSTM model cannot have more than one optimization profile

Description

We are trying to convert an LSTM model that runs on a 4D tensor to TensorRT. We chose to initialize h0 and c0 with the mean of the tensor, which is why the code is quite long; for more details, the PyTorch script is also provided. We want to use the multithreading functionality, so we need more than one optimization profile, but the code gives

[TensorRT] ERROR: 2: [standardEngineBuilder.cpp::makeEngineFromGraph::1288] Error Code 2: Internal Error (Assertion engineRegions.count(it->name) == 0 failed.)

if I add a for loop that adds multiple optimization profiles to the config.
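A minimal sketch of the kind of loop I mean (the input name and the min/opt/max shapes are placeholders for the real dynamic 4D input; the actual script is in the linked folder):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, num_profiles=2):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            return None, None

    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30

    # This is the part that triggers the error: registering more than one
    # optimization profile. "input" and the shapes below are placeholders.
    for _ in range(num_profiles):
        profile = builder.create_optimization_profile()
        profile.set_shape("input", (1, 3, 64, 64), (4, 3, 128, 128), (8, 3, 256, 256))
        config.add_optimization_profile(profile)

    engine = builder.build_engine(network, config=config)
    context = engine.create_execution_context()  # fails here because engine is None
    return engine, context
```

With num_profiles=1 the engine builds fine; with two or more profiles build_engine returns None after the internal-error assertion.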

Environment

Docker Env: nvcr.io/nvidia/tensorrt:21.02-py3
GPU Type: tried both an RTX 2080 Ti and an RTX 3090
Nvidia Driver Version: 470.57.02
PyTorch Version (if applicable): tried PyTorch 1.4, 1.6, and 1.7; with 1.7 the export has a no-output bug

Relevant Files

https://drive.google.com/drive/folders/19OD2SapPcVkfLbqZh4j2YnqcWGfzHX1G?usp=sharing

Steps To Reproduce

run

python export2onnx.py

You can skip this step, since I also provide the ONNX file.
Then just run

python onnx2tensorrt.py

Full error messages

For the export-to-ONNX script there are some warnings:

Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  con_h = torch.unsqueeze(con_h, 2).repeat(1, 1, int(feature_h), 1)
/home/agent_m/temp/minimal_case/lstm.py:54: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  con_c = torch.unsqueeze(con_c, 2).repeat(1, 1, int(feature_h), 1)
/home/agent_m/miniconda3/envs/pipeline_env_3/lib/python3.6/site-packages/torch/onnx/symbolic_opset9.py:1668: UserWarning: Exporting a model to ONNX with a batch_size other than 1, with a variable length with LSTM can cause an error when running the ONNX model with a different batch size. Make sure to save the model with a batch size of 1, or define the initial states (h0/c0) as inputs of the model. 
  "or define the initial states (h0/c0) as inputs of the model. ")

For the TensorRT script:

[TensorRT] WARNING: onnx2trt_utils.cpp:362: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[TensorRT] WARNING: onnx2trt_utils.cpp:390: One or more weights outside the range of INT32 was clamped
Completed parsing of ONNX file
onnx2tensorrt.py:28: DeprecationWarning: Use build_serialized_network instead.
  engine = builder.build_engine(network, config=trt_config)
[TensorRT] WARNING: Detected invalid timing cache, setup a local cache instead
[TensorRT] ERROR: 2: [standardEngineBuilder.cpp::makeEngineFromGraph::1288] Error Code 2: Internal Error (Assertion engineRegions.count(it->name) == 0 failed.)
Traceback (most recent call last):
  File "onnx2tensorrt.py", line 39, in <module>
    engine, context = build_engine('onnx_out.onnx')
  File "onnx2tensorrt.py", line 29, in build_engine
    context = engine.create_execution_context()
AttributeError: 'NoneType' object has no attribute 'create_execution_context'

I added int(feature_h) only because I was trying to debug this myself and wanted to figure out whether those dynamic sizes were the problem; it turns out they are not.

Hi,
The links below might be useful for you:
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#thread-safety
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-priorities
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html
For multi-threading/streaming, we suggest you use DeepStream or Triton.
For more details, we recommend raising the query on the DeepStream or Triton forum.

Thanks!

So you mean I should give up debugging and switch to new tools?
Or do you believe this bug will disappear on a different device?
It has nothing to do with multithreading, since there are other uses for adding multiple optimization profiles.
As for the links: this is not about multi-streaming. It should be okay to add multiple optimization profiles to the config, and I have successfully done that with another model. For the model I am working on, if I tweak the code a little to remove some modules, it also works. What I mean is that, to me, this is unexpected behavior on the TensorRT side, and I wonder what is so special about the code I provided that triggers the bug (repeat to a dynamic size? repeating too many times? the combination of view, permute, and repeat?). I provided a somewhat minimal example. I already spent a whole day locating the problematic part of the PyTorch code and thinking about what makes it special, but I may still need help from experts.
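To make the question concrete, the suspicious pattern looks roughly like the sketch below: initial states built from the mean of a dynamic 4D tensor, expanded with unsqueeze/repeat over a dynamic spatial size, then combined with permute/view before the LSTM. Apart from the two repeat lines shown in the trace warnings, the shapes, sizes, and structure are illustrative, not the real lstm.py:

```python
import torch
import torch.nn as nn

class MeanInitLSTM(nn.Module):
    # Illustrative sketch only -- not the real lstm.py.
    def __init__(self, channels=32, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)

    def forward(self, x):                          # x: (N, C, H, W)
        n, c, feature_h, feature_w = x.shape

        # initial states from the mean over the height, repeated back to the
        # dynamic height -- the two lines flagged by the TracerWarnings
        con_h = x.mean(dim=2)                      # (N, C, W)
        con_c = x.mean(dim=2)
        con_h = torch.unsqueeze(con_h, 2).repeat(1, 1, int(feature_h), 1)  # (N, C, H, W)
        con_c = torch.unsqueeze(con_c, 2).repeat(1, 1, int(feature_h), 1)

        # permute/view so every image row becomes one sequence for the LSTM
        seq = x.permute(0, 2, 3, 1).contiguous().view(n * feature_h, feature_w, c)
        h0 = con_h.permute(0, 2, 3, 1).contiguous().view(n * feature_h, feature_w, c)
        h0 = h0.mean(dim=1, keepdim=True).permute(1, 0, 2).contiguous()    # (1, N*H, C)
        c0 = con_c.permute(0, 2, 3, 1).contiguous().view(n * feature_h, feature_w, c)
        c0 = c0.mean(dim=1, keepdim=True).permute(1, 0, 2).contiguous()

        out, _ = self.lstm(seq, (h0, c0))
        return out
```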

The bug occurs in the engine-building stage when I add more than one optimization profile to the config, as stated above. And there are no Google results when I search for the assertion in the error message.

Thanks!