Trtexec conversion of bloomz-7b1 failed due to no implementation for Reshape_7

Description

I’m trying to convert the bigscience/bloomz-7b1 LLM from ONNX format to TRT format on a Jetson AGX Orin 64GB, and it fails with the following log:

[06/15/2023-17:15:20] [W] [TRT] Unknown embedded device detected. Using 59655MiB as the allocation cap for memory on embedded devices.
[06/15/2023-17:15:20] [W] [TRT] Unknown embedded device detected. Using 59655MiB as the allocation cap for memory on embedded devices.
[06/15/2023-17:15:20] [W] [TRT] Unknown embedded device detected. Using 59655MiB as the allocation cap for memory on embedded devices.
[06/15/2023-17:15:20] [W] [TRT] Tactic Device request: 1024MB Available: 275MB. Device memory is insufficient to use tactic.
[06/15/2023-17:15:20] [W] [TRT] Skipping tactic 0 due to insufficient memory on requested size of 1024 detected for tactic 0x0000000000000001.
Try decreasing the workspace size with IBuilderConfig::setMemoryPoolLimit().
[06/15/2023-17:15:20] [W] [TRT] Unknown embedded device detected. Using 59655MiB as the allocation cap for memory on embedded devices.
[06/15/2023-17:15:20] [W] [TRT] Unknown embedded device detected. Using 59655MiB as the allocation cap for memory on embedded devices.
[06/15/2023-17:15:20] [W] [TRT] Unknown embedded device detected. Using 59655MiB as the allocation cap for memory on embedded devices.
[06/15/2023-17:15:31] [W] [TRT] Unknown embedded device detected. Using 59655MiB as the allocation cap for memory on embedded devices.
[06/15/2023-17:15:31] [W] [TRT] Unknown embedded device detected. Using 59655MiB as the allocation cap for memory on embedded devices.
[06/15/2023-17:15:31] [W] [TRT] Unknown embedded device detected. Using 59655MiB as the allocation cap for memory on embedded devices.
[06/15/2023-17:15:31] [W] [TRT] Tactic Device request: 1024MB Available: 350MB. Device memory is insufficient to use tactic.
[06/15/2023-17:15:31] [W] [TRT] Skipping tactic 0 due to insufficient memory on requested size of 1024 detected for tactic 0x0000000000000001.
Try decreasing the workspace size with IBuilderConfig::setMemoryPoolLimit().
[06/15/2023-17:15:31] [W] [TRT] Unknown embedded device detected. Using 59655MiB as the allocation cap for memory on embedded devices.
[06/15/2023-17:15:31] [W] [TRT] Unknown embedded device detected. Using 59655MiB as the allocation cap for memory on embedded devices.
[06/15/2023-17:15:32] [W] [TRT] Unknown embedded device detected. Using 59655MiB as the allocation cap for memory on embedded devices.
[06/15/2023-17:15:33] [W] [TRT] Unknown embedded device detected. Using 59655MiB as the allocation cap for memory on embedded devices.
[06/15/2023-17:15:33] [W] [TRT] Tactic Device request: 401MB Available: 399MB. Device memory is insufficient to use tactic.
[06/15/2023-17:15:33] [W] [TRT] Skipping tactic 0 due to insufficient memory on requested size of 401 detected for tactic 0x0000000000000000.
Try decreasing the workspace size with IBuilderConfig::setMemoryPoolLimit().
[06/15/2023-17:15:37] [E] Error[10]: [optimizer.cpp::computeCosts::3728] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[/transformer/h.10/self_attention/Cast.../transformer/h.10/self_attention/Reshape_7]}.)
[06/15/2023-17:15:37] [E] Error[2]: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[06/15/2023-17:15:37] [E] Engine could not be created from network
[06/15/2023-17:15:37] [E] Building engine failed
[06/15/2023-17:15:37] [E] Failed to create engine from model or file.
[06/15/2023-17:15:37] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # trtexec --onnx=bloomz_7b1.onnx --int8 --shapes=input_ids:1x1024 --saveEngine=bloomz_7b1_1x1024.trt

Two problems show up in the log:
1. Device memory is insufficient to use tactic.
2. Could not find any implementation for node {ForeignNode[/transformer/h.10/self_attention/Cast…/transformer/h.10/self_attention/Reshape_7]}.

For the out-of-memory warnings: my device is the 64GB RAM version and I set up 64GB of swap. During the conversion it used 59GB of RAM and 41GB of swap, so 20GB of swap was still free; I have no idea why device memory would be insufficient. You can see the memory usage in the memory_usage.log attachment.
For the missing implementation of the Reshape_7 node: several issues on GitHub suggest limiting memory usage with the --workspace parameter, but I also found in the release notes that --workspace is deprecated after TensorRT 8.4.
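Since --workspace is deprecated in TensorRT 8.4+, the workspace cap that the log suggests can instead be set through trtexec's --memPoolSize flag. A minimal sketch (the 4096 MiB value here is an arbitrary example, not a recommendation):

```shell
# Sketch: --memPoolSize replaces the deprecated --workspace flag (TRT 8.4+).
# Pool sizes are given in MiB by default; 4096 is an arbitrary example value.
trtexec --onnx=bloomz_7b1.onnx --int8 \
        --shapes=input_ids:1x1024 \
        --memPoolSize=workspace:4096 \
        --saveEngine=bloomz_7b1_1x1024.trt
```

When building through the API instead of trtexec, the equivalent knob is the one the warning itself names: IBuilderConfig::setMemoryPoolLimit() in C++, or config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, ...) in Python.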

Environment

TensorRT Version: 8.5.2.2
GPU Type: Jetson AGX Orin (64GB RAM)
Nvidia Driver Version:
CUDA Version:
CUDNN Version:
Operating System + Version: Ubuntu 20.04
Python Version (if applicable): 3.8
TensorFlow Version (if applicable): -
PyTorch Version (if applicable): 2.0
Baremetal or Container (if container which image + tag): -

Relevant Files

memory_usage.log (645.1 KB)
convert.log (5.4 KB)

Steps To Reproduce

You can download the bloomz-7b1 model from Hugging Face, export it to ONNX, and convert the ONNX model with trtexec.
Export the PyTorch model to ONNX:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch  # needed for torch.onnx.export below
import time

model_name = "bigscience/bloomz-7b1"
model_cache_path = r"./bloomz_7b1"

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=model_cache_path)
start_time = time.time()
model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=model_cache_path, torch_dtype="auto", device_map="auto")
print(f"Load model time: {time.time() - start_time} s")

inputs = tokenizer.encode("Translate to English: Je t’aime.", return_tensors="pt").to(device)
start_time = time.time()
outputs = model.generate(inputs)
print(f"inference time: {time.time() - start_time} s")
print(inputs, outputs)
print(tokenizer.decode(outputs[0]))

torch.onnx.export(model,
                  inputs,
                  r"./models/bloomz_7b1/bloomz_7b1.onnx",
                  input_names=["input_ids"],
                  output_names=["outputs"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "sequence"},
                                "outputs": {0: "batch", 1: "sequence"}},
                  opset_version=12)

Convert the ONNX model to a TRT engine with trtexec:

trtexec --onnx=bloomz_7b1.onnx --int8 --shapes=input_ids:1x1024 --saveEngine=bloomz_7b1_1x1024.trt
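To narrow down whether the Reshape_7 failure is specific to INT8 tactic selection, one can also try building without --int8 and at a shorter sequence length. A hedged sketch using standard trtexec flags (the 1x128 shape and output filename are arbitrary examples):

```shell
# Sketch: build in FP16 at a shorter sequence to isolate the failing node;
# --verbose prints per-layer detail during tactic selection.
trtexec --onnx=bloomz_7b1.onnx --fp16 \
        --shapes=input_ids:1x128 \
        --verbose \
        --saveEngine=bloomz_7b1_fp16_1x128.trt
```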

Hi,

Could you please share the ONNX model that reproduces the issue?
Please upload it here or share it via Google Drive or another platform.

Thank you.

Hi,
Please refer to the link below for the sample guide.

Refer to the installation steps from the link in case you are missing anything.

However, the suggested approach is to use TRT NGC containers to avoid any system-dependency-related issues.

In order to run the Python samples, make sure the TRT Python packages are installed while using the NGC container:
/opt/tensorrt/python/python_setup.sh

If you are trying to run a custom model, please share your model and script with us so that we can assist you better.
Thanks!

Hi,

The model file is really large, so my upload failed several times. Can you follow the reproduction steps instead: download the model from Hugging Face and export it to ONNX using my Python script? That should be faster than getting the model from me.

Thanks.

Hi,

We could reproduce a similar error.
Please allow us some time to work on this issue.

Thank you.

Any progress?

bloom-7b1 and bloomz-7b1 will be supported in TensorRT’s future major releases.