Description
I’m trying to convert an ONNX file to a TensorRT engine with trtexec, but the build fails with the following error:
[02/09/2023-08:26:09] [W] [TRT] Skipping tactic 0x0000000000000000 due to Myelin error: autotuning: CUDA error 2
allocating 6443238909-byte buffer: out of memory
[02/09/2023-08:26:09] [E] Error[10]: [optimizer.cpp::computeCosts::3728] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[onnx::MatMul_9665 + (Unnamed Layer* 387) [Shuffle].../input_blocks.1/input_blocks.1.1/Reshape_2 + /input_blocks.1/input_blocks.1.1/Transpose_1 + /input_blocks.1/input_blocks.1.1/Reshape_3]}.)
[02/09/2023-08:26:09] [E] Error[2]: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[02/09/2023-08:26:09] [E] Engine could not be created from network
[02/09/2023-08:26:09] [E] Building engine failed
[02/09/2023-08:26:09] [E] Failed to create engine from model or file.
[02/09/2023-08:26:09] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8501] # trtexec --onnx=unet_ldm.onnx --saveEngine=unet_ldm.trt
Environment
NVIDIA Docker container 22.12
Relevant Files
The model I use is the official Stable Diffusion UNet.
Steps To Reproduce
Use this script to export the UNet to ONNX:
import torch
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config

# Build the UNet from the Stable Diffusion v1 inference config
config = OmegaConf.load("../../stable-diffusion/configs/stable-diffusion/v1-inference.yaml")
config = config.model.params.unet_config
unet = instantiate_from_config(config)
model = unet.eval().cuda()

# Dummy inputs with batch size 6: latents, timesteps, and text context
input_0 = torch.randn(6, 4, 64, 64, dtype=torch.float32).cuda()
input_1 = torch.tensor([1, 3, 7, 8, 9, 23], dtype=torch.int32).cuda()
input_2 = torch.randn(6, 77, 768, dtype=torch.float32).cuda()

# Export without tracking gradients
with torch.no_grad():
    torch.onnx.export(model, (input_0, input_1, input_2), 'unet_ldm.onnx')
Then run trtexec --onnx=unet_ldm.onnx --saveEngine=unet_ldm.trt to generate the engine.
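To make the reproduction easier to script, the invocation can be wrapped in a few lines of Python. This is only a sketch: `build_trtexec_cmd` and `run_trtexec` are hypothetical helper names, and the only flags used are the ones that already appear in this report (`--onnx`, `--saveEngine`, `--verbose`).

```python
import shlex
import subprocess


def build_trtexec_cmd(onnx_path, engine_path, verbose=False):
    """Return the trtexec argument list used in this report."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if verbose:
        cmd.append("--verbose")
    return cmd


def run_trtexec(cmd, log_path="trtexec.log"):
    """Run trtexec, write stdout and stderr to log_path, return the exit code."""
    with open(log_path, "w") as log:
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


if __name__ == "__main__":
    cmd = build_trtexec_cmd("unet_ldm.onnx", "unet_ldm.trt", verbose=True)
    # Print the command; call run_trtexec(cmd) where trtexec is installed.
    print(shlex.join(cmd))
```

Capturing the output in a log file keeps the full verbose trace available when reporting the failure.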
With --verbose, I get the following output:
[02/09/2023-08:45:48] [V] [TRT] *************** Autotuning format combination: Float(1310720,4096,64,1), Float(59136,768,1) -> Float(1310720,4096,64,1) ***************
[02/09/2023-08:45:48] [V] [TRT] --------------- Timing Runner: {ForeignNode[onnx::MatMul_9665 + (Unnamed Layer* 387) [Shuffle].../input_blocks.1/input_blocks.1.1/Reshape_2 + /input_blocks.1/input_blocks.1.1/Transpose_1 + /input_blocks.1/input_blocks.1.1/Reshape_3]} (Myelin)
[02/09/2023-08:45:48] [W] [TRT] Skipping tactic 0x0000000000000000 due to Myelin error: autotuning: CUDA error 2
allocating 6443238909-byte buffer: out of memory
[02/09/2023-08:45:48] [V] [TRT] Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[02/09/2023-08:45:48] [V] [TRT] Deleting timing cache: 820 entries, served 8637 hits since creation.
[02/09/2023-08:45:49] [E] Error[10]: [optimizer.cpp::computeCosts::3728] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[onnx::MatMul_9665 + (Unnamed Layer* 387) [Shuffle].../input_blocks.1/input_blocks.1.1/Reshape_2 + /input_blocks.1/input_blocks.1.1/Transpose_1 + /input_blocks.1/input_blocks.1.1/Reshape_3]}.)
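The numbers in the "Autotuning format combination" line appear to be element strides rather than shapes. As a sanity check, the sketch below computes the row-major strides for the tensors I believe the node sees. The shapes are assumptions: the log does not print them, the batch of 6 and the 77x768 context come from the export script, and the 320 channels are what I expect from the v1 config.

```python
def contiguous_strides(shape):
    """Element strides of a row-major (contiguous) tensor with the given shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


# Assumed shapes (not printed in the log):
feature_map = (6, 320, 64, 64)  # batch, channels, height, width
context = (6, 77, 768)          # batch, tokens, embedding size

print(contiguous_strides(feature_map))  # (1310720, 4096, 64, 1)
print(contiguous_strides(context))      # (59136, 768, 1)
```

Both results match the log, so the failing node most likely takes the 320-channel 64x64 feature map and the text context as inputs, which is consistent with the first attention block at input_blocks.1.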
I’m not sure whether this is caused by insufficient GPU memory or by an internal TensorRT bug.
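For scale, here is a rough back-of-the-envelope sketch of the failed allocation. The byte count is taken from the log. The attention estimate is an assumption on my part: it supposes the node materializes full FP32 attention score matrices for 8 heads over 64x64 = 4096 tokens at batch size 6, and none of those values are confirmed by the log.

```python
GIB = 1024 ** 3

# Size of the buffer TensorRT failed to allocate (from the log).
requested_bytes = 6443238909
print(f"requested: {requested_bytes / GIB:.2f} GiB")  # requested: 6.00 GiB


def attention_scores_bytes(batch, heads, tokens, bytes_per_elem=4):
    """Bytes needed for one (batch, heads, tokens, tokens) score tensor."""
    return batch * heads * tokens * tokens * bytes_per_elem


# Assumed values: batch 6 from the export script, 8 heads, 64*64 tokens, FP32.
one_matrix = attention_scores_bytes(batch=6, heads=8, tokens=64 * 64)
print(f"one score tensor: {one_matrix / GIB:.2f} GiB")       # 3.00 GiB
print(f"two score tensors: {2 * one_matrix / GIB:.2f} GiB")  # 6.00 GiB
```

If this estimate is in the right ballpark, the request is roughly two such score tensors, and it would shrink linearly with the batch size used at export time. That would point toward a genuine memory limit rather than a bug, but I can't confirm it from the log alone.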