Description
Converting my ONNX model to a TRT engine succeeds with static shapes but fails with dynamic shapes. The failure is related to Conv2d layers whose "groups" parameter is not 1, e.g. nn.Conv2d(640, 512, kernel_size=3, stride=1, padding=1, groups=2). The PyTorch model definition is identical in both cases (static vs. dynamic); the torch-to-ONNX export script is almost the same except for the dynamic-shape-related lines.
Error:
[11/11/2024-03:55:34] [E] [TRT] ModelImporter.cpp:951: --- End node ---
[11/11/2024-03:55:34] [E] [TRT] ModelImporter.cpp:954: ERROR: ModelImporter.cpp:181 In function parseNode:
[6] Invalid Node - /layers.10/Conv
Error Code: 3: /layers.10/Conv:kernel weights has count 1474560 but 737280 was expected
ITensor::getDimensions: Error Code 4: API Usage Error (/layers.10/Conv: count of 1474560 weights in kernel, but kernel dimensions (3,3) with 320 input channels, 512 output channels and 2 groups were specified. Expected Weights count is 320 * 3*3 * 512 / 2 = 737280)
Model definition:
class Encoder(nn.Module):
    def __init__(self):
        super(Encoder, self).__init__()
        self.group = [1, 2, 4, 8, 1]
        self.layers = nn.ModuleList([
            nn.Conv2d(5, 64, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1, groups=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(640, 512, kernel_size=3, stride=1, padding=1, groups=2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(768, 384, kernel_size=3, stride=1, padding=1, groups=4),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(640, 256, kernel_size=3, stride=1, padding=1, groups=8),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(512, 128, kernel_size=3, stride=1, padding=1, groups=1),
            nn.LeakyReLU(0.2, inplace=True)
        ])

    def forward(self, x):
        bt, c, _, _ = x.size()
        # h, w = h//4, w//4
        out = x
        for i, layer in enumerate(self.layers):
            if i == 8:
                # save the 256-channel feature before layer 8 and remember its spatial size
                x0 = out
                _, _, h, w = x0.size()
            if i > 8 and i % 2 == 0:
                # group-wise concatenation of the saved feature x0 with the current feature,
                # e.g. layers.10 receives 640 = 256 + 384 input channels split into g groups
                g = self.group[(i - 8) // 2]
                x = x0.view(bt, g, -1, h, w)
                o = out.view(bt, g, -1, h, w)
                out = torch.cat([x, o], 2).view(bt, -1, h, w)
            out = layer(out)
        return out
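For reference (my own check, not part of the original model code), the weight layout of the grouped layer in isolation already matches the count of 1474560 reported by TRT:

import torch.nn as nn

# Sketch: PyTorch stores grouped-conv weights as (out_ch, in_ch // groups, kH, kW),
# so layers.10 carries 512 * (640 // 2) * 3 * 3 = 1474560 weight values.
conv = nn.Conv2d(640, 512, kernel_size=3, stride=1, padding=1, groups=2)
print(conv.weight.shape)    # torch.Size([512, 320, 3, 3])
print(conv.weight.numel())  # 1474560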
torch2onnx code:
torch.onnx.export(model,
                  input,
                  onnx_file_name,
                  # verbose=True,
                  export_params=True,
                  opset_version=18,
                  do_constant_folding=True,
                  input_names=['input'],
                  output_names=['output'],
                  dynamic_axes={
                      # 'input': {0: 'time_domain1'},
                      'input': {0: 'time_domain1', 2: 'height1', 3: 'width1'},
                      # 'output': {0: 'time_domain1'},
                      'output': {0: 'time_domain1', 2: 'height2', 3: 'width2'},
                  },
                  )
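The variables used in the export call above could be set up along these lines (a sketch only; the exact dummy-input shape in my original script is not shown here, so the one below is an assumption chosen to match the optShapes profile used with trtexec):

import torch

# Sketch of the surrounding setup, not the original script.
model = Encoder().eval()
# Assumed dummy input matching the optShapes profile (17x5x360x360).
input = torch.randn(17, 5, 360, 360)
onnx_file_name = "my_models/encode_dynamic_shape.onnx"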
trtexec command:
trtexec --onnx=my_models/encode_dynamic_shape.onnx --saveEngine=my_models/encode_sim.trt --minShapes=input:13x5x240x240 --optShapes=input:17x5x360x360 --maxShapes=input:18x5x432x432
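For anyone reproducing this, the exported ONNX can first be sanity-checked at a couple of the profile shapes with onnxruntime before involving trtexec (a sketch; assumes onnxruntime is installed):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("my_models/encode_dynamic_shape.onnx",
                            providers=["CPUExecutionProvider"])
for shape in [(13, 5, 240, 240), (17, 5, 360, 360)]:
    out = sess.run(None, {"input": np.random.randn(*shape).astype(np.float32)})
    print(shape, "->", out[0].shape)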
What I have tried:
I observed that the Conv2d at layers.10 has 640 input channels and 512 output channels. After export to ONNX it looks like this:
![image|474x499](upload://hssArxmPc0lcUYNC6kVcJW4aJo2.png)
The input channel count shown is 320 instead of 640, because the "groups" parameter is 2. When the ONNX model is then converted to TRT, the groups parameter seems to be applied a second time, producing the error above.
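The numbers in the error message line up with that reading; here is my own working with the channel counts from the error:

# Weights actually exported for layers.10 (groups=2, 640 -> 512 channels):
actual = 512 * (640 // 2) * 3 * 3         # 1474560
# What TRT expects once it believes the layer has only 320 input channels
# and divides by groups a second time:
expected_by_trt = 320 * 3 * 3 * 512 // 2  # 737280
print(actual, expected_by_trt)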
When I edit the ONNX file and set the "groups" attribute to 1, converting the ONNX to TRT also fails:
Error:
[11/11/2024-06:31:11] [E] Error[3]: Error Code: 3: /layers.10/Conv:kernel weights has count 1474560 but 2949120 was expected
[11/11/2024-06:31:11] [E] Error[4]: IBuilder::buildSerializedNetwork: Error Code 4: API Usage Error (IConvolutionLayer /layers.10/Conv: /layers.10/Conv: count of 1474560 weights in kernel, but kernel dimensions (3,3) with 640 input channels, 512 output channels and 1 groups were specified. Expected Weights count is 640 * 3*3 * 512 / 1 = 2949120)
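The arithmetic for this attempt is consistent with the same weight tensor (a quick check of the counts from the error):

# The file still holds the weights exported for groups=2,
# but TRT now expects a full 640-channel kernel for groups=1.
actual = 512 * 320 * 3 * 3    # 1474560
expected = 640 * 3 * 3 * 512  # 2949120
print(actual, expected)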
What should I do to fix this? Is this a bug in TensorRT?
Environment
TensorRT Version: 10.4
GPU Type: RTX 4070
Nvidia Driver Version: 550
CUDA Version: 12.4
CUDNN Version:
Operating System + Version: Ubuntu 22.04
Python Version (if applicable): 3.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 2.5.1
Baremetal or Container (if container which image + tag):
Relevant Files
The model definition, ONNX export code, and trtexec command needed to reproduce the issue are included above.
Steps To Reproduce
- Export the Encoder model above to ONNX with the torch.onnx.export call shown (opset 18, dynamic batch/height/width axes).
- Build the engine with the trtexec command above.
- The full error output from both the original build and the groups=1 attempt is included above.