Kernel weights has count 2304 but 32640 was expected

Description

I am trying to build a TensorRT engine on a Jetson TX2.

I am getting this error:

[02/19/2022-11:33:11] [E] [TRT] 3: (Unnamed Layer* 214) [Convolution]:kernel weights has count 2304 but 32640 was expected
[02/19/2022-11:33:11] [E] [TRT] 4: (Unnamed Layer* 214) [Convolution]: count of 2304 weights in kernel, but kernel dimensions (1,1) with 128 input channels, 255 output channels and 1 groups were specified. Expected Weights count is 128 * 1*1 * 255 / 1 = 32640
[02/19/2022-11:33:11] [E] [TRT] 4: [convolutionNode.cpp::computeOutputExtents::28] Error Code 4: Internal Error ((Unnamed Layer* 214) [Convolution]: number of kernel weights does not match tensor dimensions)

Environment

TensorRT Version: 8.0
GPU Type: Jetson TX2
Nvidia Driver Version:
CUDA Version: 10.2
CUDNN Version:
Operating System + Version: Ubuntu 18
Python Version (if applicable): Python 3.6
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.10
Baremetal or Container (if container which image + tag):

Relevant Files

This is the weight file I generated.

Steps To Reproduce

toolkits/deploy

I am working through the deployment part.
The code builds, but I get the error above when creating the engine.

Please help me

Thanks

Hi,

Have you tried to run the author’s model?
Please check if this error only occurs on the customized model or also appears in their default version.

Thanks.

I ran the Python version successfully.

Now I am trying to run the C++ version on TensorRT (toolkits/deploy).
I did not modify the author’s model, so it should be the default version.

The error happens when I try to create the engine.

The attached .wts file was converted for TensorRT from ‘weights/End-to-end.pth’.

I updated the source and rebuilt.
The error changed to this:

Loading weights: yolop.wts
[02/21/2022-18:21:26] [E] [TRT] 3: (Unnamed Layer* 214) [Convolution]:kernel weights has count 2304 but 6912 was expected
[02/21/2022-18:21:26] [E] [TRT] 4: (Unnamed Layer* 214) [Convolution]: count of 2304 weights in kernel, but kernel dimensions (1,1) with 128 input channels, 54 output channels and 1 groups were specified. Expected Weights count is 128 * 1*1 * 54 / 1 = 6912
[02/21/2022-18:21:26] [E] [TRT] 4: [convolutionNode.cpp::computeOutputExtents::28] Error Code 4: Internal Error ((Unnamed Layer* 214) [Convolution]: number of kernel weights does not match tensor dimensions)
[02/21/2022-18:21:26] [E] [TRT] 3: (Unnamed Layer* 214) [Convolution]:kernel weights has count 2304 but 6912 was expected
[02/21/2022-18:21:26] [E] [TRT] 4: (Unnamed Layer* 214) [Convolution]: count of 2304 weights in kernel, but kernel dimensions (1,1) with 128 input channels, 54 output channels and 1 groups were specified. Expected Weights count is 128 * 1*1 * 54 / 1 = 6912
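
For reference, the counts in these logs can be reproduced with a few lines of Python. This is only a sketch under assumptions not confirmed in this thread: that layer 214 is a YOLO-style detection head with 3 anchors per scale, so its output channel count is 3 * (num_classes + 5). The helper names are hypothetical. Under that assumption, 255 and 54 output channels would correspond to 80 and 13 classes, while the 2304 weights actually found in the file would correspond to a single class, which may point to a class-count setting in the C++ code that does not match the exported weights.

```python
def expected_conv_weights(in_ch, out_ch, kernel=(1, 1), groups=1):
    """Weight count TensorRT expects for a convolution (formula from the log)."""
    kh, kw = kernel
    return in_ch * kh * kw * out_ch // groups


def yolo_head_channels(num_classes, anchors_per_scale=3):
    """ASSUMPTION: YOLO-style head, each anchor predicts 4 box values + 1 objectness + classes."""
    return anchors_per_scale * (num_classes + 5)


def classes_from_weight_count(count, in_ch, anchors_per_scale=3):
    """Invert the two formulas above for a 1x1 convolution with groups=1."""
    out_ch = count // in_ch
    return out_ch // anchors_per_scale - 5


print(expected_conv_weights(128, 255))       # count expected in the first log
print(expected_conv_weights(128, 54))        # count expected in the second log
print(classes_from_weight_count(2304, 128))  # class count implied by the file
```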

Hi,

When you run the python source, you should get an ONNX model.
Could you try to generate the TensorRT engine with that model to see if it works?

$ /usr/src/tensorrt/bin/trtexec --onnx=[model]

Thanks.

I generated the engine file, but when I run it
I get the error below.

I reinstalled the Jetson TX2 with the latest OS image.

Does the default OpenCV on Jetson TX2 come with CUDA support?

[03/10/2022-14:49:13] [E] [TRT] 3: Cannot find binding of given name: data
[03/10/2022-14:49:13] [E] [TRT] 3: Cannot find binding of given name: det
[03/10/2022-14:49:13] [E] [TRT] 3: Cannot find binding of given name: seg
[03/10/2022-14:49:13] [E] [TRT] 3: Cannot find binding of given name: lane

terminate called after throwing an instance of ‘cv::Exception’
what(): OpenCV(4.1.1) /home/nvidia/host/build_opencv/nv_opencv/modules/core/include/opencv2/core/private.cuda.hpp:107: error: (-216:No CUDA support) The library is compiled without CUDA support in function ‘throw_no_cuda’

Aborted (core dumped)
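
One way to narrow down the "Cannot find binding of given name" messages is to compare the names the application looks up with the names the engine actually exposes. The sketch below is an assumption-laden illustration, not the deploy code itself: `missing_bindings` and `engine_binding_names` are hypothetical helpers, and the "available" names are taken from the 640x640 ONNX listing that appears later in this thread, on the assumption that the 1280x1280 export uses the same names. `num_bindings` and `get_binding_name` are part of the TensorRT 8.0 Python `ICudaEngine` API.

```python
def missing_bindings(expected, available):
    """Return the expected binding names that the engine does not expose."""
    available = set(available)
    return [name for name in expected if name not in available]


def engine_binding_names(engine):
    """List all binding names of a deserialized TensorRT 8.0 engine."""
    return [engine.get_binding_name(i) for i in range(engine.num_bindings)]


# Names requested in the error log above.
expected = ["data", "det", "seg", "lane"]
# ASSUMPTION: names as printed for the ONNX model elsewhere in this thread.
available = ["images", "det_out", "drive_area_seg", "lane_line_seg"]

print(missing_bindings(expected, available))
```

If all four names come back as missing, the lookup strings in the application would need to match the names stored in the engine.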

Hi,

Would you mind sharing the ONNX model with us?
The default OpenCV has CUDA support but only for those implementations that belong to the core library.

I suppose you also have a customized C++ implementation.
Would you mind sharing it with us as well?

Thanks.

Hi,

Here is the implementation

yolop-1280-1280.onnx is the onnx model

Thanks

Hi,

We tested your model and it works correctly with TensorRT.
So please debug your implementation.
inference.py (2.1 KB)

$ /usr/src/tensorrt/bin/trtexec --onnx=deploy/yolop-1280-1280.onnx --saveEngine=model.trt
$ python3 inference.py

You can also find a C++ example below:

/usr/src/tensorrt/samples/sampleMNIST

Thanks.

It runs but produces empty output
when I feed it a 1280x1280 JPG.

np.argmax returns 0 for output 2 and output 3.
The number for output 1 is too high, isn’t it?

image = cv2.imread("road.jpg")
image = (2.0 / 255.0) * image.transpose((2, 0, 1)) - 1.0
np.copyto(host_inputs[0], image.ravel())

[TensorRT] INFO: [MemUsageChange] Init CUDA: CPU +234, GPU +0, now: CPU 329, GPU 5737 (MiB)
[TensorRT] INFO: [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 329, GPU 5737 (MiB)
[TensorRT] INFO: Loaded engine size: 82 MB
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine begin: CPU 412 MiB, GPU 5819 MiB
[TensorRT] WARNING: Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +167, GPU +168, now: CPU 590, GPU 5998 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +250, GPU +251, now: CPU 840, GPU 6249 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 839, GPU 6249 (MiB)
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine end: CPU 839 MiB, GPU 6249 MiB
(1, 3, 1280, 1280)
259
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation begin: CPU 804 MiB, GPU 6214 MiB
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 804, GPU 6214 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 804, GPU 6214 (MiB)
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation end: CPU 805 MiB, GPU 6306 MiB
execute times 0.6288807392120361
47034
0
0
output 1: (1, 100800, 6) 47034
output 2: (1, 2, 1280, 1280) 0
output 3: (1, 2, 1280, 1280) 0
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 797, GPU 6309 (MiB)

Hi,

It seems the pre-processing is different.
Please try to apply the following to see if it helps:
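
As an illustration of a pre-processing that differs from the `(2.0 / 255.0) * x - 1.0` scaling, here is a minimal NumPy sketch. Everything specific in it is an assumption rather than something confirmed in this thread: ImageNet mean/std normalization in RGB order, padding with the value 114, and the helper names `pad_to_square` and `preprocess`. It only pads (no resizing), so the image must already fit inside the target size.

```python
import numpy as np

# ASSUMPTION: ImageNet statistics in RGB order, as commonly used by PyTorch models.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def pad_to_square(image, size, fill=114):
    """Center an HxWx3 image on a size x size canvas without resizing it."""
    h, w = image.shape[:2]
    if h > size or w > size:
        raise ValueError("resize the image first so that it fits in the canvas")
    canvas = np.full((size, size, 3), fill, dtype=image.dtype)
    top, left = (size - h) // 2, (size - w) // 2
    canvas[top:top + h, left:left + w] = image
    return canvas


def preprocess(image_bgr, size=1280):
    """uint8 BGR HxWx3 (as returned by cv2.imread) -> float32 1x3xsizexsize."""
    rgb = pad_to_square(image_bgr, size)[:, :, ::-1]  # BGR -> RGB
    x = rgb.astype(np.float32) / 255.0                # scale to [0, 1]
    x = (x - MEAN) / STD                              # per-channel normalization
    x = x.transpose(2, 0, 1)[np.newaxis]              # HWC -> NCHW
    return np.ascontiguousarray(x)
```

The result can then be copied into the input buffer with `np.copyto(host_inputs[0], preprocess(image).ravel())`, as in the snippet posted earlier.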

Also, according to the Netron visualizer,
the n value of det_out (1xnx6) really is 100800.
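
The value 100800 can be sanity-checked by hand. The sketch below assumes a YOLO-style head with strides 8, 16 and 32 and 3 anchors per grid cell; these are assumptions about the model, and `yolo_num_predictions` is a hypothetical helper.

```python
def yolo_num_predictions(height, width, strides=(8, 16, 32), anchors_per_scale=3):
    """Number of candidate boxes produced before NMS.

    ASSUMPTION: one grid per stride, anchors_per_scale boxes per grid cell.
    """
    return sum(anchors_per_scale * (height // s) * (width // s) for s in strides)


print(yolo_num_predictions(1280, 1280))  # n for the 1280x1280 model
print(yolo_num_predictions(640, 640))    # n for a 640x640 model
```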

Thanks.

I made the modification.

It still produces empty output.
The only difference is:
output 1: (1, 100800, 6) 47034 --> output 1: (1, 100800, 6) 576234

Can you tell me what is wrong?

output 1: (1, 100800, 6) 576234<----np.argmax value
output 2: (1, 2, 1280, 1280) 0<----np.argmax value
output 3: (1, 2, 1280, 1280) 0<----np.argmax value

Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 792, GPU 4417 (MiB)
jinshiksung@Jetson:~/YOLOP-main$ python3 tools/inference1.py
[TensorRT] INFO: [MemUsageChange] Init CUDA: CPU +234, GPU +0, now: CPU 329, GPU 3948 (MiB)
[TensorRT] INFO: [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 329, GPU 3948 (MiB)
[TensorRT] INFO: Loaded engine size: 82 MB
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine begin: CPU 412 MiB, GPU 4030 MiB
[TensorRT] WARNING: Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +167, GPU +168, now: CPU 590, GPU 4207 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +250, GPU +252, now: CPU 840, GPU 4459 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 840, GPU 4459 (MiB)
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine end: CPU 840 MiB, GPU 4459 MiB
960 1280
(1, 3, 1280, 1280)
522006
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation begin: CPU 839 MiB, GPU 4459 MiB
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 839, GPU 4459 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +1, GPU +0, now: CPU 840, GPU 4459 (MiB)
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation end: CPU 840 MiB, GPU 4460 MiB
execute times 0.012740960000000356
576234
0
0
output 1: (1, 100800, 6) 576234
output 2: (1, 2, 1280, 1280) 0
output 3: (1, 2, 1280, 1280) 0
process times 0.7451001599999998
analyze times 0.8673133120000003
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 806, GPU 4454 (MiB)

Thanks for your feedback.

We are going to look into this in depth
and will share more information with you later.

Hi,

Thanks for your patience.

It seems the model has its own customized output parser.
We applied the NMS implementation from the author and were able to get the correct output bounding boxes.

https://github.com/hustvl/YOLOP/blob/main/lib/core/general.py#L98

But it’s recommended to re-implement the NMS with the NumPy library.
This will allow you to run the inference without PyTorch and TorchVision to save time and memory.
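
A NumPy-only NMS could look roughly like the sketch below. It is not the author's implementation: it assumes each det_out row is [cx, cy, w, h, objectness, class_score] with a single class, the thresholds are illustrative defaults, and the function names are hypothetical.

```python
import numpy as np


def xywh_to_xyxy(b):
    """Convert [cx, cy, w, h] rows to [x1, y1, x2, y2] rows."""
    out = np.empty_like(b)
    out[:, 0] = b[:, 0] - b[:, 2] / 2
    out[:, 1] = b[:, 1] - b[:, 3] / 2
    out[:, 2] = b[:, 0] + b[:, 2] / 2
    out[:, 3] = b[:, 1] + b[:, 3] / 2
    return out


def nms(boxes, scores, iou_thres=0.45):
    """Greedy NMS; returns indices of kept boxes, highest score first."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(i)
        # Intersection of the best box with every remaining box.
        iw = np.clip(np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]), 0, None)
        ih = np.clip(np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]), 0, None)
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter + 1e-9)
        order = rest[iou <= iou_thres]
    return np.array(keep, dtype=np.int64)


def filter_detections(det_out, conf_thres=0.25, iou_thres=0.45):
    """det_out: (n, 6) array. Returns (k, 5) rows of [x1, y1, x2, y2, score]."""
    det = det_out[det_out[:, 4] > conf_thres]  # objectness filter
    scores = det[:, 4] * det[:, 5]             # ASSUMPTION: score = obj * class
    mask = scores > conf_thres
    det, scores = det[mask], scores[mask]
    boxes = xywh_to_xyxy(det[:, :4])
    keep = nms(boxes, scores, iou_thres)
    return np.concatenate([boxes[keep], scores[keep, None]], axis=1)
```

With the (1, 100800, 6) output from the engine, this would be called as `filter_detections(det_out[0])`.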

tensorrt_inference.py (6.7 KB)

$ /usr/src/tensorrt/bin/trtexec --onnx=yolop-1280-1280.onnx --saveEngine=model.trt
$ python3 tensorrt_inference.py

Thanks.

What about the other two outputs?
Their np.argmax value is still 0.

det_out = host_outputs[0].reshape(engine.get_binding_shape(1))
out2 = host_outputs[1].reshape(engine.get_binding_shape(2))
out3 = host_outputs[2].reshape(engine.get_binding_shape(3))


print('output 2: '+str(out2.shape)+' '+str(np.argmax(out2)))
print('output 3: '+str(out3.shape)+' '+str(np.argmax(out3)))

boxes = non_max_suppression(torch.Tensor(det_out))[0]

Also, the execution time has spiked to 1.29 s.

[TensorRT] INFO: [MemUsageChange] Init CUDA: CPU +231, GPU +0, now: CPU 319, GPU 3581 (MiB)
[TensorRT] INFO: [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 319, GPU 3581 (MiB)
[TensorRT] INFO: Loaded engine size: 82 MB
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine begin: CPU 402 MiB, GPU 3663 MiB
[TensorRT] WARNING: Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +167, GPU +169, now: CPU 580, GPU 3843 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +250, GPU +251, now: CPU 830, GPU 4094 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 829, GPU 4094 (MiB)
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine end: CPU 829 MiB, GPU 4094 MiB
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation begin: CPU 775 MiB, GPU 4041 MiB
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +0, now: CPU 776, GPU 4041 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 776, GPU 4041 (MiB)
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation end: CPU 776 MiB, GPU 4045 MiB
execute times 1.2920658588409424
output 2: (1, 2, 1280, 1280) 0
output 3: (1, 2, 1280, 1280) 0
detect 13 bounding boxes.
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 753, GPU 4031 (MiB)
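
A side note on reading these two numbers: `np.argmax` without an `axis` argument returns the index of the first maximum in the flattened array, so a value of 0 does not necessarily mean the output is empty. For a (1, 2, H, W) segmentation output, the per-pixel class is usually taken along the channel axis. The sketch below uses a tiny synthetic array, not real model output, and `seg_mask` is a hypothetical helper.

```python
import numpy as np


def seg_mask(seg_out):
    """(1, C, H, W) scores -> (H, W) map of the winning class per pixel."""
    return np.argmax(seg_out, axis=1)[0]


# Synthetic example: 2 classes, 2x3 pixels.
seg = np.zeros((1, 2, 2, 3), dtype=np.float32)
seg[0, 0] = 1.0               # class 0 (background) wins everywhere ...
seg[0, :, 1, 2] = [0.0, 1.0]  # ... except at one pixel, where class 1 wins

flat_index = int(np.argmax(seg))  # index into the flattened array
mask = seg_mask(seg)

print(flat_index)
print(mask)
print(int(mask.sum()), "foreground pixel(s)")
```

Checking `mask.sum()` or `mask.max()` on the real outputs may be a more telling test than the flat argmax.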

Hi,

We didn’t see the author use these outputs in their test script.
Do you get the output value when using other frameworks for inference?

Thanks.

@AastaLLL yes. The author did use 3 outputs in their test script.

Yes, using the onnx-640 model we are able to get three outputs (YOLOP/test_onnx.py at main · hustvl/YOLOP · GitHub)

Hi,

Do you get the correct result now?
If not, could you share the source and expected value of the following two outputs?

Thanks.

yolop-640-640.onnx (34.2 MB)
test_onnx.py (10.6 KB)

Download the three attachments and run:

python3 test_onnx.py
Load ./weights/yolop-640-640.onnx done!
Input: NodeArg(name='images', type='tensor(float)', shape=[1, 3, 640, 640])
Output: NodeArg(name='det_out', type='tensor(float)', shape=[1, 25200, 6])
Output: NodeArg(name='drive_area_seg', type='tensor(float)', shape=[1, 2, 640, 640])
Output: NodeArg(name='lane_line_seg', type='tensor(float)', shape=[1, 2, 640, 640])
num outputs: 3
detect 14 bounding boxes.
(360, 640)
(360, 640)
detect done.
da seg shape: (720, 1280), max: True
ll_seg shape: (720, 1280), max: True

This gives outputs for da_seg and ll_seg, and their maximum value is 1.
Visualized, the output looks like this:

The ONNX version works. The TensorRT version is still not working and is slow.

Please help me get correct results from the TensorRT version so that I can speed things up a bit.