Kernel weights has count 2304 but 32640 was expected

jinshiksung · February 19, 2022, 6:23pm

Description

I am trying to build a TensorRT engine on a Jetson TX2.

I am getting this error:

[02/19/2022-11:33:11] [E] [TRT] 3: (Unnamed Layer* 214) [Convolution]:kernel weights has count 2304 but 32640 was expected
[02/19/2022-11:33:11] [E] [TRT] 4: (Unnamed Layer* 214) [Convolution]: count of 2304 weights in kernel, but kernel dimensions (1,1) with 128 input channels, 255 output channels and 1 groups were specified. Expected Weights count is 128 * 1*1 * 255 / 1 = 32640
[02/19/2022-11:33:11] [E] [TRT] 4: [convolutionNode.cpp::computeOutputExtents::28] Error Code 4: Internal Error ((Unnamed Layer* 214) [Convolution]: number of kernel weights does not match tensor dimensions)
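
The expected count in this log follows directly from the convolution shape, and the actual count hints at what the weight file contains. Below is a minimal sketch of that arithmetic. It assumes a YOLO-style detection head with 3 anchors per cell and 5 + num_classes values per anchor, which is an assumption about the model and not something the log states; both function names are hypothetical.

```python
def expected_conv_weights(in_ch, out_ch, kernel=(1, 1), groups=1):
    """Weight count expected for a convolution with the given shape."""
    kh, kw = kernel
    return in_ch * kh * kw * out_ch // groups


def out_channels_from_count(count, in_ch, kernel=(1, 1), groups=1):
    """Output channels implied by the weight count actually found in the file."""
    kh, kw = kernel
    per_out = in_ch * kh * kw // groups
    if count % per_out:
        raise ValueError("count is not a whole number of output channels")
    return count // per_out


print(expected_conv_weights(128, 255))     # 32640, the count the builder expected
print(out_channels_from_count(2304, 128))  # 18, the channels the file provides
# Under the 3-anchor assumption: 255 = 3 * (5 + 80) and 18 = 3 * (5 + 1).
```

If that reading is right, the network definition was built for 80 classes while the .wts file holds a 1-class head, so the class-count constant in the C++ code would be the first thing to check.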

Environment

TensorRT Version: 8.0
GPU Type: Jetson TX2
Nvidia Driver Version:
CUDA Version: 10.2
CUDNN Version:
Operating System + Version: Ubuntu 18
Python Version (if applicable): Python 3.6
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.10
Baremetal or Container (if container which image + tag):

Relevant Files

This is the weight file I generated.

Steps To Reproduce

toolkits/deploy

I am working through the deployment part.
The code builds, but I get the error mentioned above while creating the engine.

Please help me

Thanks

AastaLLL · February 21, 2022, 5:49am

Hi,

Have you tried to run the author’s model?
Please check if this error only occurs on the customized model or also appears in their default version.

Thanks.

jinshiksung · February 21, 2022, 2:21pm

I ran the Python version successfully.

Now I am trying to run the C++ version on TensorRT (toolkits/deploy).
I did not modify the author’s model, so it should be the default version.

The error happens when I try to create the engine.

The attached .wts file is the TensorRT translation of ‘weights/End-to-end.pth’.

I updated the source base and rebuilt it.
The error changed to this:

Loading weights: yolop.wts
[02/21/2022-18:21:26] [E] [TRT] 3: (Unnamed Layer* 214) [Convolution]:kernel weights has count 2304 but 6912 was expected
[02/21/2022-18:21:26] [E] [TRT] 4: (Unnamed Layer* 214) [Convolution]: count of 2304 weights in kernel, but kernel dimensions (1,1) with 128 input channels, 54 output channels and 1 groups were specified. Expected Weights count is 128 * 1*1 * 54 / 1 = 6912
[02/21/2022-18:21:26] [E] [TRT] 4: [convolutionNode.cpp::computeOutputExtents::28] Error Code 4: Internal Error ((Unnamed Layer* 214) [Convolution]: number of kernel weights does not match tensor dimensions)
[02/21/2022-18:21:26] [E] [TRT] 3: (Unnamed Layer* 214) [Convolution]:kernel weights has count 2304 but 6912 was expected
[02/21/2022-18:21:26] [E] [TRT] 4: (Unnamed Layer* 214) [Convolution]: count of 2304 weights in kernel, but kernel dimensions (1,1) with 128 input channels, 54 output channels and 1 groups were specified. Expected Weights count is 128 * 1*1 * 54 / 1 = 6912

AastaLLL · March 10, 2022, 7:56am

Hi,

When you run the python source, you should get an ONNX model.
Could you try to generate the TensorRT engine with that model to see if it works?

$ /usr/src/tensorrt/bin/trtexec --onnx=[model]

Thanks.

jinshiksung · March 10, 2022, 6:43pm

I generated the engine file; however, when I run it
I get the errors below.

I reinstalled the Jetson TX2 with the latest OS image.

Does the default OpenCV on the Jetson TX2 come with CUDA support?

[03/10/2022-14:49:13] [E] [TRT] 3: Cannot find binding of given name: data
[03/10/2022-14:49:13] [E] [TRT] 3: Cannot find binding of given name: det
[03/10/2022-14:49:13] [E] [TRT] 3: Cannot find binding of given name: seg
[03/10/2022-14:49:13] [E] [TRT] 3: Cannot find binding of given name: lane

terminate called after throwing an instance of 'cv::Exception'
what(): OpenCV(4.1.1) /home/nvidia/host/build_opencv/nv_opencv/modules/core/include/opencv2/core/private.cuda.hpp:107: error: (-216:No CUDA support) The library is compiled without CUDA support in function 'throw_no_cuda'

Aborted (core dumped)
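
The four "Cannot find binding" errors mean the application asked the engine for tensors named data, det, seg and lane, which an engine built from the ONNX file does not necessarily contain. Below is a sketch for listing what an engine actually exposes, using the binding accessors of the TensorRT 8.0 Python API (num_bindings, get_binding_name, binding_is_input, get_binding_shape). The stand-in engine class is illustrative only, so the snippet runs without a GPU; its tensor names mirror those printed later in this thread for the 640 ONNX model, and whether the 1280 engine uses the same names is an assumption.

```python
def list_bindings(engine):
    """Return (index, name, kind, shape) for every binding of a TensorRT engine."""
    rows = []
    for i in range(engine.num_bindings):
        kind = "input" if engine.binding_is_input(i) else "output"
        shape = tuple(engine.get_binding_shape(i))
        rows.append((i, engine.get_binding_name(i), kind, shape))
    return rows


def missing_bindings(engine, wanted):
    """Names the application expects but the engine does not provide."""
    present = {name for _, name, _, _ in list_bindings(engine)}
    return [name for name in wanted if name not in present]


class FakeEngine:
    """Stand-in with the same accessors; the names below are assumed examples."""

    _bindings = [
        ("images", True, (1, 3, 1280, 1280)),
        ("det_out", False, (1, 100800, 6)),
        ("drive_area_seg", False, (1, 2, 1280, 1280)),
        ("lane_line_seg", False, (1, 2, 1280, 1280)),
    ]
    num_bindings = len(_bindings)

    def get_binding_name(self, i):
        return self._bindings[i][0]

    def binding_is_input(self, i):
        return self._bindings[i][1]

    def get_binding_shape(self, i):
        return self._bindings[i][2]


engine = FakeEngine()  # replace with the deserialized engine on the device
for row in list_bindings(engine):
    print(row)
print("missing:", missing_bindings(engine, ["data", "det", "seg", "lane"]))
```

With a real engine, any name reported as missing would need to be replaced in the application by one of the names the listing prints.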

AastaLLL · March 14, 2022, 5:21am

Hi,

Would you mind sharing the ONNX model with us?
The default OpenCV has CUDA support, but only for those implementations that belong to the core library.

We assume you also have a customized C++ implementation.
Would you mind sharing it with us as well?

Thanks.

jinshiksung · March 14, 2022, 11:02pm

Hi,

Here is the implementation.

yolop-1280-1280.onnx is the ONNX model.

Thanks

AastaLLL · March 24, 2022, 9:41am

Hi,

We tested your model and it works correctly with TensorRT.
So please debug your implementation.
inference.py (2.1 KB)

$ /usr/src/tensorrt/bin/trtexec --onnx=deploy/yolop-1280-1280.onnx --saveEngine=model.trt
$ python3 inference.py

You can also find a C++ example below:

/usr/src/tensorrt/samples/sampleMNIST

Thanks.

jinshiksung · March 25, 2022, 7:24pm

It runs but produces empty output
when I feed it a 1280x1280 JPEG.

np.argmax is 0 on output 2 and output 3.
The number for output 1 is too high, isn’t it?

image = cv2.imread("road.jpg")
image = (2.0 / 255.0) * image.transpose((2, 0, 1)) - 1.0
np.copyto(host_inputs[0], image.ravel())

[TensorRT] INFO: [MemUsageChange] Init CUDA: CPU +234, GPU +0, now: CPU 329, GPU 5737 (MiB)
[TensorRT] INFO: [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 329, GPU 5737 (MiB)
[TensorRT] INFO: Loaded engine size: 82 MB
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine begin: CPU 412 MiB, GPU 5819 MiB
[TensorRT] WARNING: Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +167, GPU +168, now: CPU 590, GPU 5998 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +250, GPU +251, now: CPU 840, GPU 6249 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 839, GPU 6249 (MiB)
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine end: CPU 839 MiB, GPU 6249 MiB
(1, 3, 1280, 1280)
259
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation begin: CPU 804 MiB, GPU 6214 MiB
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 804, GPU 6214 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 804, GPU 6214 (MiB)
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation end: CPU 805 MiB, GPU 6306 MiB
execute times 0.6288807392120361
47034
0
0
output 1: (1, 100800, 6) 47034
output 2: (1, 2, 1280, 1280) 0
output 3: (1, 2, 1280, 1280) 0
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 797, GPU 6309 (MiB)

AastaLLL · March 28, 2022, 7:56am

Hi,

It seems the pre-processing is different.
Please try to apply the following to see if it helps:

https://github.com/hustvl/YOLOP/blob/main/test_onnx.py#L60

for oo in outputs_info:
    print("Output: ", oo)

print("num outputs: ", len(outputs_info))

save_det_path = f"./pictures/detect_onnx.jpg"
save_da_path = f"./pictures/da_onnx.jpg"
save_ll_path = f"./pictures/ll_onnx.jpg"
save_merge_path = f"./pictures/output_onnx.jpg"

img_bgr = cv2.imread(img_path)
height, width, _ = img_bgr.shape

# convert to RGB
img_rgb = img_bgr[:, :, ::-1].copy()

# resize & normalize
canvas, r, dw, dh, new_unpad_w, new_unpad_h = resize_unscale(img_rgb, (640, 640))

img = canvas.copy().astype(np.float32)  # (3,640,640) RGB
img /= 255.0
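
The call to resize_unscale above returns a scale ratio and padding offsets, which suggests a letterbox resize: scale the image to fit the network input while keeping its aspect ratio, then pad the remainder. The sketch below computes only that geometry. It is a reading of the returned names, not the repository's code, and letterbox_geometry is a hypothetical helper.

```python
def letterbox_geometry(height, width, target_h, target_w):
    """Scale ratio, resized size and padding offsets for an aspect-preserving resize."""
    r = min(target_h / height, target_w / width)
    new_unpad_w = int(round(width * r))
    new_unpad_h = int(round(height * r))
    dw = (target_w - new_unpad_w) // 2  # padding on the left
    dh = (target_h - new_unpad_h) // 2  # padding on the top
    return r, new_unpad_w, new_unpad_h, dw, dh


# A 1280x720 frame fed to a 640x640 network input:
print(letterbox_geometry(720, 1280, 640, 640))    # (0.5, 640, 360, 0, 140)
# A 1280x960 frame fed to a 1280x1280 network input:
print(letterbox_geometry(960, 1280, 1280, 1280))  # (1.0, 1280, 960, 0, 160)
```

Feeding the raw image without this step, or with a pixel scaling other than the division by 255 shown in the snippet, would give the network inputs that differ from what the reference script uses.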

Also, according to the Netron visualizer,
the n value of det_out (1 x n x 6) really is 100800.
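
That row count is what a three-scale YOLO head would produce. Here is a quick check, assuming strides of 8, 16 and 32 and 3 anchors per grid cell (typical for YOLOv5-style heads, but an assumption here); num_predictions is a hypothetical helper.

```python
def num_predictions(input_size, strides=(8, 16, 32), anchors_per_cell=3):
    """Total candidate boxes for a square input across all detection scales."""
    return sum(anchors_per_cell * (input_size // s) ** 2 for s in strides)


print(num_predictions(1280))  # 100800
print(num_predictions(640))   # 25200
```

Under those assumptions, 100800 rows for a 1280x1280 input are raw candidates before confidence filtering and NMS, not final detections.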

Thanks.

jinshiksung · March 28, 2022, 4:54pm

I made the modification.

It still produces empty output.
The only difference is that
output 1: (1, 100800, 6) 47034 --> output 1: (1, 100800, 6) 576234

Can you tell me what is wrong?

output 1: (1, 100800, 6) 576234<----np.argmax value
output 2: (1, 2, 1280, 1280) 0<----np.argmax value
output 3: (1, 2, 1280, 1280) 0<----np.argmax value
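
Note that np.argmax without an axis returns the flat index of the largest element, not a value, so a result of 0 does not by itself mean the output is empty. Below is a small NumPy sketch of more telling checks. Treating channel 1 as the foreground class is an assumption about the segmentation outputs, and summarize_seg_output is a hypothetical helper.

```python
import numpy as np


def summarize_seg_output(seg):
    """Basic statistics for a (1, 2, H, W) array of per-class scores."""
    class_map = np.argmax(seg, axis=1)[0]  # (H, W) map of per-pixel class ids
    return {
        "flat_argmax": int(np.argmax(seg)),  # position of the max, not its value
        "min": float(seg.min()),
        "max": float(seg.max()),
        "foreground_pixels": int((class_map == 1).sum()),
    }


# Toy output: background wins everywhere except a 2x2 patch.
seg = np.zeros((1, 2, 4, 4), dtype=np.float32)
seg[0, 0] = 0.9
seg[0, 1] = 0.1
seg[0, 0, 1:3, 1:3] = 0.2
seg[0, 1, 1:3, 1:3] = 0.8

stats = summarize_seg_output(seg)
print(stats["flat_argmax"], stats["foreground_pixels"])  # 0 4

# The detection argmax is likewise a flat index into (1, 100800, 6):
row = tuple(int(i) for i in np.unravel_index(576234, (1, 100800, 6)))
print(row)  # (0, 96039, 0)
```

In the toy case the flat argmax is 0 even though four pixels are classified as foreground, which is why min, max and the per-pixel class map are better indicators than a bare np.argmax.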

Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 792, GPU 4417 (MiB)
jinshiksung@Jetson:~/YOLOP-main$ python3 tools/inference1.py
[TensorRT] INFO: [MemUsageChange] Init CUDA: CPU +234, GPU +0, now: CPU 329, GPU 3948 (MiB)
[TensorRT] INFO: [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 329, GPU 3948 (MiB)
[TensorRT] INFO: Loaded engine size: 82 MB
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine begin: CPU 412 MiB, GPU 4030 MiB
[TensorRT] WARNING: Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +167, GPU +168, now: CPU 590, GPU 4207 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +250, GPU +252, now: CPU 840, GPU 4459 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 840, GPU 4459 (MiB)
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine end: CPU 840 MiB, GPU 4459 MiB
960 1280
(1, 3, 1280, 1280)
522006
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation begin: CPU 839 MiB, GPU 4459 MiB
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 839, GPU 4459 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +1, GPU +0, now: CPU 840, GPU 4459 (MiB)
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation end: CPU 840 MiB, GPU 4460 MiB
execute times 0.012740960000000356
576234
0
0
output 1: (1, 100800, 6) 576234
output 2: (1, 2, 1280, 1280) 0
output 3: (1, 2, 1280, 1280) 0
process times 0.7451001599999998
analyze times 0.8673133120000003
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 806, GPU 4454 (MiB)

AastaLLL · March 29, 2022, 5:54am

Thanks for your feedback.

We are going to check this in depth.
We will share more information with you later.

AastaLLL · April 6, 2022, 7:06am

Hi,

Thanks for your patience.

It seems the model has its own customized parser.
We applied the NMS implementation from the author and were able to get the correct output bboxes.

https://github.com/hustvl/YOLOP/blob/main/lib/core/general.py#L98

But it’s recommended to re-implement the NMS with the NumPy library.
This will allow you to run the inference without PyTorch and TorchVision to save time and memory.
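
A minimal NumPy sketch of such a re-implementation follows. It is not the author's code: it assumes each det_out row is [cx, cy, w, h, objectness, class_score], uses plain greedy IoU suppression, and the thresholds are illustrative defaults.

```python
import numpy as np


def xywh_to_xyxy(boxes):
    """Convert [cx, cy, w, h] boxes to [x1, y1, x2, y2]."""
    out = np.empty_like(boxes)
    out[:, 0] = boxes[:, 0] - boxes[:, 2] / 2
    out[:, 1] = boxes[:, 1] - boxes[:, 3] / 2
    out[:, 2] = boxes[:, 0] + boxes[:, 2] / 2
    out[:, 3] = boxes[:, 1] + boxes[:, 3] / 2
    return out


def nms_numpy(pred, conf_thres=0.25, iou_thres=0.45):
    """Greedy NMS over an (n, 6) array; returns (k, 5) rows of [x1, y1, x2, y2, score]."""
    scores = pred[:, 4] * pred[:, 5]  # assumed layout: objectness * class score
    mask = scores > conf_thres
    boxes = xywh_to_xyxy(pred[mask, :4])
    scores = scores[mask]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)  # highest score first
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(i)
        # Intersection of the kept box with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter + 1e-9)
        order = rest[iou <= iou_thres]  # drop boxes overlapping the kept one
    keep = np.array(keep, dtype=int)
    return np.hstack([boxes[keep], scores[keep, None]])


pred = np.array([
    [100, 100, 50, 50, 0.9, 1.0],
    [102, 102, 50, 50, 0.8, 1.0],  # overlaps the first box heavily
    [300, 300, 40, 40, 0.7, 1.0],
    [10, 10, 5, 5, 0.1, 1.0],      # below the confidence threshold
], dtype=np.float32)
print(nms_numpy(pred).shape)  # (2, 5)
```

For the engine in this thread the input would be det_out[0], the (100800, 6) slice of the first output; the confidence filter runs first, so the greedy loop only sees the rows that pass it.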

tensorrt_inference.py (6.7 KB)

$ /usr/src/tensorrt/bin/trtexec --onnx=yolop-1280-1280.onnx --saveEngine=model.trt
$ python3 tensorrt_inference.py

Thanks.

jinshiksung · April 6, 2022, 1:00pm

What about the two other outputs?
Their np.argmax value is still 0.

…

det_out = host_outputs[0].reshape(engine.get_binding_shape(1))
out2 = host_outputs[1].reshape(engine.get_binding_shape(2))
out3 = host_outputs[2].reshape(engine.get_binding_shape(3))


print('output 2: '+str(out2.shape)+' '+str(np.argmax(out2)))
print('output 3: '+str(out3.shape)+' '+str(np.argmax(out3)))

boxes = non_max_suppression(torch.Tensor(det_out))[0]

…

and the execution time spiked to 1.29 sec

[TensorRT] INFO: [MemUsageChange] Init CUDA: CPU +231, GPU +0, now: CPU 319, GPU 3581 (MiB)
[TensorRT] INFO: [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 319, GPU 3581 (MiB)
[TensorRT] INFO: Loaded engine size: 82 MB
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine begin: CPU 402 MiB, GPU 3663 MiB
[TensorRT] WARNING: Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +167, GPU +169, now: CPU 580, GPU 3843 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +250, GPU +251, now: CPU 830, GPU 4094 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 829, GPU 4094 (MiB)
[TensorRT] INFO: [MemUsageSnapshot] deserializeCudaEngine end: CPU 829 MiB, GPU 4094 MiB
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation begin: CPU 775 MiB, GPU 4041 MiB
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +0, now: CPU 776, GPU 4041 (MiB)
[TensorRT] INFO: [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 776, GPU 4041 (MiB)
[TensorRT] INFO: [MemUsageSnapshot] ExecutionContext creation end: CPU 776 MiB, GPU 4045 MiB
execute times 1.2920658588409424
output 2: (1, 2, 1280, 1280) 0
output 3: (1, 2, 1280, 1280) 0
detect 13 bounding boxes.
[TensorRT] INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 753, GPU 4031 (MiB)
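
A single timed call can include one-time costs such as lazy initialization, so one number like the 1.29 s above may not reflect steady-state speed. Below is a sketch of a timing helper that discards warm-up runs. Here run_inference is a placeholder for whatever callable wraps the copy-in, execute and copy-out steps, and whether warm-up explains this particular spike is not established in the thread.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=10):
    """Median and worst wall-clock time of fn() after discarding warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), max(samples)


def run_inference():
    """Placeholder workload; replace with the real TensorRT execute call."""
    sum(i * i for i in range(10_000))


median_s, worst_s = benchmark(run_inference)
print(f"median {median_s * 1e3:.2f} ms, worst {worst_s * 1e3:.2f} ms")
```

Timing pre-processing, execution and NMS separately with the same helper would show which stage accounts for the extra time.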

AastaLLL · April 7, 2022, 2:48am

Hi,

We didn’t see the author use these outputs in their testing script.
Do you get output values when using other frameworks for inference?

Thanks.

terpysk · April 15, 2022, 5:44pm

@AastaLLL yes. The author did use 3 outputs in their test script.

Yes, using the 640 ONNX model, we are able to get three outputs (YOLOP/test_onnx.py at main · hustvl/YOLOP · GitHub)

AastaLLL · April 21, 2022, 6:55am

Hi,

Do you get the correct result now?
If not, could you share the source and the expected values of the other two outputs?

Thanks.

terpysk · April 21, 2022, 2:13pm

yolop-640-640.onnx (34.2 MB)
test_onnx.py (10.6 KB)

Download three attachments and run:

python3 test_onnx.py
Load ./weights/yolop-640-640.onnx done!
Input: NodeArg(name='images', type='tensor(float)', shape=[1, 3, 640, 640])
Output: NodeArg(name='det_out', type='tensor(float)', shape=[1, 25200, 6])
Output: NodeArg(name='drive_area_seg', type='tensor(float)', shape=[1, 2, 640, 640])
Output: NodeArg(name='lane_line_seg', type='tensor(float)', shape=[1, 2, 640, 640])
num outputs: 3
detect 14 bounding boxes.
(360, 640)
(360, 640)
detect done.
da seg shape: (720, 1280), max: True
ll_seg shape: (720, 1280), max: True

We get outputs for da_seg and ll_seg, and their max value is 1.
Visualized, the output would look like this:
[image: visualization of the da_seg and ll_seg outputs]

terpysk · April 22, 2022, 1:01pm

The ONNX version works. The TensorRT version is still not working and is slow.

jinshiksung · April 22, 2022, 1:06pm

Please help me get results from the TensorRT version so I can speed things up a bit.
