Error with Concat layer

Hello,

When trying to convert a Caffe GoogLeNet model to a TensorRT engine, I ran into some issues.
My network is a GoogLeNet used as the feature extractor in the Faster R-CNN algorithm.
The RPROIFused layer provided by the library works fine and is not the problem; I have managed to run the Faster R-CNN sample successfully.
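
For context, my conversion code essentially follows the Faster R-CNN sample: parse the prototxt/caffemodel with nvcaffeparser1 and hand it a plugin factory for the RPROI layer. Below is a stripped-down sketch of that path, not my actual code; the plugin factory is assumed to come from the sample, and the file and output blob names are placeholders.

#include <NvInfer.h>
#include <NvCaffeParser.h>
#include <iostream>

using namespace nvinfer1;
using namespace nvcaffeparser1;

// Minimal console logger for TensorRT messages.
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} gLogger;

// Assumed to be defined elsewhere: a plugin factory implementing
// nvcaffeparser1::IPluginFactoryExt that creates the RPROIFused plugin,
// as in the Faster R-CNN sample.
extern IPluginFactoryExt& gPluginFactory;

ICudaEngine* buildEngine(const char* deployFile, const char* modelFile)
{
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();

    ICaffeParser* parser = createCaffeParser();
    parser->setPluginFactoryExt(&gPluginFactory); // registers RPROIFused at parse time

    // The "all concat input tensors must have the same dimensions" error is
    // reported during this parse step, at inception_5a/output.
    const IBlobNameToTensor* blobs =
        parser->parse(deployFile, modelFile, *network, DataType::kFLOAT);

    // Output blob names as in the Faster R-CNN sample (placeholders here).
    const char* outputs[] = {"bbox_pred", "cls_prob", "rois"};
    for (const char* out : outputs)
        network->markOutput(*blobs->find(out));

    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(1 << 28);
    ICudaEngine* engine = builder->buildCudaEngine(*network);

    parser->destroy();
    network->destroy();
    builder->destroy();
    return engine;
}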

I get the following error message:

ERROR: inception_5a/output: all concat input tensors must have the same dimensions except on the concatenation axis
model_optimizer: ../common/enginehelper.h:89: nvinfer1::DimsCHW enginehelper::getCHW(const nvinfer1::Dims&): Assertion `d.nbDims >= 3' failed.

The problem seems to come from the Concat layer, yet the blob dimensions look correct to me.
The bottom blob dimensions are
(1, 256, 7, 7)
(1, 320, 7, 7)
(1, 128, 7, 7)
(1, 128, 7, 7)

and the expected top dimension is
(1, 832, 7, 7)
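
The channel counts add up (256 + 320 + 128 + 128 = 832) and all four bottoms share the same 7x7 spatial size, so the concatenation itself looks well-formed. As an illustration only (a hypothetical helper reusing the headers and namespace from the sketch above), adding just this concat to a TensorRT network should report 832x7x7 for the top tensor:

// Builds only the inception_5a/output concatenation with the bottom shapes
// listed above, to check what TensorRT computes for the top tensor.
// Names are illustrative; the batch dimension is implicit in TensorRT 5.
void checkInception5aConcat(INetworkDefinition* network)
{
    const int channels[4] = {256, 320, 128, 128};
    const char* names[4] = {"branch0", "branch1", "branch2", "branch3"};
    ITensor* inputs[4];
    for (int i = 0; i < 4; ++i)
        inputs[i] = network->addInput(names[i], DataType::kFLOAT,
                                      DimsCHW{channels[i], 7, 7});

    // IConcatenationLayer concatenates along the channel axis by default, so
    // the top should be (256 + 320 + 128 + 128) x 7 x 7 = 832 x 7 x 7.
    IConcatenationLayer* concat = network->addConcatenation(inputs, 4);
    Dims d = concat->getOutput(0)->getDimensions();
    std::cout << "inception_5a/output: "
              << d.d[0] << "x" << d.d[1] << "x" << d.d[2] << std::endl;
}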

However, in the GoogLeNet sample, a Concat layer with the same number of bottom blobs is used successfully.
Is this a known issue? Did I miss something?

This issue is related to the following one (but with TensorRT 4):
https://devtalk.nvidia.com/default/topic/1045943/tensorrt/faster-rcnn-using-googlenet-as-feature-extractor-in-tensorrt-4/post/5311225/#5311225

I am using TensorRT 5.0.2, CUDA 10, cuDNN 7.4.2, and an RTX 2080 on Ubuntu 18.04 with driver 410.72.

Hello,

We are triaging and will keep you updated.

Hello,

To help us debug, can you please share a repro that contains the model/network and source code demonstrating the errors you are seeing? We are not seeing this issue locally:

./infer_perf --deploy=/home/mvillmow/Downloads/faster_rcnn_test_iplugin_googlenet_v3_0.prototxt --output=cls_prob
&&&& RUNNING TensorRT.infer_perf # ./infer_perf --deploy=/home/mvillmow/Downloads/faster_rcnn_test_iplugin_googlenet_v3_0.prototxt --output=cls_prob
[I] deploy: /home/mvillmow/Downloads/faster_rcnn_test_iplugin_googlenet_v3_0.prototxt
[I] output: cls_prob
[I] Running on CUDA device: TITAN V (1.455 GHz, 80 SMs, mem 0.85 GHz, ECC disabled, 3072 bits, Compute Capability 7.0)
[I] Default InternalBuildFlags = 406f
[I] Updating InternalBuildFlags = 406f
[I] Input "data": 3x224x224
[I] Input "im_info": 1x1x3
[I] Output "cls_prob": 300x21x1
[I] Average over 10 runs is 13.6898 ms (host walltime is 16.2224 ms, enqueue time is 0.792459 ms, 99% percentile time is 14.6647 ms)
[I] Average over 10 runs is 13.3437 ms (host walltime is 16.0322 ms, enqueue time is 0.702377 ms, 99% percentile time is 14.9454 ms)
[I] Average over 10 runs is 13.6833 ms (host walltime is 16.1863 ms, enqueue time is 0.785873 ms, 99% percentile time is 15.2731 ms)
[I] Average over 10 runs is 13.5749 ms (host walltime is 16.1928 ms, enqueue time is 0.822019 ms, 99% percentile time is 14.9616 ms)
[I] Average over 10 runs is 13.3563 ms (host walltime is 16.0115 ms, enqueue time is 0.826268 ms, 99% percentile time is 14.9582 ms)
[I] Average over 10 runs is 13.25 ms (host walltime is 15.8614 ms, enqueue time is 0.848779 ms, 99% percentile time is 14.9979 ms)
[I] Average over 10 runs is 13.3783 ms (host walltime is 15.9838 ms, enqueue time is 0.787547 ms, 99% percentile time is 14.1146 ms)
[I] Average over 10 runs is 13.6333 ms (host walltime is 16.1485 ms, enqueue time is 0.876499 ms, 99% percentile time is 14.9617 ms)
[I] Average over 10 runs is 13.2522 ms (host walltime is 15.7596 ms, enqueue time is 0.892397 ms, 99% percentile time is 14.661 ms)
[I] Average over 10 runs is 13.5383 ms (host walltime is 16.1853 ms, enqueue time is 0.917691 ms, 99% percentile time is 14.2132 ms)
[I] CUDA device throughput: 14.8992 TFLOPS fp32 (@ 1.455 GHz), 522.24 GB/s practical achievable mem bw (@ 0.85 GHz 3072 bits, assuming 0.8 achievable/theoretical peak ratio)
[I] NOTE: The SOL analysis is based on achievable memory bandwidth and on the SOL calculated using direct convolution formula; FFT- and Winograd-based algos might report >100% of SOL this way
[I] Computational efficiency (assuming all memory bw bound layers achieve practical peak) is 2.54%
[I] Median runtime is 13.4583 ms (host time is 16.0846 ms, host enqueue time is 0.852619 ms, 99% percentile time is 14.1639 ms). Median throughput is 74.3037 infers per second.
&&&& PASSED TensorRT.infer_perf # ./infer_perf --deploy=/home/mvillmow/Downloads/faster_rcnn_test_iplugin_googlenet_v3_0.prototxt --output=cls_prob

Hello,

I have sent you a private message with the requested data.

Thanks for your help :)

Hello,

Per engineering:

After getting it to compile and loading the attached deploy_plugin.txt and weights_modif.caffemodel, this is the output I get:

***** Model optimization *****

Caffe library
Caffe model file : /home/mvillmow/p4/mvillmow-tensorrt/sw/gpgpu/MachineLearning/DIT/release/5.1/build/x86_64-linux/d.pry
Caffe weights file : /home/mvillmow/p4/mvillmow-tensorrt/sw/gpgpu/MachineLearning/DIT/release/5.1/build/x86_64-linux/w.pry
Caffe mean file :
Number of network outputs : 3
Batch size : 1
Precision : FP 32
Number of custom layers : 1

File /home/mvillmow/p4/mvillmow-tensorrt/sw/gpgpu/MachineLearning/DIT/release/5.1/build/x86_64-linux/weights.pry generated

Is this the correct output?

Hello,

Per engineering, this has been addressed in the next release of TensorRT. I can’t discuss the release schedule here, but please stay tuned for the announcement.

Regards,
NVES

Hello,

Yes, this is the correct output without any error message from TensorRT.

Thanks for your help, I will wait for the next release.

Denis

Hello Moderator,

Can you please share the .prototxt and .caffemodel files you used for the inference results you posted on 02/07/2019 10:52 PM?

Thanks!