TensorRT wrongly merges two different layers?

Hello!

I’m porting CenterNet (https://github.com/xingyizhou/CenterNet) to TensorRT. I’m using the C++ API and implementing a plugin for the deformable convolution layer.

On the last layer I get:

../builder/cudnnBuilderGraph.cpp (660) - Assertion Error in checkSanity: 0 (tensors.size() == g.tensors.size())

I thought it might be a shape mismatch, but the assertion does not fire when defining the network, only when building the engine.

This error shows up on conv_offset_mask. TensorRT merges two layers that have the same shapes but different weights. In PyTorch it corresponds to this layer: https://github.com/xingyizhou/CenterNet/blob/master/src/lib/models/networks/pose_dla_dcn.py#L477

Can I send you the project so you can check what is wrong? (I can’t publish the code publicly.)

Graphics cards on which this was tested:

  • GeForce 1050Ti (Ubuntu 16.04, CUDA 10.1, TensorRT 6.0.1.5)
  • Tesla K80 (CentOS7, CUDA 10.1, TensorRT 6.0.1.5)
  • Tesla P100 (Ubuntu 16.04, CUDA 10.0, TensorRT 6.0.1.5)

UPD:
Here is the log: https://gist.github.com/blacksailer/ba795610cedca5747271da6698b7b994

This line shows that two layers are merged even though they have different parameters: https://gist.github.com/blacksailer/ba795610cedca5747271da6698b7b994#file-tensorrt-log-L102

Hi,

Can you please share the script & model file to reproduce the issue?

Thanks

I’m hitting the same error. In my project, if I use only the plugin for the deformable convolution layer, I get the correct result, but when I run CenterNet I get the same error.
I can share my ONNX file and plugin code with you.

@uestchanyan Hi! Can you post your code so I can also look into this error?

https://zhuanlan.zhihu.com/p/84125533 (in Chinese) says it is caused by the Slice layer in TensorRT. I have not confirmed that yet.
It is very strange: when I build a dummy model containing only a few conv layers with a modulated deform conv inserted in the middle, I can build the model and run inference. But when I do the same with CenterNet, it fails with the same assertion mentioned above.
BTW, have you tried a simple model with modulated deform conv before?

No, never tried.

Because of this strange TensorRT behavior, I currently split CenterNet into three engines to speed up inference, and I get 30 fps. But I want to have a single engine.

Hi,
When I use a small model to debug my modulated deform conv plugin, it works fine. But when I apply the layer to CenterNet, it shows the same error message as stated above. How can I get my code and CenterNet ONNX model to you so you can reproduce the error?
Thanks.
P.S. @cheivan I would like to share it with you as well for discussion.

Can you post it on github and share link here?

Here it is:


@SunilJB @cheivan

@1051323399 Could you please provide your updates to the CMakeLists.txt files as a reference, so that we can compile and run your plugin with the PyTorch dependencies?

Sure, the repo has been updated. Thanks. @ersheng

@xmpeng A macro named CHECK_PLUGIN_STATUS is referenced multiple times, but its definition is missing from the plugin source code.
Please provide its definition so that we can continue.
Thank you.

@ersheng I renamed the CHECK macro defined in plugin.h (TensorRT/plugin/common) due to its conflict with the one defined in libtorch.

#define CHECK_PLUGIN_STATUS(status) \
    do                              \
    {                               \
        if (status != 0)            \
            abort();                \
    } while (0)

Thanks for your attention.

Hello, @xmpeng

Sorry for the late response. We have verified your plugin with both TensorRT 6.0 and TensorRT 7.0, and here are our conclusions:

  1. The problem does not seem to be related to your plugin (ModulatedDeformConv).
  2. It is a graph-building error: there are structures, such as parallel conv layers followed by split operations, that TensorRT 6.0 cannot handle properly.
  3. TensorRT 7.0 handles this kind of structure without errors.

So we recommend upgrading to TensorRT 7.0 if possible.

Thanks a lot for your clear explanations.
Hopefully TensorRT 7 will support CUDA 10.1 soon.
Regards.

TensorRT 7.1 should support CUDA 10.1 in June.

Hi @ersheng, I’m running into the same problem. Could you please elaborate on why the current plugin does not work with TensorRT 6? Is it due to the Slice layer, or to the parallel convolutions running before it? Would there be any other workaround (for example, writing a custom Split plugin)?

Due to our deployment environment, only TensorRT <= 6 can be used. Your help is much appreciated.