getPluginCreator could not find plugin: EfficientNMS_TRT version: 1 error with C++ API but works fine with Python API

Description

Hi,

I am running into an issue while trying the detectron2 TensorRT sample described at https://github.com/NVIDIA/TensorRT/tree/release/8.6/samples/python/detectron2

The provided Python samples for ONNX model conversion, TensorRT engine building, and optimized engine inference work as expected with the TensorRT Python API. I am able to run inference and visualize the results on images.

However, I am writing a C++ inference app that uses the generated engine/ONNX files. When I try to load the same .trt engine or parse the .onnx file using the TensorRT C++ API, I run into plugin errors for EfficientNMS_TRT (which did not show up when the same engine was loaded via the Python script). The snippet below shows an example of the error I am seeing:

[TRT] [W] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[TRT] [I] No importer registered for op: EfficientNMS_TRT. Attempting to import as plugin.
[TRT] [I] Searching for plugin: EfficientNMS_TRT, plugin_version: 1, plugin_namespace: 
[TRT] [E] 3: getPluginCreator could not find plugin: EfficientNMS_TRT version: 1
[TRT] [E] ModelImporter.cpp:771: While parsing node number 254 [EfficientNMS_TRT -> "num_detections_rpn"]:
[TRT] [E] ModelImporter.cpp:772: --- Begin node ---
[TRT] [E] ModelImporter.cpp:773: input: "anchors_3"
input: "scores_unsqueeze:0_2"
input: "default_anchors"
output: "num_detections_rpn"
output: "detection_boxes_rpn"
output: "detection_scores_rpn"
output: "detection_classes_rpn"
name: "nms_rpn"
op_type: "EfficientNMS_TRT"
attribute {
  name: "plugin_version"
  s: "1"
  type: STRING
}
attribute {
  name: "background_class"
  i: -1
  type: INT
}
attribute {
  name: "max_output_boxes"
  i: 1000
  type: INT
}
attribute {
  name: "score_threshold"
  f: 0.01
  type: FLOAT
}
attribute {
  name: "iou_threshold"
  f: 0.7
  type: FLOAT
}
attribute {
  name: "score_activation"
  i: 0
  type: INT
}
attribute {
  name: "class_agnostic"
  i: 0
  type: INT
}
attribute {
  name: "box_coding"
  i: 1
  type: INT
}

[TRT] [E] ModelImporter.cpp:774: --- End node ---
[TRT] [E] ModelImporter.cpp:777: ERROR: builtin_op_importers.cpp:5404 In function importFallbackPluginImporter:
[8] Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"

What is surprising is that the same plugin (EfficientNMS_TRT) works fine with the TensorRT Python API but causes this error with the C++ API. I am not sure why the C++ API is unable to find or link that plugin.

I made sure I am calling initLibNvInferPlugins(&trt_logger_, ""); to register the plugins. After this call, I print some debug info about the registered plugin creators using the code snippet below (a stripped-down sketch of my overall call order is also included after the plugin list further down):

  // Temporarily print the registered plugins
  int numCreators = 0;
  nvinfer1::IPluginCreator* const* tmpList = getPluginRegistry()->getPluginCreatorList(&numCreators);
  for (int k = 0; k < numCreators; ++k)
  {
      if (!tmpList[k])
      {
          std::cout << "Plugin Creator for plugin " << k << " is a nullptr." << std::endl;
          continue;
      }
      std::string pluginName = tmpList[k]->getPluginName();
      std::cout << k << ": " << pluginName << std::endl;
  }

and I do see EfficientNMS_TRT printed, as shown in the console output below:

0: RNNTEncoderPlugin
1: SmallTileGEMM_TRT
2: DLRM_BOTTOM_MLP_TRT
3: CustomQKVToContextPluginDynamic
4: CustomQKVToContextPluginDynamic
5: CustomQKVToContextPluginDynamic
6: CustomSkipLayerNormPluginDynamic
7: CustomSkipLayerNormPluginDynamic
8: CustomSkipLayerNormPluginDynamic
9: CustomSkipLayerNormPluginDynamic
10: SingleStepLSTMPlugin
11: RnRes2FullFusion_TRT
12: RnRes2Br2bBr2c_TRT
13: RnRes2Br2bBr2c_TRT
14: RnRes2Br1Br2c_TRT
15: RnRes2Br1Br2c_TRT
16: GroupNormalizationPlugin
17: CustomGeluPluginDynamic
18: CustomFCPluginDynamic
19: CustomEmbLayerNormPluginDynamic
20: CustomEmbLayerNormPluginDynamic
21: CustomEmbLayerNormPluginDynamic
22: DisentangledAttention_TRT
23: BatchedNMSDynamic_TRT
24: BatchedNMS_TRT
25: BatchTilePlugin_TRT
26: Clip_TRT
27: CoordConvAC
28: CropAndResizeDynamic
29: CropAndResize
30: DecodeBbox3DPlugin
31: DetectionLayer_TRT
32: EfficientNMS_Explicit_TF_TRT
33: EfficientNMS_Implicit_TF_TRT
34: EfficientNMS_ONNX_TRT
35: EfficientNMS_TRT
36: FlattenConcat_TRT
37: GenerateDetection_TRT
38: GridAnchor_TRT
39: GridAnchorRect_TRT
40: InstanceNormalization_TRT
41: InstanceNormalization_TRT
42: LReLU_TRT
43: ModulatedDeformConv2d
44: MultilevelCropAndResize_TRT
45: MultilevelProposeROI_TRT
46: MultiscaleDeformableAttnPlugin_TRT
47: NMSDynamic_TRT
48: NMS_TRT
49: Normalize_TRT
50: PillarScatterPlugin
51: PriorBox_TRT
52: ProposalDynamic
53: ProposalLayer_TRT
54: Proposal
55: PyramidROIAlign_TRT
56: Region_TRT
57: Reorg_TRT
58: ResizeNearest_TRT
59: ROIAlign_TRT
60: RPROI_TRT
61: ScatterND
62: SpecialSlice_TRT
63: Split
64: VoxelGeneratorPlugin

Please share any suggestions or fixes; I have searched through similar issues on this forum but have not had success with any of the proposed solutions. Thank you.

Environment

TensorRT Version: 8.6.1 GA
GPU Type: RTX 3050 Ti
Nvidia Driver Version: 530.30.02
CUDA Version: 11.7
CUDNN Version: 8.8.0.121
Operating System + Version: Ubuntu 22.04
Python Version (if applicable): 3.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.13.1
Baremetal or Container (if container which image + tag):

I am running into the same issue. When I directly deserialize the generated engine file, I get an error like [TRT] [E] 1: [dispatchStubs.cpp::deserializeEngine::14] Error Code 1: Internal Error (Unexpected call to stub)

Following is the detailed log from the terminal output:

[TRT] [I] Loaded engine size: 90 MiB
[TRT] [E] 1: [dispatchStubs.cpp::deserializeEngine::14] Error Code 1: Internal Error (Unexpected call to stub)
Failed to deserialize the loaded TensorRT engine

Hi @rajanand,
Can you please try running it via trtexec and see if the same issue happens?
trtexec --loadEngine=model.plan

Thanks

Hi @AakankshaS,
Yes, I had previously tested it with trtexec and it works fine there, which is why I am a bit confused about why it throws the plugin error when I use the C++ API in my own code.

Here is the example command I ran and the console output:
trtexec --loadEngine=engine.trt --useCudaGraph --noDataTransfers --iterations=1500 --avgRuns=1500
It completes as PASSED and prints the performance summary:

[11/15/2023-06:40:42] [I] === Device Information ===
[11/15/2023-06:40:42] [I] Selected Device: NVIDIA GeForce RTX 3050 Ti Laptop GPU
[11/15/2023-06:40:42] [I] Compute Capability: 8.6
[11/15/2023-06:40:42] [I] SMs: 20
[11/15/2023-06:40:42] [I] Device Global Memory: 3904 MiB
[11/15/2023-06:40:42] [I] Shared Memory per SM: 100 KiB
[11/15/2023-06:40:42] [I] Memory Bus Width: 128 bits (ECC disabled)
[11/15/2023-06:40:42] [I] Application Compute Clock Rate: 1.035 GHz
[11/15/2023-06:40:42] [I] Application Memory Clock Rate: 5.501 GHz
[11/15/2023-06:40:42] [I] 
[11/15/2023-06:40:42] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[11/15/2023-06:40:42] [I] 
[11/15/2023-06:40:42] [I] TensorRT version: 8.6.1
[11/15/2023-06:40:42] [I] Loading standard plugins
[11/15/2023-06:40:43] [I] Engine loaded in 0.126469 sec.
[11/15/2023-06:40:43] [I] [TRT] Loaded engine size: 89 MiB
[11/15/2023-06:40:45] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1917, GPU +372, now: CPU 2550, GPU 1050 (MiB)
[11/15/2023-06:40:46] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1445, GPU +407, now: CPU 3995, GPU 1457 (MiB)
[11/15/2023-06:40:46] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.5.0
[11/15/2023-06:40:46] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +87, now: CPU 0, GPU 87 (MiB)
[11/15/2023-06:40:46] [I] Engine deserialized in 3.58359 sec.
[11/15/2023-06:40:46] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4039, GPU 1456 (MiB)
[11/15/2023-06:40:46] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 4039, GPU 1464 (MiB)
[11/15/2023-06:40:46] [W] [TRT] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.5.0
[11/15/2023-06:40:46] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +268, now: CPU 0, GPU 355 (MiB)
[11/15/2023-06:40:46] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[11/15/2023-06:40:46] [I] Setting persistentCacheLimit to 0 bytes.
[11/15/2023-06:40:46] [I] Using random values for input input_tensor
[11/15/2023-06:40:46] [I] Input binding for input_tensor with dimensions 1x3x1344x1344 is created.
[11/15/2023-06:40:46] [I] Output binding for num_detections_box_outputs with dimensions 1x1 is created.
[11/15/2023-06:40:46] [I] Output binding for detection_boxes_box_outputs with dimensions 1x100x4 is created.
[11/15/2023-06:40:46] [I] Output binding for detection_scores_box_outputs with dimensions 1x100 is created.
[11/15/2023-06:40:46] [I] Output binding for detection_classes_box_outputs with dimensions 1x100 is created.
[11/15/2023-06:40:46] [I] Output binding for detection_masks with dimensions 1x100x28x28 is created.
[11/15/2023-06:40:46] [I] Starting inference
[11/15/2023-06:44:34] [I] Warmup completed 3 queries over 200 ms
[11/15/2023-06:44:34] [I] Timing trace has 1500 queries over 78.899 s
[11/15/2023-06:44:34] [I] 
[11/15/2023-06:44:34] [I] === Trace details ===
[11/15/2023-06:44:34] [I] Trace averages of 1500 runs:
[11/15/2023-06:44:34] [I] Average on 1500 runs - GPU latency: 52.5972 ms - Host latency: 52.5972 ms (enqueue 0.197631 ms)
[11/15/2023-06:44:34] [I] 
[11/15/2023-06:44:34] [I] === Performance summary ===
[11/15/2023-06:44:34] [I] Throughput: 19.0116 qps
[11/15/2023-06:44:34] [I] Latency: min = 51.5604 ms, max = 53.8896 ms, mean = 52.5971 ms, median = 52.5859 ms, percentile(90%) = 52.9336 ms, percentile(95%) = 53.0234 ms, percentile(99%) = 53.3281 ms
[11/15/2023-06:44:34] [I] Enqueue Time: min = 0.0234375 ms, max = 0.483398 ms, mean = 0.197631 ms, median = 0.201172 ms, percentile(90%) = 0.265625 ms, percentile(95%) = 0.273438 ms, percentile(99%) = 0.324219 ms
[11/15/2023-06:44:34] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
[11/15/2023-06:44:34] [I] GPU Compute Time: min = 51.5604 ms, max = 53.8896 ms, mean = 52.5971 ms, median = 52.5859 ms, percentile(90%) = 52.9336 ms, percentile(95%) = 53.0234 ms, percentile(99%) = 53.3281 ms
[11/15/2023-06:44:34] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
[11/15/2023-06:44:34] [I] Total Host Walltime: 78.899 s
[11/15/2023-06:44:34] [I] Total GPU Compute Time: 78.8956 s
[11/15/2023-06:44:34] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/15/2023-06:44:34] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8601]

I am not sure how you are doing this, but I also hit a missing-plugin error when I was trying to create the engine in C++. There was no error when using trtexec, so I realised that trtexec registers the plugins by itself.

You need to do the same thing yourself in C++.

It is just one line.

Include these headers:

#include <NvInferPlugin.h>
#include <NvInferPluginUtils.h>

and use this line:
initLibNvInferPlugins(&m_logger, "");
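
For example, when deserializing an engine, the call goes in before the runtime is created. Something like this rough sketch (the logger class and engine path are just placeholders, not my exact code):

#include <NvInfer.h>
#include <NvInferPlugin.h>
#include <cstdio>
#include <fstream>
#include <memory>
#include <vector>

class MyLogger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
};

int main()
{
    MyLogger m_logger;

    // Register the bundled TensorRT plugins (EfficientNMS_TRT among them)
    // before the engine is deserialized.
    initLibNvInferPlugins(&m_logger, "");

    // Read the serialized engine from disk ("model.engine" is a placeholder path).
    std::ifstream file("model.engine", std::ios::binary | std::ios::ate);
    const std::streamsize size = file.tellg();
    file.seekg(0, std::ios::beg);
    std::vector<char> blob(static_cast<size_t>(size));
    file.read(blob.data(), size);

    // The runtime must outlive the engine it creates.
    auto runtime = std::unique_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(m_logger));
    auto engine = std::unique_ptr<nvinfer1::ICudaEngine>(
        runtime->deserializeCudaEngine(blob.data(), blob.size()));

    std::printf("%s\n", engine ? "Engine deserialized" : "Failed to deserialize the engine");
    return engine ? 0 : 1;
}

The same init call applies when building the engine from ONNX instead of loading a serialized one.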

Hi @rajupadhyay59, thanks for the suggestion. I already had that line in my code to initialize the plugins under the "" namespace, but the above error still shows up.
The reason this is confusing is that, as I showed in the snippet in the original post, I do see the plugin name printed (e.g. EfficientNMS_TRT), which leads me to believe it did get registered with my app; yet at runtime the plugin error still occurs when the model/engine is parsed.

For me, once I used those lines, my C++ code was able to create the engine.

I referred to this GitHub repo

It is for YOLO, but I made changes for detectron2. The code itself does not change much apart from parameters and such; insert the plugin line too and it may work.

Hi @rajupadhyay59,
As mentioned above, I already had the plugin init line in my code and the error still showed up. Could you share a snippet with the changes you made for detectron2?

Try these out; it has been a while since I used them, but I think they should work.
Please adjust the Makefile accordingly, though.

scripts.zip (14.5 KB)

Thank you very much @rajupadhyay59, I will check it out. Just to confirm, are these the scripts that worked with detectron2 for you?

Yes, they did. Try it with FP16 though.

Hi @rajupadhyay59,
Thank you so much for your responses. If it's not anything proprietary, would you also be able to provide the detectron2 ONNX model weights you used with the scripts to build the engine and run inference?

I do not have it with me now.
You can use the normal Mask R-CNN detectron2 weights and generate an ONNX file from them.
Good luck.

Hi @rajupadhyay59,

Thank you so much for your responses. I tried running your scripts with a Mask R-CNN detectron2 ONNX model, using the corresponding converted.onnx that was generated for TensorRT graph compatibility with the EfficientNMS_TRT plugin, as noted in the TensorRT GitHub readme - TensorRT/samples/python/detectron2 at release/8.6 · NVIDIA/TensorRT · GitHub

I still get the same error I reported earlier about not being able to find the plugin EfficientNMS_TRT. Please see the console output snippet below from the same scripts you shared above:

Searching for engine file with name: converted.engine.NVIDIAGeForceRTX3050TiLaptopGPU.fp16.1.1
Engine not found, generating. This could take a while...
CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
3: getPluginCreator could not find plugin: EfficientNMS_TRT version: 1
terminate called after throwing an instance of 'std::runtime_error'
  what():  Unable to build TRT engine.

Do you have any suggestions on what could be going wrong here?

I can't help much there. If your engine can be built and run with trtexec, then honestly I do not know what issue you are facing. I had the same problem and solved it using the scripts I sent you.

How about first trying to run all of this in a Docker image provided by NVIDIA? Any image is fine, a DeepStream Docker image too.

Good luck.