I have a 3 module network, let’s say it consists of a backbone, “someOperation” and head. This someOperation module in the middle is not supported by TRT.
Is there a performance difference between implementing “someOperation” as a TRT plugin that is built into the engine, and splitting the network into two engines (backbone and head) that are bridged with a plain CUDA kernel?
Please refer to the links below on custom plugin implementation and the related sample:
The IPluginV2 and IPluginV2Ext interfaces are still supported for backward compatibility with TensorRT 5.1 and 6.0.x respectively. However, we recommend that you write new plugins, or refactor existing ones, to target the IPluginV2DynamicExt or IPluginV2IOExt interfaces instead.
If the application enqueues everything on the same stream, we don’t think there would be a performance difference, provided the kernel is the only connection between the first and last module. If there are other connections between the modules, then the plugin could be scheduled differently inside a single engine, allowing more efficient execution.
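To make the same-stream case concrete, here is a minimal sketch of the "two modules bridged with plain CUDA" layout. It is not TensorRT code: `backboneStandIn` and `headStandIn` are hypothetical placeholder kernels standing in for the enqueue calls of the two engines, and the arithmetic in `someOperation` is made up for illustration. The point it shows is that work submitted to one stream runs in submission order, so no host-side synchronization is needed between the first module, the bridging kernel, and the last module.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for enqueueing the backbone engine: y = 2 * x.
__global__ void backboneStandIn(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// The unsupported operation written as a plain CUDA kernel (made-up math): y = x + 1.
__global__ void someOperation(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

// Hypothetical stand-in for enqueueing the head engine: y = x * x.
__global__ void headStandIn(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

static void check(cudaError_t err) {
    if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
}

// Runs backbone -> someOperation -> head on a single stream.
std::vector<float> runPipeline(const std::vector<float>& input) {
    const int n = static_cast<int>(input.size());
    std::vector<float> output(n);
    if (n == 0) return output;

    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaStream_t stream;
    float* bufA = nullptr;
    float* bufB = nullptr;
    check(cudaStreamCreate(&stream));
    check(cudaMalloc(&bufA, bytes));
    check(cudaMalloc(&bufB, bytes));

    // Everything below is enqueued on the same stream, so it executes in
    // submission order; there is no host synchronization between the stages.
    check(cudaMemcpyAsync(bufA, input.data(), bytes, cudaMemcpyHostToDevice, stream));
    backboneStandIn<<<blocks, threads, 0, stream>>>(bufA, bufB, n);
    someOperation<<<blocks, threads, 0, stream>>>(bufB, bufA, n);
    headStandIn<<<blocks, threads, 0, stream>>>(bufA, bufB, n);
    check(cudaGetLastError());
    check(cudaMemcpyAsync(output.data(), bufB, bytes, cudaMemcpyDeviceToHost, stream));

    // Single synchronization point, only once the final result is needed.
    check(cudaStreamSynchronize(stream));

    check(cudaFree(bufA));
    check(cudaFree(bufB));
    check(cudaStreamDestroy(stream));
    return output;
}
```

Calling `runPipeline({1.0f, 2.0f, 3.0f})` from a `main` compiled with `nvcc` exercises the full chain. In a real application, the two stand-in launches would be replaced by the enqueue calls of the backbone and head execution contexts on the same `stream`, with the output buffer of one bound as the input of the next stage.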