Why is TensorRT's performance poor after adding a custom op?

Hi, NV experts:
I have a custom op that is not supported by TensorRT, so I added it to TensorRT as a plugin.
I found that the total inference time increased by about 10 ms.
My tests were as follows:

  1. I removed this custom op from my ONNX file and exported it as a .plan file through trtexec; the cost of the whole network was about 50 ms.
  2. I added this custom op (it just does a cudaMemcpy of a little data; see the sketch after this list) to my ONNX file and exported it as a .plan file through trtexec; the cost of the whole network was about 60 ms.
  3. I made my code return directly in the enqueue function, and the cost of the whole network was still about 60 ms. The code looks like this:
int MyPluginDynamic::enqueue(const nvinfer1::PluginTensorDesc* inputDesc,
                             const nvinfer1::PluginTensorDesc* outputDesc,
                             const void* const* inputs, void* const* outputs,
                             void* workspace, cudaStream_t stream) TRT_NOEXCEPT {
    return 0;  // return immediately without launching any work
}
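
For reference, this is roughly what the enqueue from step 2 does. It is a minimal sketch, not the exact code: it assumes one input, one output, and FP32 data, and the byte count is just the product of the input dimensions.

// Sketch of the step-2 enqueue: an async device-to-device copy of the first
// input to the first output, on the stream TensorRT provides, so no extra
// synchronization point is introduced. One input/one output and FP32 are
// assumptions; the real op may differ. (Requires NvInfer.h and cuda_runtime_api.h.)
int MyPluginDynamic::enqueue(const nvinfer1::PluginTensorDesc* inputDesc,
                             const nvinfer1::PluginTensorDesc* outputDesc,
                             const void* const* inputs, void* const* outputs,
                             void* workspace, cudaStream_t stream) TRT_NOEXCEPT {
    // Element count = product of the (concrete, at enqueue time) input dims.
    size_t count = 1;
    for (int i = 0; i < inputDesc[0].dims.nbDims; ++i)
        count *= inputDesc[0].dims.d[i];
    cudaError_t err = cudaMemcpyAsync(outputs[0], inputs[0],
                                      count * sizeof(float),
                                      cudaMemcpyDeviceToDevice, stream);
    return err == cudaSuccess ? 0 : -1;
}

Even when this body is replaced by the bare return 0; shown in step 3, the ~10 ms gap remains, so the copy itself does not seem to be the cost.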

I don't know why TRT's performance is poor after I add such a small custom op. My guesses:

  1. there is some secret about TRT that I don't know;
  2. my op introduces extra overhead that I don't know about (one way to check this is per-layer profiling with trtexec, sketched below).

So, is there anyone who would like to teach me this secret?
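
A minimal sketch of that per-layer check, assuming the engine file is named model.plan and the plugin is built as libmy_plugin.so (both names are placeholders):

trtexec --loadEngine=model.plan --plugins=libmy_plugin.so --dumpProfile --separateProfileRun

--dumpProfile prints the time spent in each layer, so the plugin layer, and any reformat layers TRT may insert around it, would show their individual cost.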

Nobody would like to help me?