TensorRT plugin layer is much slower than Caffe

Hi,

I am porting a Caffe upsample layer (BVLC/caffe Pull Request #6384, "add efficient upsample layer" by twmht) to a TensorRT plugin.

However, the plugin is much slower than the original Caffe layer.

Most of the GPU code is identical to the Caffe implementation.

When the upsample input is 256 * 40 * 52 (CHW), the layer takes 19.116 ms in TensorRT.

However, with the same input volume, the Caffe layer takes only 0.03 ms.

This is a huge speed gap.
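To put the gap in numbers, here is a rough Python sketch of the arithmetic. It only uses the two timings and the input shape quoted above; the variable names are mine and purely illustrative.

```python
# Input shape from the post: 256 * 40 * 52 (CHW)
channels, height, width = 256, 40, 52
num_elements = channels * height * width

# Timings as measured in the post, in milliseconds
tensorrt_ms = 19.116
caffe_ms = 0.03

# How many times slower the TensorRT plugin is than the Caffe layer
slowdown = tensorrt_ms / caffe_ms

print(f"input elements: {num_elements}")
print(f"TensorRT plugin is about {slowdown:.0f}x slower than Caffe")
```

So for roughly half a million input elements, the plugin is slower by a factor of several hundred.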

Any ideas?