I am using TensorRT 4 (TRT 4) on a Jetson TX2. When I run my code on the kit, a kernel named void cuint8::nchwToNchhw2 is called before and after each convolutional layer. This happens whether I pass the weights as kFLOAT or as kHALF. The kernel takes a lot of time and lowers my FPS. What does it do, and how can I avoid calling it?
I have also modified my own kernels to work on __half so that FP16 can be fully utilized.
My input is in NCHW format with data type kHALF.