I am deploying a model with TensorRT 2.1 on a Jetson TX2 in FP16 mode. In the profiler, I see that the generated engine always starts with an FP32-to-FP16 format conversion kernel, “nchwToNchhw2”, which takes about 3 ms per frame.
Is there any way to skip this conversion kernel and have the engine operate directly on a half2 input tensor? I ask because shaving off even just these 3 ms would correspond to a speedup of almost 15% for my application. (I also suspect that I could perform the FP16 conversion in a single step, fused into a custom color-conversion CUDA kernel that already runs before the TensorRT model.)
If this is actually possible, my obvious next question is: what is the spec of this “nchhw2” format? Is it just a normal tensor packed as NCHW with a half2 data type, or are there other layout changes as well?
Any information on this would be greatly appreciated! :)