I’m running inference with FP16 precision on a TX1 with a batch size of 2. However, the inference time is exactly twice the inference time with a batch size of 1. I tried this previously with TensorRT 1, and inference times were the same for batch sizes 1 and 2 when using FP16, which makes sense to me.
What could I be doing wrong? I query the fast FP16 capability with the builder object's platformHasFastFp16() function and it returns true. Then I set the data type to kHALF and finally call setHalf2Mode(true) before building the engine.
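For context, here is a minimal sketch of the build flow described above. I'm assuming a Caffe-parser workflow with placeholder file and blob names; the relevant calls are the ones mentioned in the previous paragraph:

```cpp
#include "NvInfer.h"
#include "NvCaffeParser.h"
using namespace nvinfer1;
using namespace nvcaffeparser1;

// gLogger is an ILogger instance defined elsewhere
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();

ICaffeParser* parser = createCaffeParser();
// Parse the weights as FP16, as described above (file names are placeholders)
const IBlobNameToTensor* blobs =
    parser->parse("deploy.prototxt", "model.caffemodel", *network, DataType::kHALF);
network->markOutput(*blobs->find("prob"));   // "prob" is a placeholder output name

if (builder->platformHasFastFp16())          // returns true on the TX1
    builder->setHalf2Mode(true);             // request the paired-FP16 kernels

builder->setMaxBatchSize(2);
ICudaEngine* engine = builder->buildCudaEngine(*network);
```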
How can I verify that inference is in fact running in Half2Mode? Should engine->getBindingDataType() return kHALF for the input and output bindings? It does not, and when I use TensorRT's profiler interface, I don't see any layer at the inputs or outputs that would convert between 32-bit and 16-bit float. The OP of https://devtalk.nvidia.com/default/topic/1028136/tensorrt-fp16-data-type-conversion/?offset=1 mentioned it's called nchwToNchhw2; should I see this among the layers, or only as a kernel if I run the project with nvprof?
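This is roughly how I try to inspect it; the IProfiler callback and the binding loop are standard API, but which layer names I should expect to see is exactly what I'm unsure about (batchSize and buffers are set up elsewhere):

```cpp
#include "NvInfer.h"
#include <iostream>
using namespace nvinfer1;

// Prints every layer the engine actually runs, so a conversion/reformat layer should show up here
struct LayerPrinter : public IProfiler
{
    void reportLayerTime(const char* layerName, float ms) override
    {
        std::cout << layerName << ": " << ms << " ms" << std::endl;
    }
};

LayerPrinter profiler;
IExecutionContext* context = engine->createExecutionContext();
context->setProfiler(&profiler);
context->execute(batchSize, buffers);   // profiling only reports on the synchronous execute()

// Binding data types: these come back as kFLOAT for me, not kHALF
for (int i = 0; i < engine->getNbBindings(); ++i)
    std::cout << engine->getBindingName(i) << " -> "
              << (engine->getBindingDataType(i) == DataType::kHALF ? "kHALF" : "kFLOAT")
              << std::endl;
```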
Your help is much appreciated,