Half2Mode (fast FP16) on TX1 with TensorRT 2.1 doesn't seem to work

Hi Everyone,

I’m running inference with FP16 precision on a TX1 with a batch size of 2. However, the inference times are exactly twice those with a batch size of 1. I tried this previously with TensorRT 1, and inference times were the same for batch sizes 1 and 2 when using FP16, which makes sense to me.

What could I be doing wrong? I query the fast-FP16 capability with the builder object’s platformHasFastFp16() function and it returns true. Then I set the data type to kHALF and finally call setHalf2Mode(true) before building the engine.

How can I verify that inference in fact runs in Half2Mode? Should engine->getBindingDataType() return kHALF for the input and output bindings? It does not, and when I use TensorRT’s profiler interface, I don’t see any layer at the inputs or outputs that would convert between 32-bit and 16-bit float. The OP of https://devtalk.nvidia.com/default/topic/1028136/tensorrt-fp16-data-type-conversion/?offset=1 mentioned it’s called nchwToNchhw2; should I see this among the layers, or only as a kernel if I run the project with nvprof?
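For context, the profiler I attach is a minimal IProfiler implementation along these lines (a sketch against the TensorRT 2.x API; everything other than the IProfiler/IExecutionContext calls is a placeholder name of mine):

```cpp
#include <iostream>
#include "NvInfer.h"

// Minimal IProfiler implementation: TensorRT calls reportLayerTime()
// once per layer after each execute() when a profiler is attached,
// so any FP32<->FP16 reformat layers should show up in this list.
struct LayerProfiler : public nvinfer1::IProfiler
{
    void reportLayerTime(const char* layerName, float ms) override
    {
        std::cout << layerName << ": " << ms << " ms" << std::endl;
    }
};

// Usage sketch (context is an IExecutionContext built from the engine):
//   LayerProfiler profiler;
//   context->setProfiler(&profiler);
//   context->execute(batchSize, buffers); // synchronous execute(), as
//                                         // profiling does not work with enqueue()
```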

Your help is much appreciated,


Could you check your model with our native TensorRT sample and share the results with us first?

cp -r /usr/src/tensorrt/ .
cd tensorrt/samples/
make
cd ../bin/
./giexec --deploy=/path/to/prototxt --half2 --batch=1 --output=/name/of/output
./giexec --deploy=/path/to/prototxt --half2 --batch=2 --output=/name/of/output

Ex. for googlenet located at $TensorRT_root/data/googlenet/

./giexec --deploy=../data/googlenet/googlenet.prototxt --half2 --batch=1 --output=prob
./giexec --deploy=../data/googlenet/googlenet.prototxt --half2 --batch=2 --output=prob


Hi AastaLLL,

Thanks for getting back to me. Here are the process times for googlenet, using the giexec app. (I skipped parts to avoid being verbose.)

~$ ./giexec --deploy=data/googlenet/googlenet.prototxt --half2 --batch=1 --output=prob
Average over 10 runs is 6.96486 ms.

~$ ./giexec --deploy=data/googlenet/googlenet.prototxt --half2 --batch=2 --output=prob
Average over 10 runs is 12.463 ms.

~$ ./giexec --deploy=data/googlenet/googlenet.prototxt --batch=1 --output=prob
Average over 10 runs is 12.3355 ms.

~$ ./giexec --deploy=data/googlenet/googlenet.prototxt --batch=2 --output=prob
Average over 10 runs is 23.0306 ms.

So, I expected the first two to be identical (both around 12 ms), because FP16 can be leveraged only with batch sizes larger than 1.
(From the TensorRT User Guide: “TensorRT can use 16-bit instead of 32-bit arithmetic and tensors, but this alone may not deliver significant performance benefits. Half2Mode is an execution mode where internal tensors interleave 16-bits from adjacent pairs of images, and is the fastest mode of operation for batch sizes greater than one.”)
My previous measurements with TensorRT 1 showed what the User Guide implies: no difference between inference times for batch sizes 1 and 2 in half2 mode.
Did some internal optimization take place in TensorRT 2.1 that halves inference times even for batch size 1 when using half2 mode (like interleaving adjacent pixels of one image into a 32-bit register)? I can imagine this, because after upgrading to TensorRT 2.1, running the same thing on a TX1 became twice as fast as running it on a Quadro mobile GPU without fast-FP16 support, while previously the Quadro was faster.

Thank you,


We are checking this issue internally and will update you later.



In TensorRT 2 and 3, we vectorize across channels instead of across adjacent elements in the batch.
This improves performance at batch size 1, but can in some cases reduce performance at batch size 2.



Thank you for your answer.

  • Daniel


Does this require setting half2Mode or is this automatic when datatype is set to kHALF?


To run a half-precision network, please remember to configure the parser with DataType::kHALF and to set the setHalf2Mode flag:

const IBlobNameToTensor *blobNameToTensor = parser->parse(locateFile(deployFile).c_str(),
                                                          locateFile(modelFile).c_str(),
                                                          *network,
                                                          DataType::kHALF);
builder->setHalf2Mode(true);

Here is a tutorial for 16-bit inference for your reference: