I’m running inference with FP16 precision on a TX1 with a batch size of 2. However, the inference time is exactly twice the inference time with a batch size of 1. I tried this previously with TensorRT 1, and inference times were the same for batch sizes 1 and 2 when using FP16, which makes sense to me.
What could I be doing wrong? I query the fast FP16 capability with the builder object's platformHasFastFp16() function and it returns true. Then I set the data type to kHALF and finally call setHalf2Mode(true) before building the engine.
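For context, here is a minimal sketch of the build flow described above. I'm assuming a Caffe-parser workflow with placeholder file and blob names; the relevant calls are the ones mentioned in the previous paragraph:

```cpp
#include "NvInfer.h"
#include "NvCaffeParser.h"
using namespace nvinfer1;
using namespace nvcaffeparser1;

// gLogger is an ILogger instance defined elsewhere
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();

ICaffeParser* parser = createCaffeParser();
// Parse the weights as FP16, as described above (file names are placeholders)
const IBlobNameToTensor* blobs =
    parser->parse("deploy.prototxt", "model.caffemodel", *network, DataType::kHALF);
network->markOutput(*blobs->find("prob"));   // "prob" is a placeholder output name

if (builder->platformHasFastFp16())          // returns true on the TX1
    builder->setHalf2Mode(true);             // request the paired-FP16 kernels

builder->setMaxBatchSize(2);
ICudaEngine* engine = builder->buildCudaEngine(*network);
```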
How can I verify that inference is in fact running in Half2Mode? Should engine->getBindingDataType() return kHALF for the input and output bindings? It does not, and when I use TensorRT's profiler interface, I don't see any layer at the inputs or outputs that would convert between 32-bit and 16-bit float. The OP of https://devtalk.nvidia.com/default/topic/1028136/tensorrt-fp16-data-type-conversion/?offset=1 mentioned it's called nchwToNchhw2; should I see this among the layers, or only as a kernel if I run the project with nvprof?
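This is roughly how I try to inspect it; the IProfiler callback and the binding loop are standard API, but which layer names I should expect to see is exactly what I'm unsure about (batchSize and buffers are set up elsewhere):

```cpp
#include "NvInfer.h"
#include <iostream>
using namespace nvinfer1;

// Prints every layer the engine actually runs, so a conversion/reformat layer should show up here
struct LayerPrinter : public IProfiler
{
    void reportLayerTime(const char* layerName, float ms) override
    {
        std::cout << layerName << ": " << ms << " ms" << std::endl;
    }
};

LayerPrinter profiler;
IExecutionContext* context = engine->createExecutionContext();
context->setProfiler(&profiler);
context->execute(batchSize, buffers);   // profiling only reports on the synchronous execute()

// Binding data types: these come back as kFLOAT for me, not kHALF
for (int i = 0; i < engine->getNbBindings(); ++i)
    std::cout << engine->getBindingName(i) << " -> "
              << (engine->getBindingDataType(i) == DataType::kHALF ? "kHALF" : "kFLOAT")
              << std::endl;
```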
Your help is much appreciated,