TensorRT fp16 plugin

Hi,
I noticed that in half mode, before executing a plugin, TensorRT converts the data to float32 and at the end converts it back to float16.
Is it possible to write a plugin that works directly on the fp16 data, without the conversion?
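
For illustration, here is a rough sketch of what such a plugin could look like with a plugin interface that lets the plugin advertise FP16 support (newer TensorRT releases expose this through IPluginExt / IPluginV2; only the format-related overrides are shown, and the class name is made up):

// Sketch only: assumes a TensorRT release that provides nvinfer1::IPluginExt.
#include <NvInfer.h>

class FP16ActivationPlugin : public nvinfer1::IPluginExt
{
public:
    // Tell the builder this plugin accepts FP16 (and FP32) NCHW tensors.
    bool supportsFormat(nvinfer1::DataType type, nvinfer1::PluginFormat format) const override
    {
        return (type == nvinfer1::DataType::kHALF || type == nvinfer1::DataType::kFLOAT)
            && format == nvinfer1::PluginFormat::kNCHW;
    }

    // Remember which type the builder actually selected, so enqueue() can
    // dispatch to an FP16 or FP32 kernel accordingly.
    void configureWithFormat(const nvinfer1::Dims* inputDims, int nbInputs,
                             const nvinfer1::Dims* outputDims, int nbOutputs,
                             nvinfer1::DataType type, nvinfer1::PluginFormat format,
                             int maxBatchSize) override
    {
        mType = type;
    }

private:
    nvinfer1::DataType mType{nvinfer1::DataType::kFLOAT};

    // ... the remaining IPlugin methods (getNbOutputs, getOutputDimensions,
    // initialize, terminate, getWorkspaceSize, enqueue, getSerializationSize,
    // serialize) still have to be implemented as usual ...
};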

thanks

What advantages do you think such a solution would have?

Since I have a network model like this:
Conv2D
Activation
Conv2D
Activation
Pool
Conv2D
Activation
Conv2D
Activation
etc.

with a custom activation function implemented as a TensorRT plugin, it has to do a lot of conversions to run in half-precision mode.
Because of that it gains much less speedup than the same network with ReLU activations.
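
For illustration, if the plugin could be handed __half buffers directly, the FP16/FP32 conversion would not need to be a separate pass over the tensor; it can happen per element in registers inside the activation kernel. A rough sketch, with swish (x * sigmoid(x)) standing in for the actual custom activation:

// Sketch only: a custom activation that reads and writes FP16 directly but
// does the arithmetic in FP32 registers. The conversion intrinsics work on
// any recent GPU; no fast-FP16 hardware is required for this pattern.
#include <cuda_fp16.h>

__global__ void customActivationHalfIO(const __half* in, __half* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float x = __half2float(in[i]);    // FP16 -> FP32 in registers
        float y = x / (1.0f + expf(-x));  // placeholder: swish(x)
        out[i] = __float2half(y);         // FP32 -> FP16 in registers
    }
}

// Example launch, e.g. from a plugin's enqueue():
//   int threads = 256;
//   int blocks  = (count + threads - 1) / threads;
//   customActivationHalfIO<<<blocks, threads, 0, stream>>>(
//       static_cast<const __half*>(inputs[0]),
//       static_cast<__half*>(outputs[0]), count);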

Does the CUDA profiler show that the code is limited by computational throughput?

FP16 computation in intermediate steps only makes sense if you have a GPU with high FP16 throughput. As far as I am aware, there are only two of those at the moment: P100 and V100. If you have one of those, more power to you.

All other GPUs have only rudimentary FP16 computation capabilities (or none), so doing intermediate computation in FP32 is the way to go for performance. The FP16/FP32 conversion overhead is often negligible, drowned out by memory traffic. Using FP16 as a storage format helps reduce this memory overhead.

Depending on the specifics of your computation, doing everything in FP16 may also negatively affect accuracy.
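
For completeness, on the high-FP16-throughput GPUs mentioned above the arithmetic itself can also be done on packed half2 data, two values per instruction. A minimal sketch (requires compute capability 5.3 or later; the scale/bias operation and kernel name are just examples):

// Sketch only: packed FP16 arithmetic (two values per instruction) for GPUs
// with fast FP16 (e.g. P100/V100). Requires compute capability >= 5.3; on
// other GPUs the FP16-storage/FP32-compute pattern is the better choice.
#include <cuda_fp16.h>

__global__ void scaleBiasHalf2(const __half2* in, __half2* out, int nPairs,
                               __half2 scale, __half2 bias)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPairs)
    {
        // out = scale * in + bias, computed on two FP16 values at once
        out[i] = __hfma2(scale, in[i], bias);
    }
}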

Hi,

I am using TensorRT 2.1 for inference with my Caffe models on a GTX 1080 Ti, which has FP16 and INT8 support.
I am able to run ./sample_mnist_int8 successfully, but when I run “bin/giexec --deploy=lenet.prototxt --model=lenet_iter_10000.caffemodel --output=prob --half2=true”, TensorRT prints the warning “Half2 support requested on hardware without native FP16 support, performance will be negatively affected.”

I don’t understand why INT8 works but FP16 does not, since the 1080 Ti has both features. See the logs below:

~/no_backup/d1230/TensorRT-2.1.2/data/mnist> ../../bin/giexec --deploy=lenet.prototxt --model=lenet_iter_10000.caffemodel --output=prob --half2=false --batch=12

deploy: lenet.prototxt
model: lenet_iter_10000.caffemodel
output: prob
half2
batch: 12
Input “data”: 1x28x28
Output “prob”: 10x1x1
Half2 support requested on hardware without native FP16 support, performance will be negatively affected.
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 10 runs is 0.184803 ms.
Average over 10 runs is 0.176173 ms.
Average over 10 runs is 0.172307 ms.
Average over 10 runs is 0.172182 ms.
Average over 10 runs is 0.172362 ms.
Average over 10 runs is 0.170336 ms.
Average over 10 runs is 0.185437 ms.
Average over 10 runs is 0.171155 ms.
Average over 10 runs is 0.169658 ms.
Average over 10 runs is 0.171379 ms.
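
One way to double-check what TensorRT itself reports for this GPU would be to query the builder directly. A minimal sketch, assuming the platformHasFastFp16()/platformHasFastInt8() queries on IBuilder are available in this release:

// Sketch only: ask the builder what the platform supports before requesting
// a precision mode. On a GTX 1080 Ti I would expect fast INT8 but not fast
// FP16, which would match the warning above.
#include <NvInfer.h>
#include <cstdio>

// Minimal logger, as required by createInferBuilder().
class Logger : public nvinfer1::ILogger
{
public:
    void log(Severity severity, const char* msg) override
    {
        std::printf("[TRT] %s\n", msg);
    }
} gLogger;

int main()
{
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
    std::printf("fast FP16: %d\n", builder->platformHasFastFp16());
    std::printf("fast INT8: %d\n", builder->platformHasFastInt8());
    builder->destroy();
    return 0;
}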

Thanks !!!