Xavier TensorRT MNIST: FP16 is slower than FP32?

I modified the TensorRT sampleUffMNIST to test FP32 and FP16; the code looks like this:

if (gUseFp16) {
    std::cout << "use fp16" << std::endl;
    // Parse the UFF model with FP16 weights and enable FP16 kernels in the builder.
    if (!parser->parse(uffFile, *network, nvinfer1::DataType::kHALF))
        RETURN_AND_LOG(nullptr, ERROR, "Fail to parse");
    builder->setFp16Mode(true);
} else {
    std::cout << "use fp32" << std::endl;
    if (!parser->parse(uffFile, *network, nvinfer1::DataType::kFLOAT))
        RETURN_AND_LOG(nullptr, ERROR, "Fail to parse");
}
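
One check that may be worth adding before enabling FP16 is whether the device reports fast FP16 support at all. A minimal sketch, reusing the same builder and gUseFp16 flag as above (platformHasFastFp16() is part of nvinfer1::IBuilder):

// Sketch: warn if FP16 is requested but the platform reports no fast FP16 path.
if (gUseFp16 && !builder->platformHasFastFp16())
    std::cout << "warning: platform reports no fast fp16 support" << std::endl;

On Xavier this check should pass, so it does not explain the slowdown by itself; it mainly helps when the same code is run on other boards.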

I get the following output:

./sample_uff_mnist --fp16

…/data/mnist/lenet5.uff
use fp16
run[0] use 0.62032 ms.
run[1] use 0.5256 ms.
run[2] use 0.632768 ms.
run[3] use 0.579488 ms.
run[4] use 0.59328 ms.
run[5] use 0.541376 ms.
run[6] use 0.582144 ms.
run[7] use 0.58704 ms.
run[8] use 0.58224 ms.
run[9] use 0.56464 ms.
Average over 10 runs is 0.58089 ms.

./sample_uff_mnist

…/data/mnist/lenet5.uff
use fp32
run[0] use 0.5016 ms.
run[1] use 0.55312 ms.
run[2] use 0.463584 ms.
run[3] use 0.35792 ms.
run[4] use 0.399872 ms.
run[5] use 0.5304 ms.
run[6] use 0.383392 ms.
run[7] use 0.531232 ms.
run[8] use 0.45376 ms.
run[9] use 0.44672 ms.
Average over 10 runs is 0.46216 ms.

Hi,

Have you maximized the device performance before profiling?

sudo nvpmodel -m 0
sudo jetson_clocks

Thanks.

Yes, I had already maximized the device performance.

Hi,

Which JetPack version do you use?
If you haven't tried v4.2.1 yet, would you mind giving it a try first?

We will also try to reproduce this issue internally.
We will update you with more information once we find anything.

Thanks.

I use JetPack 4.2.

head -n 1 /etc/nv_tegra_release

R31 (release), REVISION: 1.0, GCID: 13194883, BOARD: t186ref, EABI: aarch64, DATE: Wed Oct 31 22:26:16 UTC 2018

I got the deb packages from https://developer.nvidia.com/assets/embedded/secure/tools/files/jetpack-sdks/jetpack-4.2/JETPACK_42_b158/P2888/ and installed them:

libcudnn7_7.3.1.28-1+cuda10.0_arm64.deb
libcudnn7-dev_7.3.1.28-1+cuda10.0_arm64.deb
libcudnn7-doc_7.3.1.28-1+cuda10.0_arm64.deb
libnvinfer5_5.0.6-1+cuda10.0_arm64.deb
libnvinfer-dev_5.0.6-1+cuda10.0_arm64.deb
libnvinfer-samples_5.0.6-1+cuda10.0_all.deb
tensorrt_5.0.6.3-1+cuda10.0_arm64.deb

Hi,

We have already passed this issue to our internal team.
We will update you once we get any feedback.

By the way, we reproduced this issue with JetPack 4.2.1 and got much better performance than yours:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
fp32: Average over 10 runs is 0.288379 ms
fp16: Average over 10 runs is 0.371939 ms

It's worth giving JetPack 4.2.1 a try.
Thanks.

Thanks.

I reflashed the system with JetPack 4.2.1 and ran nvpmodel -m 0 and jetson_clocks.

head -n 1 /etc/nv_tegra_release

R32 (release), REVISION: 2.0, GCID: 15966166, BOARD: t186ref, EABI: aarch64, DATE: Wed Jul 17 00:26:04 UTC 2019

nvpmodel -q --verbose

NV Fan Mode:quiet
NV Power Mode: MAXN
0

I commented out some prints and added a print of each context->execute() time (a rough sketch of that timing change is shown after the results below); the test results look like this:

root@nvidia-desktop:/opt/project/nvidia/xavier/tensorrt/bin# ./sample_uff_mnist
&&&& RUNNING TensorRT.sample_uff_mnist # ./sample_uff_mnist
[I] …/data/mnist/lenet5.uff
[I] runInFp16=[0] runInInt8=[0]
[I] run[0] use 0.814376 ms.
[I] run[1] use 0.531791 ms.
[I] run[2] use 0.412644 ms.
[I] run[3] use 0.344734 ms.
[I] run[4] use 0.398851 ms.
[I] run[5] use 0.397379 ms.
[I] run[6] use 0.362752 ms.
[I] run[7] use 0.328668 ms.
[I] run[8] use 0.218451 ms.
[I] run[9] use 0.424869 ms.
[I] Average over 10 runs is 0.423451 ms.
&&&& FAILED TensorRT.sample_uff_mnist # ./sample_uff_mnist
root@nvidia-desktop:/opt/project/nvidia/xavier/tensorrt/bin# ./sample_uff_mnist --fp16
&&&& RUNNING TensorRT.sample_uff_mnist # ./sample_uff_mnist --fp16
[I] …/data/mnist/lenet5.uff
[I] runInFp16=[1] runInInt8=[0]
[I] run[0] use 0.932015 ms.
[I] run[1] use 0.56856 ms.
[I] run[2] use 0.553006 ms.
[I] run[3] use 0.425444 ms.
[I] run[4] use 0.449958 ms.
[I] run[5] use 0.487913 ms.
[I] run[6] use 0.527181 ms.
[I] run[7] use 0.505802 ms.
[I] run[8] use 0.4746 ms.
[I] run[9] use 0.516428 ms.
[I] Average over 10 runs is 0.544091 ms.
&&&& FAILED TensorRT.sample_uff_mnist # ./sample_uff_mnist --fp16
root@nvidia-desktop:/opt/project/nvidia/xavier/tensorrt/bin# ./sample_uff_mnist --int8
&&&& RUNNING TensorRT.sample_uff_mnist # ./sample_uff_mnist --int8
[I] …/data/mnist/lenet5.uff
[I] runInFp16=[0] runInInt8=[1]
[W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32.
[I] run[0] use 0.762366 ms.
[I] run[1] use 0.495273 ms.
[I] run[2] use 0.409698 ms.
[I] run[3] use 0.388992 ms.
[I] run[4] use 0.418018 ms.
[I] run[5] use 0.3824 ms.
[I] run[6] use 0.438916 ms.
[I] run[7] use 0.421731 ms.
[I] run[8] use 0.420898 ms.
[I] run[9] use 0.417058 ms.
[I] Average over 10 runs is 0.455535 ms.
&&&& FAILED TensorRT.sample_uff_mnist # ./sample_uff_mnist --int8
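
For reference, the per-run timing print mentioned above is roughly the following (a minimal sketch; context, batchSize, and buffers stand for the sample's existing execution context and device bindings, and the run count of 10 just mirrors the output above):

#include <chrono>
#include <iostream>

// Sketch: IExecutionContext::execute() is synchronous, so wall-clock time
// around the call covers the whole inference run.
for (int run = 0; run < 10; ++run)
{
    auto start = std::chrono::high_resolution_clock::now();
    context->execute(batchSize, buffers);
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double, std::milli> ms = end - start;
    std::cout << "run[" << run << "] use " << ms.count() << " ms." << std::endl;
}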

Hi,

The order in which the commands are executed leads to different behavior.
Please set the power mode first and then lock the clocks to the maximum frequency:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

We are still checking why FP16 runs slower than FP32 mode.
Will update more information here once we find something.

Thanks.

OK, thanks.

Hi,

Sorry for keeping you waiting.
It's recommended to use trtexec for performance analysis rather than sample_uff.
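
For example, something along these lines should give cleaner numbers (a sketch only: the tensor names passed to --uffInput and --output are assumed to be the "in"/"out" names the MNIST sample registers for lenet5.uff, so adjust them to match your model, and check ./trtexec --help for the exact options in your TensorRT release):

$ ./trtexec --uff=../data/mnist/lenet5.uff --uffInput=in,1,28,28 --output=out
$ ./trtexec --uff=../data/mnist/lenet5.uff --uffInput=in,1,28,28 --output=out --fp16

trtexec averages over many iterations, which helps smooth out the per-run noise visible in the ten-run averages above.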

Thanks.