Deep Learning Inference: Performance validation on TX1

NVIDIA published a whitepaper ( that investigates GPU performance and energy efficiency for deep learning inference.

To achieve high performance on TX1, Caffe is combined with the cuDNN 4 library, which provides a series of optimizations for inference.

cuDNN v4 is applied in Caffe to optimize inference for small batch sizes and, in particular, to improve the performance of reduced-precision floating point: FP16 arithmetic delivers up to 2x the performance of equivalent FP32 arithmetic. The steps below show how to reproduce these performance results with the current release packages on TX1.

* Env-setup:

cuDNN v4, the CUDA 7.0.73 toolkit, and the r23.1 L4T image packages are required.

  1. Flash the r23.1 release image on TX1
  2. Download the cuDNN v4 ARMv7 package from
  3. Download the CUDA 7.0.73 toolkit package from
  4. Tool-chain setup
    $ sudo add-apt-repository universe
    $ sudo apt-get update
    $ sudo apt-get install cmake git aptitude screen g++ libboost-all-dev \
      libgflags-dev libgoogle-glog-dev protobuf-compiler libprotobuf-dev \
      bc libblas-dev libatlas-dev libhdf5-dev libleveldb-dev liblmdb-dev \
      libsnappy-dev libatlas-base-dev python-numpy python-skimage \
      python-protobuf python-pandas
  5. Clone Caffe: “git clone -b experimental/fp16”
  6. Modify Makefile.config
    $ cd caffe
    $ mv Makefile.config.sample Makefile.config
    $ vim Makefile.config
    Line 5: enable “USE_CUDNN := 1”
    Line 17: enable “NATIVE_FP16 := 1”
    Line 41: insert “-gencode arch=compute_53,code=sm_53 \”
    Line 42: change “-gencode arch=compute_50,code=compute_50” to “-gencode arch=compute_53,code=compute_53”
  7. Install the CUDA toolkit
    $ sudo dpkg -i cuda-repo-l4t-r23.1-7-0-local_7.0-73_armhf.deb
    $ sudo apt-get update
    $ sudo apt-get install cuda-toolkit-7-0
    $ export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
  8. Unpack the cuDNN archive: tar zxvf
    Copy the files in the “include” and “lib” directories into /usr/local/cuda/include/ and /usr/local/cuda/lib/
  9. Compile Caffe
    $ cd caffe
    $ make
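For repeatability, the Makefile.config edits in step 6 can also be scripted with sed. This is only a sketch: it assumes the two flags ship commented out as “# USE_CUDN­N := 1” / “# NATIVE_FP16 := 1” (the actual fp16 branch may differ), and it is demonstrated here on a stub file so the commands are self-contained; on the device, run the sed command against caffe/Makefile.config instead.

```shell
# Stub standing in for caffe/Makefile.config (assumed layout, see note above):
CFG=/tmp/Makefile.config
printf '%s\n' \
  '# USE_CUDNN := 1' \
  '# NATIVE_FP16 := 1' \
  'CUDA_ARCH := -gencode arch=compute_50,code=compute_50' > "$CFG"

# Uncomment the cuDNN/FP16 flags and retarget the gencode at the TX1's sm_53
# (the sm_53 and compute_53 targets from lines 41-42 are combined onto one line,
# which nvcc accepts just as well):
sed -i \
  -e 's/^# *\(USE_CUDNN := 1\)/\1/' \
  -e 's/^# *\(NATIVE_FP16 := 1\)/\1/' \
  -e 's|-gencode arch=compute_50,code=compute_50|-gencode arch=compute_53,code=sm_53 -gencode arch=compute_53,code=compute_53|' \
  "$CFG"
```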

* Burst CPU, GPU, and EMC clocks to max

echo "Set Tegra CPUs to max freq"
echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo userspace > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo userspace > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
echo userspace > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq > /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq
cat /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq > /sys/devices/system/cpu/cpu2/cpufreq/scaling_min_freq
cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq > /sys/devices/system/cpu/cpu3/cpufreq/scaling_min_freq
echo "Disable Tegra cpuquiet and set the current governor to runnable"
echo 0 > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable
echo runnable > /sys/devices/system/cpu/cpuquiet/current_governor
echo "Set Max GPU rate"
echo 844800000 > /sys/kernel/debug/clock/override.gbus/rate
echo 1 > /sys/kernel/debug/clock/override.gbus/state
# burst EMC freq to top
echo "Set Max EMC rate"
echo 1 > /sys/kernel/debug/clock/override.emc/state
cat /sys/kernel/debug/clock/emc/max > /sys/kernel/debug/clock/override.emc/rate

* Execute the test app
caffe/build/tools/caffe_fp16 time --model=caffe/models/bvlc_alexnet/deploy.prototxt -gpu 0 -iterations 30
caffe/build/tools/caffe time --model=caffe/models/bvlc_alexnet/deploy.prototxt -gpu 0 -iterations 30

* Modify batch size

$ vim caffe/models/bvlc_alexnet/deploy.prototxt
Modify line 4 (the batch dimension) to “dim: 32” or “dim: 1”
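The vim edit can also be done non-interactively with sed. A sketch, demonstrated on a stub of the prototxt header so it is self-contained (the stub's exact layout is an assumption, not the actual file contents); on the device, point PROTOTXT at caffe/models/bvlc_alexnet/deploy.prototxt instead.

```shell
# Stub standing in for deploy.prototxt (assumed layout; line 4 holds the
# batch dimension, as in the instructions above):
PROTOTXT=/tmp/deploy.prototxt
printf '%s\n' 'name: "AlexNet"' 'input: "data"' 'input_shape {' \
  '  dim: 1' '  dim: 3' '  dim: 227' '  dim: 227' '}' > "$PROTOTXT"

# Change only the first "dim:" entry, i.e. the batch size:
sed -i '0,/dim: [0-9]*/s/dim: [0-9]*/dim: 32/' "$PROTOTXT"
```

The `0,/regexp/` address range is a GNU sed extension that ends at the first matching line, so only the batch dimension is rewritten and the channel/height/width dims are left alone.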

* Test result

AlexNet       | batch size 1       | batch size 32 |
fp16          | 73 img/sec         | 250 img/sec   |
fp32          | 48 img/sec         | 162 img/sec   |
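For reference, throughput figures like the ones above can be derived from the `caffe time` output, which reports an average forward-pass time in milliseconds: img/sec = batch_size × 1000 / avg_forward_ms. A sketch (the 128 ms value is illustrative, not a measured number from this run):

```shell
# img/sec = batch * 1000 / average forward-pass time in ms.
BATCH=32
AVG_FWD_MS=128   # illustrative value; read this from the 'caffe time' log
awk -v b="$BATCH" -v ms="$AVG_FWD_MS" \
  'BEGIN { printf "%.0f img/sec\n", b * 1000 / ms }'
# -> 250 img/sec
```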



I’m benchmarking a custom CNN model using the same environment as detailed above, including running the maxperf script. I just got the fp16 version to produce correct results and may have made a mistake somewhere, but I’m getting quite disappointing runtime performance when using fp16 tensors. About 48 FPS with fp32 as opposed to 42 FPS with fp16.

Here’s an image showing profiler timelines for fp16 and fp32 inference:

Note that the time scales of the two windows are not identical. Total inference time is about 20.4ms for fp32 and 23.4ms for fp16. The reason for fp16 running slower seems to be that multiple kernels are not launched/executed simultaneously to the same extent as when running with fp32 data.

Any ideas as to why this is happening and what I might try to fix it would be great.

Looking closer at the graphs and stats in the visual profiler, my thoughts are that the fp16 slowdown is caused by the precomputed_convolve_sgemm kernel (2nd from top in the tables) using a different launch configuration when running fp16:

fp16: Block Size = [8,32,1], Smem/Block = 10.25KiB, Regs/Thread 128
fp32: Block Size = [8,8,1], Smem/Block = 4.5KiB, Regs/Thread 56

So my guess is that the larger block size and regs/thread of the fp16 launch are preventing this kernel from running in parallel with the implicit_convolve_sgemm kernel (1st in the table), which seems to slow things down significantly.
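A back-of-the-envelope check of that guess, assuming the TX1's Maxwell SM has a 64K-entry register file and ignoring shared-memory and thread-count limits (the block sizes and regs/thread are the profiler figures above):

```shell
awk 'BEGIN {
  regfile = 65536           # registers per Maxwell SM (assumption)
  fp16 = 8 * 32 * 128       # threads/block * regs/thread = regs/block
  fp32 = 8 * 8 * 56
  printf "fp16: %d regs/block -> at most %d blocks/SM\n", fp16, int(regfile / fp16)
  printf "fp32: %d regs/block -> at most %d blocks/SM\n", fp32, int(regfile / fp32)
}'
```

If that arithmetic holds, the fp16 launch can keep only two blocks resident per SM versus eighteen for fp32, which would leave far less room for it to overlap with implicit_convolve_sgemm.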

Question is, what can be done about it?

Hello, lars:
To compare fp16 and fp32 performance, it’s better to use the same parameters.

With different models/parameters, the performance will naturally differ.


Hi ChenJian,

My fp16 and fp32 test cases are identical, apart from the floating point precision being used. The exact same CNN model and exact same sequence of CUDNN/CUBLAS API calls are being made (with the exception of cublasSgemm vs cublasHgemm for the final, linear layers).

I’m guessing you’re referring to the difference in kernel launch configuration that I’m seeing, but that is being determined by CUDNN and is, as far as I know, out of my control.

Has anyone had issues getting the same classification of an image from DIGITS and from running the classifier.bin code? I trained my AlexNet model and submitted a test image, and it came back correct. But when I downloaded the model to the TX1 and ran the classifier.bin executable there, using the same test image as in DIGITS on the host, I got the wrong answer. I have a model trained for cars vs. no cars, and in DIGITS I get 100% “no cars” for a no-car image. When I do the same test on the TX1 with the same no-car image, I get 100% “cars”. What could be the issue here?

Hi !!

I installed everything following this article.
The “example” of “caffe” is OK!! Very fast.

However, “make runtest” stopped with an error. Has anyone else hit this problem?

Like this:

Cuda number of devices: 1
Setting to use device 0
Setting to use device 0
Current device id: 0
Current device name: GM20B
Note: Randomizing tests’ orders with a seed of 36162 .
[==========] Running 2508 tests from 360 test cases.
[----------] Global test environment set-up.
[----------] 3 tests from TanHLayerTest/2, where TypeParam = caffe::CPUDevice<caffe::MultiPrecision<caffe::float16, float> >
[ RUN ] TanHLayerTest/2.TestTanH
src/caffe/test/test_tanh_layer.cpp:64: Failure
The difference between expected_value and Get(top_data[i]) is 0.00022143125534057617, which exceeds tol(precision), where
expected_value evaluates to 0.77126294374465942,
Get(top_data[i]) evaluates to 0.771484375, and
tol(precision) evaluates to 7.7126293035689741e-05.
src/caffe/test/test_tanh_layer.cpp:64: Failure
The difference between expected_value and Get(top_data[i]) is 0.00010585784912109375, which exceeds tol(precision), where

computed_gradient evaluates to 2,
estimated_gradient evaluates to 0, and
tol(threshold_) * scale evaluates to 0.0200042724609375.
debug: (top_id, top_data_id, blob_id, feat_id)=0,125,0,422; feat = 1.3046875; objective+ = 2.609375; objective- = 2.609375
[ FAILED ] SPPLayerTest/2.TestGradient, where TypeParam = caffe::CPUDevice<caffe::MultiPrecision<caffe::float16, float> > (5579 ms)
[ RUN ] SPPLayerTest/2.TestSetup
[ OK ] SPPLayerTest/2.TestSetup (0 ms)
[ RUN ] SPPLayerTest/2.TestForwardBackward
[ OK ] SPPLayerTest/2.TestForwardBackward (1 ms)
[ RUN ] SPPLayerTest/2.TestEqualOutputDims
[ OK ] SPPLayerTest/2.TestEqualOutputDims (0 ms)
[----------] 5 tests from SPPLayerTest/2 (5581 ms total)

[----------] 6 tests from CuDNNConvolutionLayerTest/2, where TypeParam = caffe::MultiPrecision<caffe::float16, float>
[ RUN ] CuDNNConvolutionLayerTest/2.TestSimpleConvolutionCuDNN
[ OK ] CuDNNConvolutionLayerTest/2.TestSimpleConvolutionCuDNN (18 ms)
[ RUN ] CuDNNConvolutionLayerTest/2.TestSimpleConvolutionGroupCuDNN
[ OK ] CuDNNConvolutionLayerTest/2.TestSimpleConvolutionGroupCuDNN (1 ms)
[ RUN ] CuDNNConvolutionLayerTest/2.TestSobelConvolutionCuDNN
F0615 12:46:51.717325 3716 cudnn_conv_layer.cpp:138] Check failed: status == CUDNN_STATUS_SUCCESS (9 vs. 0) CUDNN_STATUS_NOT_SUPPORTED
*** Check failure stack trace: ***
@ 0x432a1060 (unknown)
@ 0x432a0f5c (unknown)
@ 0x432a0b78 (unknown)
@ 0x432a2f98 (unknown)
@ 0x43b224a0 caffe::CuDNNConvolutionLayer<>::Reshape()
@ 0x13bdee caffe::Layer<>::SetUp()
@ 0x3c4350 caffe::CuDNNConvolutionLayerTest_TestSobelConvolutionCuDNN_Test<>::TestBody()
@ 0x4e5788 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x4dfd1a testing::Test::Run()
@ 0x4dfdaa testing::TestInfo::Run()
@ 0x4dfe82 testing::TestCase::Run()
@ 0x4e16da testing::internal::UnitTestImpl::RunAllTests()
@ 0x4e18cc testing::UnitTest::Run()
@ 0x13066a main
@ 0x44302670 (unknown)
make: *** [runtest] Aborted

Has anyone else seen this?


Thank you for all the detailed information. I was able to run GoogLeNet FP16 and reproduce the results presented in the whitepaper!

Now I am a little bit confused about power consumption. I am measuring power at the main supply (the full board). At idle (just logged into Linux), I get 2 W (which is quite low!). But when running the GoogLeNet benchmark I get up to 12 W (at 1 GHz). At this frequency I reach 57 FPS (roughly 5 img/s/W), whereas the whitepaper reports ~5 W => 8 img/s/W.

I tried different frequencies and always get around 4-5 img/s/W.

I checked that the fan’s consumption is negligible (so it does not explain the gap).

Can you give me some clue about how the power measurement was done? Do I need to lower the voltage settings (if that is even possible)?


Be sure you measure with any USB devices either removed or connected through an externally powered hub.

EDIT: I forgot to mention, if you are not using WiFi, disable this as well.


Thank you for your reply, but I do not use WiFi, nor do I have any device connected to the board. Moreover, I suspect those devices would contribute to my ‘idle’ measurement (the benchmark does not generate any activity on peripherals or WiFi).

I searched through this forum and the datasheet, and everything says that a ‘fully loaded’ X1 GPU can reach 10~14 W, which matches my observations.

Therefore, I guess there are special tweaks/settings required to lower power consumption and reach the paper’s numbers. Or maybe I just missed something!


Just an observation: the AlexNet performance numbers reported here are actually quite close to the ones in the whitepaper (

[this post vs the whitepaper]
48 vs 47 (fp32, batch size 1)
73 vs 67 (fp16, batch size 1)

162 vs 155 (fp32, batch size 32 vs 128)
250 vs 258 (fp16, batch size 32 vs 128)

However, for the whitepaper experiments the GPU frequency was set to 690 MHz, while for the experiments reported here it was set to the maximum (998 MHz?).

So the frequency doesn’t seem to play a big role (998 MHz vs 690 MHz), and neither does the batch size (32 vs 128)…

There are some questions:
ubuntu@tegra-ubuntu:~$ sudo apt-get install libboost-dev libboost-all-dev libgflags-dev libgoogle-glog-dev liblmdb-dev libatlas-base-dev liblmdb-dev libblas-dev libatlas-base-dev libprotobuf-dev libleveldb-dev libsnappy-dev libhdf5-serial-dev protobuf-compiler
Reading package lists… Done
Building dependency tree
Reading state information… Done
Package protobuf-compiler is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

Package libboost-dev is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

Package libprotobuf-dev is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

Package libleveldb-dev is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package ‘libboost-dev’ has no installation candidate
E: Package ‘libprotobuf-dev’ has no installation candidate
E: Package ‘libleveldb-dev’ has no installation candidate
E: Unable to locate package libsnappy-dev
E: Package ‘protobuf-compiler’ has no installation candidate
ubuntu@tegra-ubuntu:~$ sudo apt-get install libopencv-dev
Reading package lists… Done
Building dependency tree
Reading state information… Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
libopencv-dev : Depends: libopencv-core-dev (=
Depends: libopencv-ml-dev (=
Depends: libopencv-imgproc-dev (=
Depends: libopencv-video-dev (=
Depends: libopencv-objdetect-dev (=
Depends: libopencv-highgui-dev (=
Depends: libopencv-calib3d-dev (=
Depends: libopencv-flann-dev (=
Depends: libopencv-features2d-dev (=
Depends: libopencv-legacy-dev (=
Depends: libopencv-contrib-dev (=
Depends: libopencv-ts-dev (=
Depends: libopencv-photo-dev (=
Depends: libopencv-videostab-dev (=
Depends: libopencv-stitching-dev (=
Depends: libopencv-gpu-dev (=
Depends: libopencv-superres-dev (=
Depends: libopencv-ocl-dev (= but it is not going to be installed
Depends: libopencv2.4-java (= but it is not going to be installed
Depends: libopencv2.4-jni (= but it is not going to be installed
Depends: libcv-dev (=
Depends: libhighgui-dev (=
Depends: libcvaux-dev (=
E: Unable to correct problems, you have held broken packages.
What can be done about these?

Have you run the commands below?
$ sudo add-apt-repository universe
$ sudo add-apt-repository multiverse
$ sudo apt-get update

Has anyone installed nvcaffe 0.16 on the TX1?

Should I change the CUDA_ARCH part in Makefile.config ?

CUDA_ARCH part in nvcaffe0.16

CUDA_ARCH := -gencode arch=compute_50,code=sm_50 \
             -gencode arch=compute_52,code=sm_52 \
             -gencode arch=compute_60,code=sm_60 \
             -gencode arch=compute_61,code=sm_61 \
             -gencode arch=compute_61,code=compute_61

Or is it enough to only insert “-gencode arch=compute_53,code=sm_53 \”?

Thanks for help !

Hello Jachen,

I’m wondering how you got these img/sec results:


AlexNet | batch size 1 | batch size 32 |
fp16    | 73 img/sec   | 250 img/sec   |
fp32    | 48 img/sec   | 162 img/sec   |

I have a Jetson TX1 development kit, box opened but never used. I bought it for a project, but the project was cancelled, so I was not able to use the kit. I am now selling this development kit for Rs 48000/- (negotiable). Please contact me on email id You guys can call me or mail me for recent pictures of the kit. It will be helpful if anyone needs it.