Deep Learning Inference: Performance validation on TX1

NVIDIA published a whitepaper ( that investigates GPU performance and energy efficiency for deep learning inference.

To achieve high performance on TX1, Caffe is combined with the cuDNN 4 library, which provides a series of optimizations for inference.

cuDNN v4 is applied in Caffe to optimize inference for small batch sizes and, in particular, to improve the performance of reduced-precision floating point: FP16 arithmetic delivers up to 2x the performance of equivalent FP32 arithmetic. The steps below show how to reproduce these performance results with the current release packages on TX1.

* Env-setup:

cuDNN v4, the CUDA 7.0.73 toolkit, and the r23.1 L4T image packages are required.

  1. Flash the r23.1 release image on TX1
  2. Download the cuDNN v4 ARMv7 package from
  3. Download the CUDA 7.0.73 toolkit package from
  4. Tool-chain setup
    $ sudo add-apt-repository universe
    $ sudo apt-get update
    $ sudo apt-get install cmake git aptitude screen g++ libboost-all-dev \
      libgflags-dev libgoogle-glog-dev protobuf-compiler libprotobuf-dev \
      bc libblas-dev libatlas-dev libhdf5-dev libleveldb-dev liblmdb-dev \
      libsnappy-dev libatlas-base-dev python-numpy python-skimage \
      python-protobuf python-pandas
  5. Clone Caffe: “git clone -b experimental/fp16”
  6. Modify Makefile.config
    $ cd caffe
    $ mv Makefile.config.sample Makefile.config
    $ vim Makefile.config
    Line 5: enable “USE_CUDNN := 1”
    Line 17: enable “NATIVE_FP16 := 1”
    Line 41: insert “-gencode arch=compute_53,code=sm_53 \”
    Line 42: change “-gencode arch=compute_50,code=compute_50” to “-gencode arch=compute_53,code=compute_53”
  7. Install the CUDA toolkit
    $ sudo dpkg -i cuda-repo-l4t-r23.1-7-0-local_7.0-73_armhf.deb
    $ sudo apt-get update
    $ sudo apt-get install cuda-toolkit-7-0
    $ export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
  8. Unpack the cuDNN archive: tar zxvf
    Copy the files in the “include” and “lib” directories into /usr/local/cuda/include/ and /usr/local/cuda/lib/
  9. Compile Caffe
    $ cd caffe
    $ make
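For repeatability, the Makefile.config edits in step 6 can also be scripted with sed. This is only a sketch: it assumes the two flags ship commented out as “# USE_CUDN­N := 1” / “# NATIVE_FP16 := 1” (the actual fp16 branch may differ), and it is demonstrated here on a stub file so the commands are self-contained; on the device, run the sed command against caffe/Makefile.config instead.

```shell
# Stub standing in for caffe/Makefile.config (assumed layout, see note above):
CFG=/tmp/Makefile.config
printf '%s\n' \
  '# USE_CUDNN := 1' \
  '# NATIVE_FP16 := 1' \
  'CUDA_ARCH := -gencode arch=compute_50,code=compute_50' > "$CFG"

# Uncomment the cuDNN/FP16 flags and retarget the gencode at the TX1's sm_53
# (the sm_53 and compute_53 targets from lines 41-42 are combined onto one line,
# which nvcc accepts just as well):
sed -i \
  -e 's/^# *\(USE_CUDNN := 1\)/\1/' \
  -e 's/^# *\(NATIVE_FP16 := 1\)/\1/' \
  -e 's|-gencode arch=compute_50,code=compute_50|-gencode arch=compute_53,code=sm_53 -gencode arch=compute_53,code=compute_53|' \
  "$CFG"
```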

* Burst CPU, GPU, and EMC clocks to max

echo "Set Tegra CPUs to max freq"
echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo userspace > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo userspace > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
echo userspace > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq > /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq
cat /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq > /sys/devices/system/cpu/cpu2/cpufreq/scaling_min_freq
cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq > /sys/devices/system/cpu/cpu3/cpufreq/scaling_min_freq
echo "Disable Tegra cpuquiet and set the current governor to runnable"
echo 0 > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable
echo runnable > /sys/devices/system/cpu/cpuquiet/current_governor
echo "Set Max GPU rate"
echo 844800000 > /sys/kernel/debug/clock/override.gbus/rate
echo 1 > /sys/kernel/debug/clock/override.gbus/state
# burst EMC freq to top
echo "Set Max EMC rate"
echo 1 > /sys/kernel/debug/clock/override.emc/state
cat /sys/kernel/debug/clock/emc/max > /sys/kernel/debug/clock/override.emc/rate

* Execute the test app
caffe/build/tools/caffe_fp16 time --model=caffe/models/bvlc_alexnet/deploy.prototxt -gpu 0 -iterations 30
caffe/build/tools/caffe time --model=caffe/models/bvlc_alexnet/deploy.prototxt -gpu 0 -iterations 30

* Modify batch size

$ vim caffe/models/bvlc_alexnet/deploy.prototxt
Modify line 4 (the batch dimension) to “dim: 32” or “dim: 1”
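The vim edit can also be done non-interactively with sed. A sketch, demonstrated on a stub of the prototxt header so it is self-contained (the stub's exact layout is an assumption, not the actual file contents); on the device, point PROTOTXT at caffe/models/bvlc_alexnet/deploy.prototxt instead.

```shell
# Stub standing in for deploy.prototxt (assumed layout; line 4 holds the
# batch dimension, as in the instructions above):
PROTOTXT=/tmp/deploy.prototxt
printf '%s\n' 'name: "AlexNet"' 'input: "data"' 'input_shape {' \
  '  dim: 1' '  dim: 3' '  dim: 227' '  dim: 227' '}' > "$PROTOTXT"

# Change only the first "dim:" entry, i.e. the batch size:
sed -i '0,/dim: [0-9]*/s/dim: [0-9]*/dim: 32/' "$PROTOTXT"
```

The `0,/regexp/` address range is a GNU sed extension that ends at the first matching line, so only the batch dimension is rewritten and the channel/height/width dims are left alone.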

* Test result

AlexNet       | batch size 1       | batch size 32 |
fp16          | 73 img/sec         | 250 img/sec   |
fp32          | 48 img/sec         | 162 img/sec   |
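For reference, throughput figures like the ones above can be derived from the `caffe time` output, which reports an average forward-pass time in milliseconds: img/sec = batch_size × 1000 / avg_forward_ms. A sketch (the 128 ms value is illustrative, not a measured number from this run):

```shell
# img/sec = batch * 1000 / average forward-pass time in ms.
BATCH=32
AVG_FWD_MS=128   # illustrative value; read this from the 'caffe time' log
awk -v b="$BATCH" -v ms="$AVG_FWD_MS" \
  'BEGIN { printf "%.0f img/sec\n", b * 1000 / ms }'
# -> 250 img/sec
```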



I’m benchmarking a custom CNN model using the same environment as detailed above, including running the maxperf script. I just got the fp16 version to produce correct results and may have made a mistake somewhere, but I’m getting quite disappointing runtime performance when using fp16 tensors. About 48 FPS with fp32 as opposed to 42 FPS with fp16.

Here’s an image showing profiler timelines for fp16 and fp32 inference:

Note that the time scales of the two windows are not identical. Total inference time is about 20.4ms for fp32 and 23.4ms for fp16. The reason for fp16 running slower seems to be that multiple kernels are not launched/executed simultaneously to the same extent as when running with fp32 data.

Any ideas as to why this is happening and what I might try to fix it would be great.

Looking closer at the graphs and stats in the visual profiler, my thoughts are that the fp16 slowdown is caused by the precomputed_convolve_sgemm kernel (2nd from top in the tables) using a different launch configuration when running fp16:

fp16: Block Size = [8,32,1], Smem/Block = 10.25KiB, Regs/Thread 128
fp32: Block Size = [8,8,1], Smem/Block = 4.5KiB, Regs/Thread 56

So my guess is that the larger block size and regs/thread of the fp16 launch are preventing this kernel from running in parallel with the implicit_convolve_sgemm kernel (1st in the table), which seems to slow things down significantly.
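A back-of-the-envelope check of that guess, assuming the TX1's Maxwell SM has a 64K-entry register file and ignoring shared-memory and thread-count limits (the block sizes and regs/thread are the profiler figures above):

```shell
awk 'BEGIN {
  regfile = 65536           # registers per Maxwell SM (assumption)
  fp16 = 8 * 32 * 128       # threads/block * regs/thread = regs/block
  fp32 = 8 * 8 * 56
  printf "fp16: %d regs/block -> at most %d blocks/SM\n", fp16, int(regfile / fp16)
  printf "fp32: %d regs/block -> at most %d blocks/SM\n", fp32, int(regfile / fp32)
}'
```

If that arithmetic holds, the fp16 launch can keep only two blocks resident per SM versus eighteen for fp32, which would leave far less room for it to overlap with implicit_convolve_sgemm.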

Question is, what can be done about it?

Hello, lars:
To compare fp16 and fp32 performance, it’s better to use the same parameters.

With different models/parameters, the performance will naturally differ.


Hi ChenJian,

My fp16 and fp32 test cases are identical, apart from the floating point precision being used. The exact same CNN model and exact same sequence of CUDNN/CUBLAS API calls are being made (with the exception of cublasSgemm vs cublasHgemm for the final, linear layers).

I’m guessing you’re referring to the difference in kernel launch configuration that I’m seeing, but that is being determined by CUDNN and is, as far as I know, out of my control.

Has anyone had issues getting the same classification of an image from DIGITS and from running the classifier.bin code? I trained my AlexNet model and submitted a test image, and it came back correct. But when I downloaded the model to the TX1 and ran the classifier.bin executable there, using the same test image as in DIGITS on the host, I got the wrong answer. I have a model trained for cars vs. no cars, and in DIGITS I get 100% “no cars” for a no-car image. When I do the same test on the TX1 with the same no-car image, I get 100% “cars”. What could be the issue here?

Hi !!

I installed everything following this article.
The “example” of “caffe” is OK!! Very fast.

However, “make runtest” stopped with an error. Has anyone else hit this problem?

Like this:

Cuda number of devices: 1
Setting to use device 0
Setting to use device 0
Current device id: 0
Current device name: GM20B
Note: Randomizing tests’ orders with a seed of 36162 .
[==========] Running 2508 tests from 360 test cases.
[----------] Global test environment set-up.
[----------] 3 tests from TanHLayerTest/2, where TypeParam = caffe::CPUDevice<caffe::MultiPrecision<caffe::float16, float> >
[ RUN ] TanHLayerTest/2.TestTanH
src/caffe/test/test_tanh_layer.cpp:64: Failure
The difference between expected_value and Get(top_data[i]) is 0.00022143125534057617, which exceeds tol(precision), where
expected_value evaluates to 0.77126294374465942,
Get(top_data[i]) evaluates to 0.771484375, and
tol(precision) evaluates to 7.7126293035689741e-05.
src/caffe/test/test_tanh_layer.cpp:64: Failure
The difference between expected_value and Get(top_data[i]) is 0.00010585784912109375, which exceeds tol(precision), where

computed_gradient evaluates to 2,
estimated_gradient evaluates to 0, and
tol(threshold_) * scale evaluates to 0.0200042724609375.
debug: (top_id, top_data_id, blob_id, feat_id)=0,125,0,422; feat = 1.3046875; objective+ = 2.609375; objective- = 2.609375
[ FAILED ] SPPLayerTest/2.TestGradient, where TypeParam = caffe::CPUDevice<caffe::MultiPrecision<caffe::float16, float> > (5579 ms)
[ RUN ] SPPLayerTest/2.TestSetup
[ OK ] SPPLayerTest/2.TestSetup (0 ms)
[ RUN ] SPPLayerTest/2.TestForwardBackward
[ OK ] SPPLayerTest/2.TestForwardBackward (1 ms)
[ RUN ] SPPLayerTest/2.TestEqualOutputDims
[ OK ] SPPLayerTest/2.TestEqualOutputDims (0 ms)
[----------] 5 tests from SPPLayerTest/2 (5581 ms total)

[----------] 6 tests from CuDNNConvolutionLayerTest/2, where TypeParam = caffe::MultiPrecision<caffe::float16, float>
[ RUN ] CuDNNConvolutionLayerTest/2.TestSimpleConvolutionCuDNN
[ OK ] CuDNNConvolutionLayerTest/2.TestSimpleConvolutionCuDNN (18 ms)
[ RUN ] CuDNNConvolutionLayerTest/2.TestSimpleConvolutionGroupCuDNN
[ OK ] CuDNNConvolutionLayerTest/2.TestSimpleConvolutionGroupCuDNN (1 ms)
[ RUN ] CuDNNConvolutionLayerTest/2.TestSobelConvolutionCuDNN
F0615 12:46:51.717325 3716 cudnn_conv_layer.cpp:138] Check failed: status == CUDNN_STATUS_SUCCESS (9 vs. 0) CUDNN_STATUS_NOT_SUPPORTED
*** Check failure stack trace: ***
@ 0x432a1060 (unknown)
@ 0x432a0f5c (unknown)
@ 0x432a0b78 (unknown)
@ 0x432a2f98 (unknown)
@ 0x43b224a0 caffe::CuDNNConvolutionLayer<>::Reshape()
@ 0x13bdee caffe::Layer<>::SetUp()
@ 0x3c4350 caffe::CuDNNConvolutionLayerTest_TestSobelConvolutionCuDNN_Test<>::TestBody()
@ 0x4e5788 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x4dfd1a testing::Test::Run()
@ 0x4dfdaa testing::TestInfo::Run()
@ 0x4dfe82 testing::TestCase::Run()
@ 0x4e16da testing::internal::UnitTestImpl::RunAllTests()
@ 0x4e18cc testing::UnitTest::Run()
@ 0x13066a main
@ 0x44302670 (unknown)
make: *** [runtest] Aborted

Has anyone else seen this?


Thank you for all the detailed information. I was able to run GoogLeNet FP16 and reproduce the results presented in the whitepaper!

Now I am a little bit confused about power consumption. I am measuring power at the main supply (the full board). At idle (just logged into Linux), I get 2 W (which is quite low!). But when running the GoogLeNet benchmark I get up to 12 W (at 1 GHz). At this frequency I reach 57 FPS (roughly 5 img/s/W), whereas the whitepaper reports ~5 W => 8 img/s/W.

I tried different frequencies and always get around 4-5 img/s/W.

I checked that the fan’s consumption is negligible (so it does not explain the gap).

Can you give me some clue about how the power measurement was done? Do I need to lower the voltage settings (if that is even possible)?


Be sure you measure with any USB devices either removed or connected through an externally powered hub.

EDIT: I forgot to mention, if you are not using WiFi, disable this as well.


Thank you for your reply, but I do not use WiFi, nor do I have any device connected to the board. Moreover, I suspect those devices would contribute to my ‘idle’ measurement (the benchmark does not generate any activity on peripherals or WiFi).

I searched through this forum and the datasheet, and everything says that a ‘fully loaded’ X1 GPU can reach 10~14 W, which matches my observations.

Therefore, I guess there are special tweaks/settings required to lower power consumption and reach the paper’s numbers. Or maybe I just missed something!


Just an observation: the AlexNet performance numbers reported here are actually quite close to the ones in the whitepaper (

[this post vs the whitepaper]
48 vs 47 (fp32, batch size 1)
73 vs 67 (fp16, batch size 1)

162 vs 155 (fp32, batch size 32 vs 128)
250 vs 258 (fp16, batch size 32 vs 128)

However, for the whitepaper experiments the GPU frequency was set to 690 MHz, while for the experiments reported here it was set to the maximum (998 MHz?).

So the frequency doesn’t seem to play a big role (998 MHz vs 690 MHz), and neither does the batch size (32 vs 128)…

There are some questions:
ubuntu@tegra-ubuntu:~$ sudo apt-get install libboost-dev libboost-all-dev libgflags-dev libgoogle-glog-dev liblmdb-dev libatlas-base-dev liblmdb-dev libblas-dev libatlas-base-dev libprotobuf-dev libleveldb-dev libsnappy-dev libhdf5-serial-dev protobuf-compiler
Reading package lists… Done
Building dependency tree
Reading state information… Done
Package protobuf-compiler is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

Package libboost-dev is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

Package libprotobuf-dev is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

Package libleveldb-dev is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package ‘libboost-dev’ has no installation candidate
E: Package ‘libprotobuf-dev’ has no installation candidate
E: Package ‘libleveldb-dev’ has no installation candidate
E: Unable to locate package libsnappy-dev
E: Package ‘protobuf-compiler’ has no installation candidate
ubuntu@tegra-ubuntu:~$ sudo apt-get install libopencv-dev
Reading package lists… Done
Building dependency tree
Reading state information… Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
libopencv-dev : Depends: libopencv-core-dev (=
Depends: libopencv-ml-dev (=
Depends: libopencv-imgproc-dev (=
Depends: libopencv-video-dev (=
Depends: libopencv-objdetect-dev (=
Depends: libopencv-highgui-dev (=
Depends: libopencv-calib3d-dev (=
Depends: libopencv-flann-dev (=
Depends: libopencv-features2d-dev (=
Depends: libopencv-legacy-dev (=
Depends: libopencv-contrib-dev (=
Depends: libopencv-ts-dev (=
Depends: libopencv-photo-dev (=
Depends: libopencv-videostab-dev (=
Depends: libopencv-stitching-dev (=
Depends: libopencv-gpu-dev (=
Depends: libopencv-superres-dev (=
Depends: libopencv-ocl-dev (= but it is not going to be installed
Depends: libopencv2.4-java (= but it is not going to be installed
Depends: libopencv2.4-jni (= but it is not going to be installed
Depends: libcv-dev (=
Depends: libhighgui-dev (=
Depends: libcvaux-dev (=
E: Unable to correct problems, you have held broken packages.
What can be done about these?

Have you run the commands below?
$ sudo add-apt-repository universe
$ sudo add-apt-repository multiverse
$ sudo apt-get update

Has anyone installed nvcaffe 0.16 on the TX1?

Should I change the CUDA_ARCH part in Makefile.config ?

CUDA_ARCH part in nvcaffe0.16

CUDA_ARCH := -gencode arch=compute_50,code=sm_50 \
             -gencode arch=compute_52,code=sm_52 \
             -gencode arch=compute_60,code=sm_60 \
             -gencode arch=compute_61,code=sm_61 \
             -gencode arch=compute_61,code=compute_61

Or is it enough to only insert “-gencode arch=compute_53,code=sm_53 \”?

Thanks for help !

Hello Jachen,

I’m wondering how you got these img/sec results:


AlexNet | batch size 1 | batch size 32 |
fp16    | 73 img/sec   | 250 img/sec   |
fp32    | 48 img/sec   | 162 img/sec   |

I have a Jetson TX1 development kit, box opened but never used. I bought it for a project, but the project was cancelled, so I was not able to use the kit. I am now selling this development kit for Rs 48000/- (negotiable). Please contact me on email id You guys can call me or mail me for recent pictures of the kit. It will be helpful if anyone needs it.