Deep Learning Inference: Performance validation on TX1

NVIDIA published a whitepaper ( that investigates GPU performance and energy efficiency for deep learning inference.

To achieve high performance on TX1, Caffe is combined with the cuDNN 4 library, which provides a series of optimizations for inference.

cuDNN v4 is applied in Caffe to optimize inference for small batch sizes and, in particular, to improve the performance of reduced-precision floating point: FP16 arithmetic delivers up to 2x the performance of equivalent FP32 arithmetic. The steps below show how to reproduce these performance results with the current release packages on TX1.

* Env-setup:

cuDNN v4, the CUDA 7.0.73 toolkit, and the r23.1 L4T image packages are required.

  1. Flash the r23.1 release image on TX1
  2. Download the cuDNN v4 ARMv7 package from
  3. Download the CUDA 7.0.73 toolkit package from
  4. Tool-chain setup
    $ sudo add-apt-repository universe
    $ sudo apt-get update
    $ sudo apt-get install cmake git aptitude screen g++ libboost-all-dev \
      libgflags-dev libgoogle-glog-dev protobuf-compiler libprotobuf-dev \
      bc libblas-dev libatlas-dev libhdf5-dev libleveldb-dev liblmdb-dev \
      libsnappy-dev libatlas-base-dev python-numpy python-skimage \
      python-protobuf python-pandas
  5. Clone Caffe: “git clone -b experimental/fp16”
  6. Modify Makefile.config
    $ cd caffe
    $ mv Makefile.config.sample Makefile.config
    $ vim Makefile.config
    Line 5: enable “USE_CUDNN := 1”
    Line 17: enable “NATIVE_FP16 := 1”
    Line 41: insert “-gencode arch=compute_53,code=sm_53 \”
    Line 42: change “-gencode arch=compute_50,code=compute_50” to “-gencode arch=compute_53,code=compute_53”
  7. Install the CUDA toolkit
    $ sudo dpkg -i cuda-repo-l4t-r23.1-7-0-local_7.0-73_armhf.deb
    $ sudo apt-get update
    $ sudo apt-get install cuda-toolkit-7-0
    $ export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
  8. Unpack the cuDNN archive: tar zxvf
    Copy the files in the “include” and “lib” directories into /usr/local/cuda/include/ and /usr/local/cuda/lib/
  9. Compile Caffe
    $ cd caffe
    $ make
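For repeatability, the Makefile.config edits in step 6 can also be scripted with sed. This is only a sketch: it assumes the two flags ship commented out as “# USE_CUDN­N := 1” / “# NATIVE_FP16 := 1” (the actual fp16 branch may differ), and it is demonstrated here on a stub file so the commands are self-contained; on the device, run the sed command against caffe/Makefile.config instead.

```shell
# Stub standing in for caffe/Makefile.config (assumed layout, see note above):
CFG=/tmp/Makefile.config
printf '%s\n' \
  '# USE_CUDNN := 1' \
  '# NATIVE_FP16 := 1' \
  'CUDA_ARCH := -gencode arch=compute_50,code=compute_50' > "$CFG"

# Uncomment the cuDNN/FP16 flags and retarget the gencode at the TX1's sm_53
# (the sm_53 and compute_53 targets from lines 41-42 are combined onto one line,
# which nvcc accepts just as well):
sed -i \
  -e 's/^# *\(USE_CUDNN := 1\)/\1/' \
  -e 's/^# *\(NATIVE_FP16 := 1\)/\1/' \
  -e 's|-gencode arch=compute_50,code=compute_50|-gencode arch=compute_53,code=sm_53 -gencode arch=compute_53,code=compute_53|' \
  "$CFG"
```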

* Burst CPU, GPU, and EMC clocks to max

echo "Set Tegra CPUs to max freq"
echo userspace > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo userspace > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo userspace > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
echo userspace > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq > /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq
cat /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq > /sys/devices/system/cpu/cpu2/cpufreq/scaling_min_freq
cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq > /sys/devices/system/cpu/cpu3/cpufreq/scaling_min_freq
echo "Disable Tegra cpuquiet and set the current governor to runnable"
echo 0 > /sys/devices/system/cpu/cpuquiet/tegra_cpuquiet/enable
echo runnable > /sys/devices/system/cpu/cpuquiet/current_governor
echo "Set Max GPU rate"
echo 844800000 > /sys/kernel/debug/clock/override.gbus/rate
echo 1 > /sys/kernel/debug/clock/override.gbus/state
# burst EMC freq to top
echo "Set Max EMC rate"
echo 1 > /sys/kernel/debug/clock/override.emc/state
cat /sys/kernel/debug/clock/emc/max > /sys/kernel/debug/clock/override.emc/rate

* Execute the test app
caffe/build/tools/caffe_fp16 time --model=caffe/models/bvlc_alexnet/deploy.prototxt -gpu 0 -iterations 30
caffe/build/tools/caffe time --model=caffe/models/bvlc_alexnet/deploy.prototxt -gpu 0 -iterations 30

* Modify batch size

$ vim caffe/models/bvlc_alexnet/deploy.prototxt
Modify line 4 (the batch dimension) to “dim: 32” or “dim: 1”
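The vim edit can also be done non-interactively with sed. A sketch, demonstrated on a stub of the prototxt header so it is self-contained (the stub's exact layout is an assumption, not the actual file contents); on the device, point PROTOTXT at caffe/models/bvlc_alexnet/deploy.prototxt instead.

```shell
# Stub standing in for deploy.prototxt (assumed layout; line 4 holds the
# batch dimension, as in the instructions above):
PROTOTXT=/tmp/deploy.prototxt
printf '%s\n' 'name: "AlexNet"' 'input: "data"' 'input_shape {' \
  '  dim: 1' '  dim: 3' '  dim: 227' '  dim: 227' '}' > "$PROTOTXT"

# Change only the first "dim:" entry, i.e. the batch size:
sed -i '0,/dim: [0-9]*/s/dim: [0-9]*/dim: 32/' "$PROTOTXT"
```

The `0,/regexp/` address range is a GNU sed extension that ends at the first matching line, so only the batch dimension is rewritten and the channel/height/width dims are left alone.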

* Test result

AlexNet       | batch size 1       | batch size 32 |
fp16          | 73 img/sec         | 250 img/sec   |
fp32          | 48 img/sec         | 162 img/sec   |
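For reference, throughput figures like the ones above can be derived from the `caffe time` output, which reports an average forward-pass time in milliseconds: img/sec = batch_size × 1000 / avg_forward_ms. A sketch (the 128 ms value is illustrative, not a measured number from this run):

```shell
# img/sec = batch * 1000 / average forward-pass time in ms.
BATCH=32
AVG_FWD_MS=128   # illustrative value; read this from the 'caffe time' log
awk -v b="$BATCH" -v ms="$AVG_FWD_MS" \
  'BEGIN { printf "%.0f img/sec\n", b * 1000 / ms }'
# -> 250 img/sec
```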



I’m benchmarking a custom CNN model using the same environment as detailed above, including running the maxperf script. I just got the fp16 version to produce correct results and may have made a mistake somewhere, but I’m getting quite disappointing runtime performance when using fp16 tensors. About 48 FPS with fp32 as opposed to 42 FPS with fp16.

Here’s an image showing profiler timelines for fp16 and fp32 inference:

Note that the time scales of the two windows are not identical. Total inference time is about 20.4ms for fp32 and 23.4ms for fp16. The reason for fp16 running slower seems to be that multiple kernels are not launched/executed simultaneously to the same extent as when running with fp32 data.

Any ideas as to why this is happening and what I might try to fix it would be great.

Looking closer at the graphs and stats in the visual profiler, my thoughts are that the fp16 slowdown is caused by the precomputed_convolve_sgemm kernel (2nd from top in the tables) using a different launch configuration when running fp16:

fp16: Block Size = [8,32,1], Smem/Block = 10.25KiB, Regs/Thread 128
fp32: Block Size = [8,8,1], Smem/Block = 4.5KiB, Regs/Thread 56

So my guess is that the larger block size and regs/thread of the fp16 launch are preventing this kernel from running in parallel with the implicit_convolve_sgemm kernel (1st in the table), which seems to slow things down significantly.
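A back-of-the-envelope check of that guess, assuming the TX1's Maxwell SM has a 64K-entry register file and ignoring shared-memory and thread-count limits (the block sizes and regs/thread are the profiler figures above):

```shell
awk 'BEGIN {
  regfile = 65536           # registers per Maxwell SM (assumption)
  fp16 = 8 * 32 * 128       # threads/block * regs/thread = regs/block
  fp32 = 8 * 8 * 56
  printf "fp16: %d regs/block -> at most %d blocks/SM\n", fp16, int(regfile / fp16)
  printf "fp32: %d regs/block -> at most %d blocks/SM\n", fp32, int(regfile / fp32)
}'
```

If that arithmetic holds, the fp16 launch can keep only two blocks resident per SM versus eighteen for fp32, which would leave far less room for it to overlap with implicit_convolve_sgemm.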

Question is, what can be done about it?

Hello, lars:
To compare fp16 and fp32 performance, it’s better to use the same parameters.

With different models/parameters, the performance will naturally differ.


Hi ChenJian,

My fp16 and fp32 test cases are identical, apart from the floating point precision being used. The exact same CNN model and exact same sequence of CUDNN/CUBLAS API calls are being made (with the exception of cublasSgemm vs cublasHgemm for the final, linear layers).

I’m guessing you’re referring to the difference in kernel launch configuration that I’m seeing, but that is being determined by CUDNN and is, as far as I know, out of my control.

Has anyone had issues getting the same classification of an image from DIGITS and from running the classifier.bin code? I trained my AlexNet model and submitted a test image, and it came back correct. But when I downloaded the model to the TX1 and ran the classifier.bin executable there, using the same test image as in DIGITS on the host, I got the wrong answer. I have a model trained for cars vs. no cars, and in DIGITS I get 100% “no cars” for a no-car image. When I do the same test on the TX1 with the same no-car image, I get 100% “cars”. What could be the issue here?

Hi !!

I installed everything following this article.
The “example” of “caffe” is OK!! Very fast.

However, “make runtest” stopped with an error. Has anyone else hit this problem?

Like this:

Cuda number of devices: 1
Setting to use device 0
Setting to use device 0
Current device id: 0
Current device name: GM20B
Note: Randomizing tests’ orders with a seed of 36162 .
[==========] Running 2508 tests from 360 test cases.
[----------] Global test environment set-up.
[----------] 3 tests from TanHLayerTest/2, where TypeParam = caffe::CPUDevice<caffe::MultiPrecision<caffe::float16, float> >
[ RUN ] TanHLayerTest/2.TestTanH
src/caffe/test/test_tanh_layer.cpp:64: Failure
The difference between expected_value and Get(top_data[i]) is 0.00022143125534057617, which exceeds tol(precision), where
expected_value evaluates to 0.77126294374465942,
Get(top_data[i]) evaluates to 0.771484375, and
tol(precision) evaluates to 7.7126293035689741e-05.
src/caffe/test/test_tanh_layer.cpp:64: Failure
The difference between expected_value and Get(top_data[i]) is 0.00010585784912109375, which exceeds tol(precision), where

computed_gradient evaluates to 2,
estimated_gradient evaluates to 0, and
tol(threshold_) * scale evaluates to 0.0200042724609375.
debug: (top_id, top_data_id, blob_id, feat_id)=0,125,0,422; feat = 1.3046875; objective+ = 2.609375; objective- = 2.609375
[ FAILED ] SPPLayerTest/2.TestGradient, where TypeParam = caffe::CPUDevice<caffe::MultiPrecision<caffe::float16, float> > (5579 ms)
[ RUN ] SPPLayerTest/2.TestSetup
[ OK ] SPPLayerTest/2.TestSetup (0 ms)
[ RUN ] SPPLayerTest/2.TestForwardBackward
[ OK ] SPPLayerTest/2.TestForwardBackward (1 ms)
[ RUN ] SPPLayerTest/2.TestEqualOutputDims
[ OK ] SPPLayerTest/2.TestEqualOutputDims (0 ms)
[----------] 5 tests from SPPLayerTest/2 (5581 ms total)

[----------] 6 tests from CuDNNConvolutionLayerTest/2, where TypeParam = caffe::MultiPrecision<caffe::float16, float>
[ RUN ] CuDNNConvolutionLayerTest/2.TestSimpleConvolutionCuDNN
[ OK ] CuDNNConvolutionLayerTest/2.TestSimpleConvolutionCuDNN (18 ms)
[ RUN ] CuDNNConvolutionLayerTest/2.TestSimpleConvolutionGroupCuDNN
[ OK ] CuDNNConvolutionLayerTest/2.TestSimpleConvolutionGroupCuDNN (1 ms)
[ RUN ] CuDNNConvolutionLayerTest/2.TestSobelConvolutionCuDNN
F0615 12:46:51.717325 3716 cudnn_conv_layer.cpp:138] Check failed: status == CUDNN_STATUS_SUCCESS (9 vs. 0) CUDNN_STATUS_NOT_SUPPORTED
*** Check failure stack trace: ***
@ 0x432a1060 (unknown)
@ 0x432a0f5c (unknown)
@ 0x432a0b78 (unknown)
@ 0x432a2f98 (unknown)
@ 0x43b224a0 caffe::CuDNNConvolutionLayer<>::Reshape()
@ 0x13bdee caffe::Layer<>::SetUp()
@ 0x3c4350 caffe::CuDNNConvolutionLayerTest_TestSobelConvolutionCuDNN_Test<>::TestBody()
@ 0x4e5788 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x4dfd1a testing::Test::Run()
@ 0x4dfdaa testing::TestInfo::Run()
@ 0x4dfe82 testing::TestCase::Run()
@ 0x4e16da testing::internal::UnitTestImpl::RunAllTests()
@ 0x4e18cc testing::UnitTest::Run()
@ 0x13066a main
@ 0x44302670 (unknown)
make: *** [runtest] Aborted

Has anyone else seen this?


Thank you for all the detailed information. I was able to run GoogLeNet FP16 and reproduce the results presented in the whitepaper!

Now I am a little bit confused about power consumption. I am measuring power at the main supply (the full board). At idle (just logged into Linux), I get 2 W (which is quite low!). But when running the GoogLeNet benchmark I get up to 12 W (at 1 GHz). At this frequency I reach 57 FPS (roughly 5 img/s/W), whereas the whitepaper reports ~5 W => 8 img/s/W.

I tried different frequencies and always get around 4-5 img/s/W.

I checked that the fan’s consumption is negligible (so it does not explain the gap).

Can you give me some clue about how the power measurement was done? Do I need to lower the voltage settings (if that is even possible)?


Be sure you measure with any USB devices either removed or connected through an externally powered hub.

EDIT: I forgot to mention, if you are not using WiFi, disable this as well.


Thank you for your reply, but I do not use WiFi, nor do I have any device connected to the board. Moreover, I suspect those devices would contribute to my ‘idle’ measurement (the benchmark does not generate any activity on peripherals or WiFi).

I searched through this forum and the datasheet, and everything says that a ‘fully loaded’ X1 GPU can reach 10~14 W, which matches my observations.

Therefore, I guess there are special tweaks/settings required to lower power consumption and reach the paper’s numbers. Or maybe I just missed something!


Just an observation: the AlexNet performance numbers reported here are actually quite close to the ones in the whitepaper (

[this post vs the whitepaper]
48 vs 47 (fp32, batch size 1)
73 vs 67 (fp16, batch size 1)

162 vs 155 (fp32, batch size 32 vs 128)
250 vs 258 (fp16, batch size 32 vs 128)

However, for the whitepaper experiments the GPU frequency was set to 690 MHz, while for the experiments reported here it was set to the maximum (998 MHz?).

So the frequency doesn’t seem to play a big role (998 MHz vs 690 MHz), and neither does the batch size (32 vs 128)…

There are some questions:
ubuntu@tegra-ubuntu:~$ sudo apt-get install libboost-dev libboost-all-dev libgflags-dev libgoogle-glog-dev liblmdb-dev libatlas-base-dev liblmdb-dev libblas-dev libatlas-base-dev libprotobuf-dev libleveldb-dev libsnappy-dev libhdf5-serial-dev protobuf-compiler
Reading package lists… Done
Building dependency tree
Reading state information… Done
Package protobuf-compiler is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

Package libboost-dev is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

Package libprotobuf-dev is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

Package libleveldb-dev is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package ‘libboost-dev’ has no installation candidate
E: Package ‘libprotobuf-dev’ has no installation candidate
E: Package ‘libleveldb-dev’ has no installation candidate
E: Unable to locate package libsnappy-dev
E: Package ‘protobuf-compiler’ has no installation candidate
ubuntu@tegra-ubuntu:~$ sudo apt-get install libopencv-dev
Reading package lists… Done
Building dependency tree
Reading state information… Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
libopencv-dev : Depends: libopencv-core-dev (=
Depends: libopencv-ml-dev (=
Depends: libopencv-imgproc-dev (=
Depends: libopencv-video-dev (=
Depends: libopencv-objdetect-dev (=
Depends: libopencv-highgui-dev (=
Depends: libopencv-calib3d-dev (=
Depends: libopencv-flann-dev (=
Depends: libopencv-features2d-dev (=
Depends: libopencv-legacy-dev (=
Depends: libopencv-contrib-dev (=
Depends: libopencv-ts-dev (=
Depends: libopencv-photo-dev (=
Depends: libopencv-videostab-dev (=
Depends: libopencv-stitching-dev (=
Depends: libopencv-gpu-dev (=
Depends: libopencv-superres-dev (=
Depends: libopencv-ocl-dev (= but it is not going to be installed
Depends: libopencv2.4-java (= but it is not going to be installed
Depends: libopencv2.4-jni (= but it is not going to be installed
Depends: libcv-dev (=
Depends: libhighgui-dev (=
Depends: libcvaux-dev (=
E: Unable to correct problems, you have held broken packages.
What can be done about these?

Have you run the commands below?
$ sudo add-apt-repository universe
$ sudo add-apt-repository multiverse
$ sudo apt-get update

Has anyone installed nvcaffe 0.16 on the TX1?

Should I change the CUDA_ARCH part in Makefile.config ?

CUDA_ARCH part in nvcaffe0.16

CUDA_ARCH := -gencode arch=compute_50,code=sm_50 \
             -gencode arch=compute_52,code=sm_52 \
             -gencode arch=compute_60,code=sm_60 \
             -gencode arch=compute_61,code=sm_61 \
             -gencode arch=compute_61,code=compute_61

Or is it enough to only insert “-gencode arch=compute_53,code=sm_53 \”?

Thanks for help !

Hello Jachen,

I’m wondering how you got these img/sec results:


AlexNet | batch size 1 | batch size 32 |
fp16    | 73 img/sec   | 250 img/sec   |
fp32    | 48 img/sec   | 162 img/sec   |

I have a Jetson TX1 development kit, box opened but never used. I bought it for a project, but the project was cancelled, so I was not able to use the kit. I am now selling this development kit for Rs 48000/- (negotiable). Please contact me on email id You guys can call me or mail me for recent pictures of the kit. It will be helpful if anyone needs it.