Dlib testing on Jetson Nano, TX2 and Xavier

I know there are many out there using these platforms that use Davis King’s Dlib software.

Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software.

http://dlib.net/

I have used it for some time now
on a number of architectures, including x86 and various ARM-based hardware.
Indeed, I have been using it regularly on a number of the Jetson development environments,
in particular the Jetson Nano, TX2 and Xavier,
with some success and without any problems that I was aware of.

During some recent testing, I was exercising the test suite that Davis provides with dlib,
to test the various units that make up the library.

The test suite performs flawlessly on all the x86 boxes with various NVIDIA hardware on board and CUDA enabled.

However, one of the tests fails identically on all of
the Jetson machines with cuDNN installed and enabled.

If you run the DNN tests with:
./dtest -d -l all --test_dnn

the run will fail on all of the Jetson machines using CUDA.
The particular failure shows up as a gradient_error of 4.90299e+28.
Obviously a badly out-of-range error!

The test suite seems well designed and implemented.

If you build dlib without CUDA enabled, i.e. with DLIB_USE_CUDA=0, the tests will pass.
So the software implementation of the DNN works as it is supposed to, and as it does on the x86 boxes.
Albeit slower!
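
For anyone who also uses the Python bindings, a quick way to confirm which kind of build is actually being loaded is sketched below. This is a minimal sketch, assuming the dlib Python module was built from the same source tree. The attribute dlib.DLIB_USE_CUDA and the function dlib.cuda.get_num_devices() are what I believe recent dlib Python bindings expose, so they are looked up defensively with getattr; dlib_cuda_status is a made-up helper name.

```python
def dlib_cuda_status():
    """Return (use_cuda, num_devices), or None if dlib cannot be imported.

    Assumption: the Python bindings expose DLIB_USE_CUDA and a cuda
    submodule with get_num_devices(); both are looked up with getattr so
    the function still runs if a given build lacks them.
    """
    try:
        import dlib
    except ImportError:
        return None

    use_cuda = bool(getattr(dlib, "DLIB_USE_CUDA", False))
    num_devices = 0
    cuda_module = getattr(dlib, "cuda", None)
    if use_cuda and cuda_module is not None:
        num_devices = int(cuda_module.get_num_devices())
    return use_cuda, num_devices


status = dlib_cuda_status()
if status is None:
    print("dlib Python bindings are not installed")
else:
    print("DLIB_USE_CUDA=%s, CUDA devices=%d" % status)
```

If this reports that CUDA is off, the build is using the software DNN path, which is the configuration where the tests pass.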

Here is the end of the x86 execution using CUDA:

57469 INFO  [0] test.dnn: slope_error: 0.000217438
57469 INFO  [0] test.dnn: intercept_error: 0.00847244
62949 INFO  [0] test.dnn: rs.mean(): 0.0057435
62949 INFO  [0] test.dnn: rs.stddev(): 0.00305919
62949 INFO  [0] test.dnn: rs.max(): 0.00976033
74753 INFO  [0] test.main: Testing Finished
74753 INFO  [0] test.main: Total number of individual testing statements executed: 563439
74753 INFO  [0] test.main: All tests completed successfully

Here is the end of the Jetson Nano execution using CUDA:

7698 ERROR [0] test.main: Failure message from test: 

Error occurred at line 933.
Error occurred in file /h/rfg/w/dlib/dlib/test/dnn.cpp.
Failing expression was max(abs(mat(data_gradient1)-mat(data_gradient2))) < 1e-3.

 7698 INFO  [0] test.main: Testing Finished
 7698 INFO  [0] test.main: Total number of individual testing statements executed: 473
 7698 WARN  [0] test.main: Number of failed tests: 1
 7698 WARN  [0] test.main: Number of passed tests: 0

Here is the end of the Jetson TX2 execution using CUDA:

8059 ERROR [0] test.main: Failure message from test: 
Error occurred at line 933.
Error occurred in file /x/rfg/tx2/w/dlib/dlib/test/dnn.cpp.
Failing expression was max(abs(mat(data_gradient1)-mat(data_gradient2))) < 1e-3.

 8060 INFO  [0] test.main: Testing Finished
 8060 INFO  [0] test.main: Total number of individual testing statements executed: 473
 8060 WARN  [0] test.main: Number of failed tests: 1
 8060 WARN  [0] test.main: Number of passed tests: 0
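
Both failing logs report the same expression, max(abs(mat(data_gradient1)-mat(data_gradient2))) < 1e-3, which is a tolerance check on two gradient tensors. The following is only a rough Python illustration of that kind of check, not the dlib test itself: the function names and sample values are made up, and plain lists stand in for the tensors.

```python
import math


def gradients_agree(grad_a, grad_b, tol=1e-3):
    """Return True if every element of the two gradients differs by less than tol.

    Non-finite differences (NaN or infinity) are treated as a failure, since
    Python's max() does not reliably propagate NaN.
    """
    if len(grad_a) != len(grad_b) or not grad_a:
        raise ValueError("gradients must be non-empty and of equal length")
    diffs = [abs(a - b) for a, b in zip(grad_a, grad_b)]
    if not all(math.isfinite(d) for d in diffs):
        return False
    return max(diffs) < tol


# Made-up values for illustration only; they are not taken from the dlib test.
reference = [0.10, -0.25, 0.40]
close = [0.1002, -0.2501, 0.4003]
blown_up = [0.10, 4.9e28, 0.40]

print(gradients_agree(reference, close))     # True
print(gradients_agree(reference, blown_up))  # False
```

A single wildly wrong element is enough to fail the check, which is consistent with a failure that shows up as one enormous error value rather than a small drift.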

And here is the end of the Jetson TX2 execution NOT using CUDA:

212068 INFO  [0] test.dnn: slope_error: 9.53674e-05
212068 INFO  [0] test.dnn: intercept_error: 0.00631332
220693 INFO  [0] test.dnn: rs.mean(): 0.00574357
220693 INFO  [0] test.dnn: rs.stddev(): 0.00305946
220693 INFO  [0] test.dnn: rs.max(): 0.00976036
469043 INFO  [0] test.main: Testing Finished
469043 INFO  [0] test.main: Total number of individual testing statements executed: 516379
469043 INFO  [0] test.main: All tests completed successfully

So it seems that there might be something amiss with the cuDNN implementation on
the Jetson hardware.
The x86 builds consistently pass with CUDA enabled and cuDNN installed.
Likewise, the builds on the Jetson Nano, TX2 and Xavier consistently fail in the same way
with CUDA enabled, which is the default.
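
When comparing runs across several machines, it can help to reduce each log tail to a short summary. The sketch below parses the test.main summary lines in the format shown in the logs above. The function name, regular expression and dictionary keys are my own choices, and the two sample strings are copied from the x86 and Jetson Nano output in this post.

```python
import re

# Matches lines such as: "74753 INFO  [0] test.main: Testing Finished"
LINE_RE = re.compile(r"^\s*(\d+)\s+(\w+)\s+\[\d+\]\s+([\w.]+):\s*(.*)$")


def summarize_dtest_log(log_text):
    """Extract the pass/fail summary that test.main prints at the end of a run."""
    summary = {"statements": None, "failed": None, "passed": None, "success": False}
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match is None or match.group(3) != "test.main":
            continue
        message = match.group(4).strip()
        if message.startswith("Total number of individual testing statements executed:"):
            summary["statements"] = int(message.rsplit(":", 1)[1])
        elif message.startswith("Number of failed tests:"):
            summary["failed"] = int(message.rsplit(":", 1)[1])
        elif message.startswith("Number of passed tests:"):
            summary["passed"] = int(message.rsplit(":", 1)[1])
        elif message == "All tests completed successfully":
            summary["success"] = True
    return summary


x86_tail = """74753 INFO  [0] test.main: Testing Finished
74753 INFO  [0] test.main: Total number of individual testing statements executed: 563439
74753 INFO  [0] test.main: All tests completed successfully
"""

nano_tail = """ 7698 INFO  [0] test.main: Testing Finished
 7698 INFO  [0] test.main: Total number of individual testing statements executed: 473
 7698 WARN  [0] test.main: Number of failed tests: 1
 7698 WARN  [0] test.main: Number of passed tests: 0
"""

print(summarize_dtest_log(x86_tail))
print(summarize_dtest_log(nano_tail))
```

The difference in the number of statements executed (563439 against 473) also shows how early the Jetson run stops.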

Davis’ website and git repository give excellent instructions on building, installing, and testing
the software.

As mentioned earlier, I have been using dlib without any problems that I was aware of in my applications; it is just that the test suite reports these errors.
Plus, it is the DNN test that is failing, something that my applications just happen to be using!

Regards,

Ross
bald_guys_errors.txt (47.9 KB)
bald_guys_noerrors.txt (63.3 KB)

Hi GrunPferd,

We’re unfamiliar with DLib and not sure why the test that you mentioned would fail. Maybe other developers can help by sharing their experience.

Thanks kayccc for your interest,

While YOU might be unfamiliar with dlib, a simple search of dlib on these forums does yield several issues that this library has shown in the past.

In particular a fellow moderator from these forums has indicated that a particular issue was to be addressed with a future release of the cuDNN library for the Jetson platforms.

see:
https://devtalk.nvidia.com/default/topic/1049660/jetson-nano/issues-with-dlib-library/3
especially replies by AastaLLL in messages #39 #42.

The solution then was to apply a patch to the source and to build locally, making sure there were no other versions of the library installed.

As far as I am aware, this only applied to the Jetson Nano.
Indeed, the patch was applied to the dlib library as a workaround for a problem that appears to be related to cuDNN.

However, this appears to be a different and separate problem to the one mentioned here. Even with this patch applied there is a problem, yet if the software implementation is used it works as expected. It is only the CUDA version on Jetson platforms that fails. The CUDA version on the x86 platforms works as expected.

Further there was talk of an update to the Jetson machines of another version of the cuDNN library to at least 7.6.1. As far as I can tell, that has not been released yet, although I can see cuDNN 7.6.5 does appear to be available for the x86 machines.
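
To check which cuDNN version a given machine actually has, one option is to read the version macros from the cuDNN header. This is a sketch under assumptions: the macros CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL are the ones I know from cuDNN 7 headers, while the default path /usr/include/cudnn.h is only a guess and may differ per install. The sample header text is made up for the demonstration.

```python
import re


def cudnn_version_from_header(header_text):
    """Return (major, minor, patch) parsed from cuDNN header text, or None."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"^\s*#\s*define\s+%s\s+(\d+)" % name,
                          header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


def read_cudnn_version(path="/usr/include/cudnn.h"):
    """Read the version from a header file; the default path is an assumption."""
    try:
        with open(path, "r", errors="replace") as handle:
            return cudnn_version_from_header(handle.read())
    except OSError:
        return None


# Made-up header fragment, used only to show the parsing.
sample_header = (
    "#define CUDNN_MAJOR 7\n"
    "#define CUDNN_MINOR 6\n"
    "#define CUDNN_PATCHLEVEL 5\n"
)

sample_version = cudnn_version_from_header(sample_header)
print(sample_version)               # (7, 6, 5)
print(sample_version >= (7, 6, 1))  # True
print(read_cudnn_version())         # None if no header is found at the assumed path
```

Comparing version tuples this way makes the "at least 7.6.1" condition easy to test on each machine.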

As mentioned in the earlier posting,
the dlib library is well documented and easy to install.
The unit test suite is also well documented and likewise easy to build and run,
so it is relatively easy to reproduce the above error.

I have been using both dlib 19.18 and the dlib git master head for testing across the Jetson machines
and also the x86 machines used with NVIDIA hardware.

If there is any beta software available for the Jetson machines, I would be willing to give it a test with the dlib library.

Regards,

Ross

Hi Ross,

We have the JetPack 4.3 Developer Preview release as an early look at two JetPack 4.3 components, TensorRT 6.0.1 and cuDNN 7.6.3, but only the Jetson AGX Xavier Developer Kit is supported.
However, you may try some experiments as in a previous case. Please refer to the following to see if you can get it working on the Jetson TX2:
https://devtalk.nvidia.com/default/topic/1042942/jetson-tx2/jetpack-4-1/post/5290610/#5290610

G’day kayccc,

I grabbed a copy of the JetPack 4.3 Developer Preview.

I installed the newer libcudnn and TensorRT libraries.

The first system I had available for testing was a Jetson Nano development board.
I rebuilt dlib, just in case any of the headers had changed.

I then ran through the tests as above.

Happy to report that the problem test produces the following output:

411930 INFO [0] test.dnn: slope_error: 0.000217438
411930 INFO [0] test.dnn: intercept_error: 0.00847244
465378 INFO [0] test.dnn: rs.mean(): 0.0057435
465379 INFO [0] test.dnn: rs.stddev(): 0.00305919
465379 INFO [0] test.dnn: rs.max(): 0.00976033
550765 INFO [0] test.main: Testing Finished
550765 INFO [0] test.main: Total number of individual testing statements executed: 563439
550765 INFO [0] test.main: All tests completed successfully

So it appears that cuDNN 7.6.3 does indeed fix this test case.

It would be good if this was more widely available.

I have also started running the Python test examples.

I am getting reports of NaN values returned in some of the vectors.
So there may still be some issues here.
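
A small helper along these lines can flag which entries of a returned vector are NaN. It is a minimal sketch, assuming the descriptor returned by the example (compute_face_descriptor in face_recognition.py) can be iterated and each entry converted to a float; nan_positions is a made-up name and the sample vectors are illustrative, not real descriptors.

```python
import math


def nan_positions(descriptor):
    """Return the indices of NaN entries in an iterable of numbers."""
    values = [float(v) for v in descriptor]
    return [index for index, value in enumerate(values) if math.isnan(value)]


# Illustrative vectors only; real face descriptors are much longer.
good_vector = [0.1, -0.05, 0.2]
bad_vector = [0.1, float("nan"), 0.2, float("nan")]

print(nan_positions(good_vector))  # []
print(nan_positions(bad_vector))   # [1, 3]
```

Running the same check on the x86 and Jetson output makes it easy to see whether the NaN values are isolated entries or whole vectors.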

I have attached two files: one shows the results produced on the Jetson, with the errors,
and the other shows the results produced on the x86 machine with NVIDIA CUDA hardware, without the errors.

Here is the test case I am using,
run from within the python_examples directory:

python3 ./face_recognition.py …/examples/shape_predictor_5_face_landmarks.dat …/examples/dlib_face_recognition_resnet_model_v1.dat …/examples/faces_test

where the faces_test directory is a copy of the examples/faces directory but with the one test image:
bald_guys.jpg

Regards,

Ross

Hi,

JetPack 4.3 DP is only available for the Jetson Xavier right now.
Would you mind testing this on the Xavier, or waiting for our official release for the Nano?

It might be workable to use it to re-flash other platforms.
But it’s challenging for us to narrow down the issue for a non-official environment.

By the way, it’s recommended to set up the whole system with the same JetPack version.
If you are installing some packages from JetPack 4.3 DP, please also reflash the system.

There might be hidden dependencies that cause unexpected launch failures.

Thanks.