TensorFlow 2.2.0 training seems to fail on Jetson Xavier NX

I’m currently trying to run a Python/TensorFlow-based DNN training framework, written by a colleague of mine, on the Jetson Xavier NX. We’ve successfully used this framework on my workstation (Xeon-based, with an RTX 4000) to train several networks (MNIST, Food-101 and CIFAR-10, to name a few).

If I run his framework on my Jetson Xavier NX, the training of any of these networks appears to get stuck at its initial value. For CIFAR-10, for instance, the initial test accuracy is 0.098, and when running the training for several epochs the test accuracy for each epoch just oscillates around 0.100 and never improves. On the PC we see a steady improvement across the epochs (with the occasional dip in score, but nothing like this almost fixed value).
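That 0.098 starting point is essentially chance level for a 10-class problem: a CIFAR-10 model that never learns anything is expected to sit near 1/10 accuracy. A quick stdlib-only sanity check (hypothetical, not part of the framework) illustrates why an accuracy pinned at ~0.10 means the network is effectively guessing:

```python
# A 10-class classifier that never learns predicts at chance level, which
# is why a test accuracy pinned near 0.10 on CIFAR-10 means "no learning".
import random

random.seed(0)
n = 100_000
labels = [random.randrange(10) for _ in range(n)]
guesses = [random.randrange(10) for _ in range(n)]  # uniform random "model"
accuracy = sum(l == g for l, g in zip(labels, guesses)) / n
print(round(accuracy, 3))  # close to 0.10
```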

After doing some debugging on both systems, it seems that while loading the tensors from a TFRecord file, the code on the NX ends up with some tensor values of huge magnitude (like 7.99347E+35). Not all tensor values are like that; most are still around zero. If I look at the content of the same tensors loaded on the PC, I don’t see these large numbers anywhere.
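For anyone wanting to reproduce this kind of check, a minimal sketch of the scan I mean is below. The helper `find_suspect_values` is hypothetical (not part of the framework); in practice you would flatten each decoded batch (e.g. via `.numpy().ravel()`) and feed the values in. Since the source data is uint8 pixels, nothing after scaling should ever be anywhere near 1e+35:

```python
# Hypothetical sanity check for decoded batches: data that started life as
# uint8 (0-255) should never contain values like 7.99e+35 after any sane
# scaling, so flag anything non-finite or outside a generous bound.
import math

def find_suspect_values(values, bound=1e6):
    """Return (index, value) pairs that are non-finite or exceed +/-bound."""
    return [(i, v) for i, v in enumerate(values)
            if not math.isfinite(v) or abs(v) > bound]

# Example: one corrupted entry amid otherwise sane scaled pixel values.
batch = [0.0, 0.5, 1.0, 7.99347e+35, 0.25]
print(find_suspect_values(batch))  # -> [(3, 7.99347e+35)]
```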

I’ve also looked at the part of the code where these TFRecord files are created from the (scaled) input images, and the values stored there are plain uint8 numbers. I’ve tried to analyze the resulting files in a hex editor, but since I had no documentation for the internal structure of these files I couldn’t really determine anything.
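For reference, the TFRecord container itself is quite simple: each record is an 8-byte little-endian payload length, a 4-byte masked CRC32C of that length, the payload bytes, then a 4-byte masked CRC32C of the payload. A stdlib-only sketch of a record walker (checksums skipped, since computing CRC32C needs a third-party library) is enough to eyeball record sizes and raw payload bytes in a hex-editor-style investigation:

```python
# TFRecord framing: <uint64 LE length><4-byte length CRC><payload><4-byte
# payload CRC>.  This walker skips checksum verification, which is enough
# to inspect record boundaries and raw payload bytes in a suspect file.
import io
import struct

def walk_tfrecords(stream):
    """Yield the raw payload bytes of each record in a TFRecord stream."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of file
        (length,) = struct.unpack('<Q', header)
        stream.read(4)            # masked CRC32C of the length field (skipped)
        data = stream.read(length)
        stream.read(4)            # masked CRC32C of the payload (skipped)
        yield data

# Round-trip demo with dummy zero checksums; real files carry CRC32C values.
buf = io.BytesIO()
for payload in (b'record-one', b'record-two'):
    buf.write(struct.pack('<Q', len(payload)) + b'\x00' * 4
              + payload + b'\x00' * 4)
buf.seek(0)
print(list(walk_tfrecords(buf)))  # -> [b'record-one', b'record-two']
```

The payloads themselves are serialized `tf.train.Example` protocol buffers, so fully decoding them still needs the protobuf schema, but the framing alone already tells you whether the file is structurally intact.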

As a test I copied the TFRecord files over to the PC version of the framework, but that doesn’t change the PC version’s behavior: it still produces a valid trained model, so the files themselves appear to be fine.

Has anybody else run into issues with model training using Python and TensorFlow on the Jetson Xavier NX (or any of the other Jetson boards, for that matter) with the latest version of TensorFlow (TensorFlow 2.2.0+nv20.6 on JetPack 4.4 / L4T 32.4.3)?

Any help is welcome!

If anybody has succeeded in training a DNN model on the Jetson Xavier NX using TensorFlow 2.2.0, I’m also interested in hearing from you, just to see what you did differently.

— edit —

Just installed the new nv20.7 build of TensorFlow 2.2.0 to see if my issue was fixed, but no such luck: training still stays pinned at the initial value, no matter how many epochs I let the script run for.

Just installed the TensorFlow 2.3.0 build, and now training behaves as expected. I’m not sure whether the bug was in TensorFlow 2.2.x itself and got fixed in 2.3.0, or whether NVIDIA’s porting team introduced the bug in their 2.2.x port and fixed it in their port of 2.3.0.

As I never had these issues on my Windows PC with either 2.1.0 or 2.2.0, I’m going to assume it’s the latter.

Anyway, it now works on the Jetson Xavier NX, albeit only from version 2.3.0 onwards. If you need an earlier version of TensorFlow, you are out of luck.