I’m currently trying to use a Python/TensorFlow-based DNN training framework, created by a colleague of mine, on the Jetson Xavier NX. We’ve successfully used this framework on my workstation (Xeon-based with an RTX 4000) to train several networks (MNIST, Food-101, CIFAR-10, to name a few).
If I run his framework on my Jetson Xavier NX, the training of any of these networks appears to get stuck at its initial value. For instance, for CIFAR-10 the initial test accuracy is 0.098, and over several epochs of training it just oscillates around 0.1000 (chance level for ten classes) and never improves. On the PC we see a steady improvement across epochs (with the occasional dip in score, but nothing like this essentially fixed value).
After some debugging on both systems, it seems that when the tensors are loaded from a tfrecords file, the NX code ends up with values with huge exponents in those tensors (e.g. 7.99347E+35). Not all tensor values are like that; most are still near zero. If I inspect the content of the same loaded tensors on the PC, I don’t see these large numbers anywhere.
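In case anyone wants to reproduce the check, this is roughly the sanity scan I run on a decoded batch. It’s a sketch using NumPy on an array pulled out of the input pipeline; the `limit` threshold is my own choice, not anything from the framework:

```python
import numpy as np

def find_outliers(batch, limit=1e6):
    """Return indices of values that are non-finite or absurdly large.

    For image data that was stored as UINT8, decoded values should stay
    in [0, 255] (or [0, 1] after normalization), so anything with a
    magnitude near 1e35 indicates corruption, not valid pixel data.
    """
    batch = np.asarray(batch)
    return np.argwhere(~np.isfinite(batch) | (np.abs(batch) > limit))

# Example: one corrupted value among otherwise sane pixel data
sane = np.zeros((2, 4, 4), dtype=np.float32)
sane[1, 2, 3] = 7.99347e35  # the kind of value seen on the NX
print(find_outliers(sane))  # -> [[1 2 3]]
```

On the NX this kind of scan flags scattered entries in the decoded batches; on the PC it returns nothing.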
I’ve also looked at the part of the code where these tfrecord files are created from the input images (after scaling), and the values stored are UINT8. I tried to analyze the resulting files in a hex editor, but without documentation of their internal structure I couldn’t really determine anything.
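For reference, the TFRecord on-disk framing is actually documented: each record is an 8-byte little-endian payload length, a 4-byte masked CRC32C of that length, the serialized payload (typically a `tf.train.Example` protobuf), and a 4-byte masked CRC32C of the payload. A minimal pure-Python walker along those lines (a sketch; it skips CRC verification, since CRC32C isn’t in the standard library) makes the files easier to inspect than a hex editor:

```python
import struct

def iter_tfrecords(path):
    """Yield the raw payload bytes of each record in a TFRecord file.

    Record layout: uint64 length (little-endian), uint32 length-CRC,
    `length` payload bytes, uint32 payload-CRC.
    The CRCs are not verified here (CRC32C is not in the stdlib).
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break  # end of file
            (length,) = struct.unpack("<Q", header)
            f.read(4)                 # skip the masked CRC of the length
            payload = f.read(length)
            f.read(4)                 # skip the masked CRC of the payload
            yield payload

# Usage sketch (filename is hypothetical):
# for i, rec in enumerate(iter_tfrecords("train.tfrecord")):
#     print(i, len(rec))
```

Walking the records this way at least confirms whether the record lengths are sane; decoding the payloads themselves still needs the protobuf/Example schema.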
As a test I copied the tfrecord files over to the PC version of the framework, but that doesn’t change the PC version’s behavior: it still produces a valid trained model, which suggests the files themselves are fine and the problem lies in how they are read on the NX.
Has anybody else run into issues with model training using Python and TensorFlow on the Jetson Xavier NX (or any of the other Jetson boards, for that matter) with the latest version of TensorFlow (TensorFlow 2.2.0+nv20.6 on JetPack 4.4 / L4T 32.4.3)?
Any help is welcome!