Losses are different after training on Jetson or Desktop

Hello everyone,

Description

I am developing an RNN with TensorFlow as part of my master's thesis. For this, I compared training on a desktop PC with a CUDA-enabled GPU (RTX 2070), on a Jetson Nano, on a Xavier-NX, and on a virtual Strato server.
I keep all my code in a private GitHub repo, so I can run exactly the same code on every system.

I have set up a fixed network that I train for 1000 epochs to compare the training performance of each system (I know the Jetson is not intended for training). I check out the exact same commit on all systems, train for 1000 epochs, and compare how long the 1000 epochs took and what loss is reached at the end. I repeat all of this 10 times to get a longer-term overview.
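
The procedure described above (repeat the same training run several times, recording wall-clock time and final loss) could be sketched roughly as follows. This is only an illustration, not the code from the private repo: `benchmark`, `train_once`, and `fake_train` are hypothetical names, and `train_once` is assumed to build a fresh model, train it for the given number of epochs, and return the final loss.

```python
import time
from typing import Callable, List, Tuple


def benchmark(
    train_once: Callable[[int], float],
    epochs: int = 1000,
    repeats: int = 10,
) -> List[Tuple[float, float]]:
    """Call train_once(epochs) `repeats` times; return (seconds, final_loss) per run."""
    results = []
    for run in range(1, repeats + 1):
        start = time.perf_counter()
        # Hypothetical training function: assumed to start from a fresh model each time.
        final_loss = train_once(epochs)
        elapsed = time.perf_counter() - start
        results.append((elapsed, final_loss))
        print(f"run {run}: {elapsed:.1f} s, final loss {final_loss:.4f}")
    return results


if __name__ == "__main__":
    # Stand-in for the real training code, only so the sketch runs on its own.
    def fake_train(epochs: int) -> float:
        return 1.0 / epochs

    benchmark(fake_train, epochs=1000, repeats=3)
```

Keeping the timing and the final loss together per run makes it easy to tabulate the runs per device afterwards.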

I noticed that the achieved losses differ significantly between the systems:

Losses

Loop            Desktop   Xavier-NX   Nano      Strato
Loop 1          0.8238    0.5101      1.137     0.8286
Loop 2          0.8244    0.3763      0.3996    0.8297
Loop 3          0.8248    0.5252      0.973     0.8306
Loop 4          0.8245    0.6464      0.4718    0.8298
Loop 5          0.8249    0.6566      0.2626    0.8291
Loop 6          0.8243    0.3864      0.7825    0.8293
Loop 7          0.8251    0.1701      0.5865    0.8301
Loop 8          0.8249    2.659       1.17      0.8299
Loop 9          0.8249    0.2787      1.3087    0.8298
Loop 10         0.8241    1.0339      1.5388    0.8289
Average         0.8       0.7         0.9       0.8
Std. deviation  0.0       0.7         0.4       0.0
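
For reference, the summary rows can be recomputed from the ten per-run losses with a few lines of standard-library Python. This is a sketch: the values are copied from the table of losses, and the use of the sample standard deviation (n - 1) is an assumption, since the post does not say which variant was used.

```python
import statistics

# Final losses per run, copied from the table of losses (one list per device).
losses = {
    "Desktop": [0.8238, 0.8244, 0.8248, 0.8245, 0.8249,
                0.8243, 0.8251, 0.8249, 0.8249, 0.8241],
    "Xavier-NX": [0.5101, 0.3763, 0.5252, 0.6464, 0.6566,
                  0.3864, 0.1701, 2.659, 0.2787, 1.0339],
    "Nano": [1.137, 0.3996, 0.973, 0.4718, 0.2626,
             0.7825, 0.5865, 1.17, 1.3087, 1.5388],
    "Strato": [0.8286, 0.8297, 0.8306, 0.8298, 0.8291,
               0.8293, 0.8301, 0.8299, 0.8298, 0.8289],
}

for device, values in losses.items():
    mean = statistics.mean(values)
    # Assumption: sample standard deviation (divides by n - 1).
    std = statistics.stdev(values)
    print(f"{device:<10} mean={mean:.4f} std={std:.4f} "
          f"min={min(values):.4f} max={max(values):.4f}")
```

Printing four decimals instead of one shows the small run-to-run spread on the desktop and the server, which the rounded summary rows hide.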

What causes the losses to be so consistent on the desktop and the server but not on the Jetson systems?

Environment

TensorRT Version:
GPU Type: EVGA RTX 2070
Nvidia Driver Version: 466.11
CUDA Version: 11.3.1
CUDNN Version: 11.3
Operating System + Version: Windows 10
Python Version (if applicable): 3.8
TensorFlow Version (if applicable): 2.5.0
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Hi,
This looks like a Jetson issue. Please refer to the samples below in case they are useful.

For any further assistance, we recommend that you raise it with the respective platform via the link below.

Thanks!

Hi,

Could you share which JetPack and TensorFlow versions you use on the Nano and the Xavier-NX?

If the TensorFlow version is not v2.5.0, would you mind aligning the TensorFlow version first and trying again?
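
One way to collect comparable version information on each machine is a small script like the one below. It is a sketch with some assumptions: `installed_version` and `environment_report` are hypothetical helper names, the package is assumed to be installed under the distribution name `tensorflow` (some setups use a different name, such as `tensorflow-gpu`), and the JetPack version is not covered here.

```python
import platform
from importlib import metadata
from typing import Dict, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report() -> Dict[str, Optional[str]]:
    """Collect basic facts that should match across the systems being compared."""
    return {
        "machine": platform.machine(),  # CPU architecture as reported by the OS
        "python": platform.python_version(),
        # Assumption: the distribution is named "tensorflow" on this system.
        "tensorflow": installed_version("tensorflow"),
        "numpy": installed_version("numpy"),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running the same script on every system and comparing the output line by line shows quickly whether the TensorFlow versions are aligned.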

Thanks.

I upgraded TensorFlow on my Jetson to 2.5 with pip (it was 2.4 before, as suggested by the NVIDIA Jetson guide).
Now I get an "Illegal instruction (core dumped)" error when importing TensorFlow 2.5.
Is there a proper guide or set of instructions to resolve this?

I was able to update to TensorFlow 2.5 by following the NVIDIA-provided guide.
The losses on the Jetson systems are now as consistent as on the desktop systems, which were already running TF 2.5.
Thanks for helping to solve this problem.
Is there an explanation for why TF 2.4 behaves differently? I would not have expected such major differences between versions, especially since the behaviour should be determined by the optimizers.

Hi,

Since you have tried both TF 2.4 and TF 2.5 on Jetson and observed the difference, it seems the root cause is in TensorFlow. Maybe you can check with the TensorFlow team for the details.

Thanks.
