Hello everyone,
Description
I am developing an RNN with TensorFlow as part of my master's thesis. I compared training on a desktop PC with a CUDA-enabled GPU (RTX 2070), on a Jetson Nano, on a Jetson Xavier NX, and on a virtual Strato server.
I keep all my code in a private GitHub repo, so I can run the exact same code on every system.
I set up a fixed network that I train for 1000 epochs to compare the training performance of each system (I know the Jetsons are not intended for training). I check out the exact same commit on all systems, train for 1000 epochs, and record both the time the 1000 epochs took and the loss reached at the end. I repeat this 10 times per system to get a longer-term overview; a rough sketch of the measurement loop follows.
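For reference, the benchmark harness looks roughly like this (a minimal sketch; `build_model` and the dummy arrays are placeholders standing in for the actual network and dataset from the repo):

```python
import time
import numpy as np
import tensorflow as tf

def build_model():
    # Placeholder RNN; the real architecture is fixed in the repo.
    model = tf.keras.Sequential([
        tf.keras.layers.SimpleRNN(32, input_shape=(10, 4)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Dummy data standing in for the real dataset.
x = np.random.rand(256, 10, 4).astype("float32")
y = np.random.rand(256, 1).astype("float32")

for loop in range(10):
    model = build_model()
    start = time.time()
    history = model.fit(x, y, epochs=1000, verbose=0)
    elapsed = time.time() - start
    final_loss = history.history["loss"][-1]
    print(f"Loop {loop + 1}: {elapsed:.1f}s, final loss {final_loss:.4f}")
```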
I noticed that the final loss differs significantly between the systems:
| Loop | Desktop | Xavier NX | Nano | Strato |
|---|---|---|---|---|
| 1 | 0.8238 | 0.5101 | 1.137 | 0.8286 |
| 2 | 0.8244 | 0.3763 | 0.3996 | 0.8297 |
| 3 | 0.8248 | 0.5252 | 0.973 | 0.8306 |
| 4 | 0.8245 | 0.6464 | 0.4718 | 0.8298 |
| 5 | 0.8249 | 0.6566 | 0.2626 | 0.8291 |
| 6 | 0.8243 | 0.3864 | 0.7825 | 0.8293 |
| 7 | 0.8251 | 0.1701 | 0.5865 | 0.8301 |
| 8 | 0.8249 | 2.659 | 1.17 | 0.8299 |
| 9 | 0.8249 | 0.2787 | 1.3087 | 0.8298 |
| 10 | 0.8241 | 1.0339 | 1.5388 | 0.8289 |
| Average | 0.8 | 0.7 | 0.9 | 0.8 |
| Std. deviation | 0.0 | 0.7 | 0.4 | 0.0 |
What causes the loss to be very consistent on the desktop and the server, but not on the Jetson systems?
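For what it's worth, here is a minimal sketch of how all RNG sources could be pinned to rule out seeding differences as the cause (assuming TF 2.5, where the `TF_DETERMINISTIC_OPS` environment variable is, to my understanding, the way to request deterministic GPU kernels):

```python
import os
import random

# Request deterministic cuDNN/GPU kernels where supported; set this
# before TensorFlow runs any ops (tf.config.experimental.
# enable_op_determinism only appears in later TF releases).
os.environ["TF_DETERMINISTIC_OPS"] = "1"

import numpy as np
import tensorflow as tf

# Pin every RNG that can influence weight init and data shuffling.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
```

Even with pinned seeds I would expect small cross-platform differences from floating-point accumulation order and kernel selection, but presumably nothing on the scale of the Xavier NX and Nano columns above.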
Environment
TensorRT Version:
GPU Type: EVGA RTX 2070
Nvidia Driver Version: 466.11
CUDA Version: 11.3.1
CUDNN Version: 11.3
Operating System + Version: Windows 10
Python Version (if applicable): 3.8
TensorFlow Version (if applicable): 2.5.0
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):