TensorFlow getting different results on Jetson

Hi,

I trained a TF/Keras model (UNet architecture) on a Tesla K40. When I run it on the Jetson Xavier (JetPack 4.4.1), however, I get very different results, even though I don’t get any error message (the only “strange” message I get is ‘ARM64 does not support NUMA - returning NUMA node zero’, but no failure).

Jetson output is very strange - this is an example of the output array (or part of it):

JETSON:

[0.07289112]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
(the numbers change after a while, but not many of them)

while on TESLA:
[0.09022264]
[0.08399124]
[0.09759483]
[0.08107567]
[0.07902569]
[0.11560011]
[0.78396565]
[0.81917554]
[0.31143603]
[0.09954672]
[0.31214723]
[0.0928173 ]
[0.07590267]
[0.88202167]
[0.08934084]

Model was trained with TensorFlow 2.3.1 (tried also with TF 2.0)
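
To put a number on the symptom instead of eyeballing it, a minimal sketch like the one below reports the longest run of identical consecutive values in a flattened output. The name `longest_run` is made up for illustration, and the two lists are simply the excerpts above copied by hand.

```python
from itertools import groupby


def longest_run(values):
    """Return (value, length) of the longest run of identical consecutive values."""
    best_value, best_length = None, 0
    for value, group in groupby(values):
        length = sum(1 for _ in group)
        # Strict comparison keeps the earliest run when several tie.
        if length > best_length:
            best_value, best_length = value, length
    return best_value, best_length


# Output excerpts copied from the post (one value per row).
jetson = [0.07289112] + [0.10021071] * 14
tesla = [
    0.09022264, 0.08399124, 0.09759483, 0.08107567, 0.07902569,
    0.11560011, 0.78396565, 0.81917554, 0.31143603, 0.09954672,
    0.31214723, 0.0928173, 0.07590267, 0.88202167, 0.08934084,
]

print("Jetson:", longest_run(jetson))
print("Tesla: ", longest_run(tesla))
```

For a real prediction array, something like `pred.ravel().tolist()` should give the flat list this function expects. A long run on one device and not on the other is a quick hint that the output is corrupted rather than just numerically different.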

Hi,

We haven’t seen this issue before and would like to reproduce it in our environment.
Could you share the TensorFlow script and required data with us?

Thanks.

Here’s the standalone code to reproduce the issue:

I’m doing some more tests now to check whether I can find any fault in the code. I’ve been using it for months (always working properly) and have only now tested it on a Jetson AGX Xavier.

UPDATE: I tried using PIL instead of libtiff to open the .tif image and save it, but the results still don’t match (there are huge differences). The problem seems to be running a model that was trained on a different architecture from the Jetson.

UPDATE: when testing with a single 128x128x26 image, everything works properly. It seems that a batch size > 1 triggers some sort of issue. I will try with a subsample. Could this be caused by a memory issue?

Hi,

Thanks for the update.
To check for a memory issue, could you please run the script with cuda-memcheck?

Ex.

$ sudo /usr/local/cuda-10.2/bin/cuda-memcheck python3 test.py

Thanks.

I did, but it has apparently been stuck (for more than an hour) after:

2020-11-06 13:03:38.836944: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10

I tested the code with only two samples and it runs properly. I strongly suspect a memory problem (to me, it looks like results are being overwritten). Batch_size = 2 still works. I could investigate the maximum number of samples that does not corrupt the results, but that wouldn’t be very productive for my use case; knowing how I could actually solve this would be much better.
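
For what it’s worth, the search for the largest batch size that still gives correct results could be automated with something like the sketch below. It assumes the batch-size-1 output can be trusted as the reference. `predict_fn(samples, batch_size)` is a hypothetical callable returning a flat list of floats, and `fake_predict` is only a stand-in (with an arbitrary threshold of 2) so the sketch runs without TensorFlow.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(a) != len(b):
        raise ValueError("outputs have different lengths")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def largest_matching_batch_size(predict_fn, samples, candidates, tol=1e-4):
    """Return the largest candidate batch size whose output matches batch size 1.

    Candidates are tried in increasing order and the search stops at the
    first mismatch, so later candidates are not evaluated.
    """
    reference = predict_fn(samples, 1)  # assumed-correct baseline
    best = 1
    for size in sorted(candidates):
        if max_abs_diff(predict_fn(samples, size), reference) > tol:
            break
        best = size
    return best


def fake_predict(samples, batch_size):
    # Stand-in for a real model: correct up to batch size 2 (arbitrary),
    # after that it repeats a single value, mimicking the symptom.
    if batch_size <= 2:
        return [s * 0.5 for s in samples]
    return [samples[0] * 0.5] * len(samples)


print(largest_matching_batch_size(fake_predict, [0.2, 0.4, 0.6, 0.8], [2, 3, 4, 8]))
```

With a Keras model, `predict_fn` could be something like `lambda x, bs: model.predict(x, batch_size=bs).ravel().tolist()`.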

Brief UPDATE: I tried to use:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')

if gpus:
    # Cap TensorFlow's memory use on the first GPU at 4096 MB.
    # This must run before the GPU is initialized.
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])

but the results don’t change. I’m also running cuda-memcheck with this option set, but it’s stuck as before.

EDIT: I tried every solution that came to mind. In the end, what I noticed is that if I set batch_size in model.predict() to 3 or less, the results are correct. An alternative is to pass only 3 (or fewer) inputs instead of the entire collection, without setting batch_size (which defaults to 32 in model.predict()). It seems that when performing inference on more than 3 samples at once, TF fails and gives wrong results… Any ideas or suggestions? cuda-memcheck was stuck for 4 hours without giving any result…
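
As a stopgap (a workaround, not a fix), never handing the model more than three samples at a time can be sketched as below. `predict_in_chunks` and `predict_fn` are illustrative names, and `fake_predict` is a stand-in so the sketch runs without TensorFlow.

```python
def predict_in_chunks(predict_fn, samples, chunk_size=3):
    """Call predict_fn on at most chunk_size samples at a time and join the results."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    outputs = []
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        # Each call sees a small chunk, so the model never gets a large batch.
        outputs.extend(predict_fn(chunk))
    return outputs


# Stand-in for a real model: doubles each input and records the chunk sizes.
seen_sizes = []


def fake_predict(chunk):
    seen_sizes.append(len(chunk))
    return [x * 2 for x in chunk]


result = predict_in_chunks(fake_predict, list(range(8)))
print(result)
print(seen_sizes)
```

With a Keras model, `predict_fn` could be `lambda chunk: model.predict(chunk)`; the function then returns a list of per-sample outputs, which `np.stack` can turn back into a single array.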

Hi,

First, could you monitor the device status with tegrastats to see whether it reaches the memory limit?

$ sudo tegrastats

Also, to prevent TensorFlow from occupying the memory too early, please add the configuration shared below and try again:

Thanks.

Hi,

About the first point: I’m pretty sure I already did that and memory was not full. In any case, I already set the VirtualDeviceConfiguration memory limit to 4096 (I also tried lower values, but nothing changes).

These are the latest updates:

  1. TF results match when using UNet if I set batch_size=1 in “model.predict()”
  2. TF results match at 98% (2 wrong results out of 42) when using a CNN (that only classifies the pair) if I set batch_size=1 in “model.predict()”
  3. TF-TRT results match if using FP32 or FP16 mode.

To me, this looks like a bug in TF on Jetson, but I might be wrong.