I trained a TF/Keras model (UNet architecture) on a Tesla K40. When I use it on a Jetson Xavier (JetPack 4.4.1), however, I get very different results, even though I don’t get any error message (the only “strange” message I get is ‘ARM64 does not support NUMA - returning NUMA node zero’ - but no failure).
The Jetson output is very strange - this is an example of the output array (or part of it):
JETSON:
[0.07289112]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
[0.10021071]
(the numbers change after a while, but not many of them do)
while on TESLA:
[0.09022264]
[0.08399124]
[0.09759483]
[0.08107567]
[0.07902569]
[0.11560011]
[0.78396565]
[0.81917554]
[0.31143603]
[0.09954672]
[0.31214723]
[0.0928173 ]
[0.07590267]
[0.88202167]
[0.08934084]
The model was trained with TensorFlow 2.3.1 (I also tried TF 2.0).
I’m doing some more tests now to check whether I can find any fault in the code. I’ve been using this code for months (it has always worked properly), and only now have I tested it on a Jetson AGX Xavier.
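For what it’s worth, the kind of check I have in mind is simply dumping the prediction arrays on both machines and diffing them offline - a minimal sketch with placeholder file names, not my exact script:

import numpy as np

# On each machine, save the predictions for the same input batch first, e.g.:
# np.save('preds_tesla.npy', preds)   # on the K40
# np.save('preds_jetson.npy', preds)  # on the Xavier

# Then compare the two dumps offline:
a = np.load('preds_tesla.npy')
b = np.load('preds_jetson.npy')
print('max abs diff:', np.abs(a - b).max())
print('allclose:', np.allclose(a, b, atol=1e-5))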
UPDATE: I tried using PIL to open the .tif image and save it (instead of libtiff), but the results still don’t match (there are huge differences). Apparently, the problem seems to be related to using a model trained on a different architecture than the Jetson.
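To clarify what I mean by opening the .tif with PIL, it is something along these lines - a minimal sketch that assumes a multi-page TIFF with one 128x128 page per slice (the exact file layout is an assumption here):

import numpy as np
from PIL import Image, ImageSequence

def load_tif_stack(path):
    # Read every page of the multi-page TIFF and stack them along the last
    # axis, e.g. a 26-page file becomes an array of shape (128, 128, 26).
    with Image.open(path) as img:
        pages = [np.array(page) for page in ImageSequence.Iterator(img)]
    return np.stack(pages, axis=-1).astype(np.float32)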
UPDATE: when testing with a single 128x128x26 image, everything works properly. It seems that dealing with a batch > 1 raises some sort of issue. I will try with a subsample. Could this be caused by a memory issue?
I did, but apparently it is still stuck (for more than an hour) after:
2020-11-06 13:03:38.836944: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
I tested the code with only two samples and it runs properly. I strongly suspect a memory problem (to me, it looks like results are being overwritten). batch_size = 2 still works; I could look for the maximum number of samples that does not corrupt the results, but that wouldn’t be very productive for my use case, and knowing how I could actually solve this would be much better. For reference, this is the GPU memory configuration I set:
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Cap TensorFlow's allocation on the Xavier's single GPU at 4096 MB
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
but the results don’t change. I’m also running cuda-memcheck with this option set, but it’s stuck as always.
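A related option, listed here only for completeness, is enabling memory growth instead of a hard cap - a minimal sketch of the standard TF API (I have not confirmed that it changes the results):

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    # Allocate GPU memory on demand instead of reserving a fixed block up front
    tf.config.experimental.set_memory_growth(gpu, True)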
EDIT: I tried every solution that came to my mind. In the end, what I noticed is that if I set batch_size in the model.predict() method to 3 or less, the results are correct. An alternative is to select only 3 (or fewer) inputs instead of the entire collection, without setting batch_size (which defaults to 32 in model.predict()). Apparently, when performing inference on more than 3 samples at a time, TF fails and gives wrong results… Any idea/suggestion? cuda-memcheck has been stuck for 4 hours without giving any result…
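Concretely, the workaround amounts to forcing small batches at inference time - a minimal sketch where the model path and the input array are placeholders for my actual pipeline:

import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model('unet.h5')  # placeholder path to the trained UNet
x_test = np.load('x_test.npy')                 # placeholder input array

# Workaround: force single-sample batches so the results stay correct on the Jetson
preds = model.predict(x_test, batch_size=1)

# Equivalent manual loop, predicting one sample at a time and re-stacking
preds_loop = np.concatenate(
    [model.predict(x_test[i:i + 1]) for i in range(len(x_test))], axis=0)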
Regarding the first suggestion, I’m pretty sure I already did it and the memory was not full. In any case, I have already set the VirtualDeviceConfiguration memory limit to 4096 (I also tried lower values, but nothing changes).
These are the latest updates:
TF results match when using UNet if I set batch_size=1 in “model.predict()”
TF results match at 98% (2 wrong results out of 42) when using a CNN (that only classifies the pair) if I set batch_size=1 in “model.predict()”
TF-TRT results match when using FP32 or FP16 mode (conversion sketch below).
To me, it seems like something buggy in TF on Jetson, but I might be wrong.
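For reference, the TF-TRT conversion mentioned above follows the standard TrtGraphConverterV2 path - a minimal sketch in FP16 mode with placeholder SavedModel paths, not my exact script:

from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert the exported SavedModel to a TF-TRT graph in FP16 mode
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='saved_model',  # placeholder path
    conversion_params=params)
converter.convert()
converter.save('saved_model_trt_fp16')    # placeholder output path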