TX2 produces different results across runs and differs from a GPU server (TensorFlow and Keras)

I found that the TX2 produces different results across multiple runs, and that its results differ from those of a GPU server. Since I may not be able to release the source code due to copyright issues, could anyone give me some hints for debugging this issue based on the following descriptions and observations?

I am working on an object detection project using the TX2. The deep neural network (DNN) is customized for our own dataset based on [1]. It has already been trained on the server and the weights are saved to a file. The problems happen when I try to run inference on several images on the TX2 using the pre-trained weights. Here I list several strange things:
(1) Using a fixed input, the output of the DNN on the TX2 is different from that on the GPU servers (I compared against two servers, one with an NVIDIA P100 and one with an NVIDIA K20c). The inference script is modified from [4]. "gpu_options.allow_growth = True" is set for the TensorFlow session, as in the sketch after this list.
(2) Using a fixed input, the output of the DNN on the TX2 varies between EACH pair of runs. I never get the same output twice, while this never happens when I use the same script, same model, and same weights on either of the GPU servers.
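
For reference, the session option in (1) is set in the usual way for Keras with a TensorFlow 1.x backend; the snippet below is a minimal illustration of that setup, not an excerpt from our project code:

# Minimal sketch of the session setup (TF 1.8 / Keras 2.2): let TensorFlow
# allocate GPU memory on demand instead of grabbing it all at start-up.
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))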

Personal debugging observations:
(3) Although I don't think the model contains any random layers at inference time, I tried fixing the random seeds in the numpy, random, and tensorflow libraries (sketched after this list) but still got the same problems.
(4) The layer at which the results begin to diverge depends on how I run the inference code. If I use a .py script, the results diverge immediately after a convolution layer. If I use a Jupyter notebook, split the .py script into three cells (the first imports tensorflow, keras, and the private library and sets up the session; the second loads the weights; the third runs inference on the images), run the 1st and 2nd cells only once at the beginning, and run the 3rd cell multiple times, then the results diverge at a few intermediate layers and at the final output layer.
(5) The absolute difference of the output (TX2 vs. server) at some intermediate layers is mostly at the 1e-3 level (the activations are usually around the 10 level), but the differing elements cover 50%~90% of the n-dimensional tensor.
(6) The absolute difference of the output (TX2 vs. server) at a dilated convolution layer [2] (one intermediate layer in our model) increases dramatically, and thus all layers depending on this layer produce totally erroneous output.
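
For reference, the seed fixing mentioned in (3) follows the usual TF 1.x pattern; this is a minimal illustration, not our project code:

# Fix all three random sources before building the model (TF 1.x API).
import random
import numpy as np
import tensorflow as tf

random.seed(0)
np.random.seed(0)
tf.set_random_seed(0)  # graph-level seed in TensorFlow 1.x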

Device: NVIDIA Tegra X2
OS: Ubuntu 16.04
Software: Python 3.5 + virtualenv + pip-installed TensorFlow 1.8.0 [3] + Keras 2.2.0
Note: I got a lot of errors when installing Keras on the TX2, so I have forgotten the exact procedure by which it was installed.

References:
[1] https://github.com/pierluigiferrari/ssd_keras/blob/master/models/keras_ssd512.py
[2] https://towardsdatascience.com/understanding-2d-dilated-convolution-operation-with-examples-in-numpy-and-tensorflow-with-d376b3972b25
[3] https://devtalk.nvidia.com/default/topic/1031300/jetson-tx2/tensorflow-1-8-wheel-with-jetpack-3-2-/
[4] https://github.com/pierluigiferrari/ssd_keras/blob/master/ssd512_inference.ipynb

Hi,

Could you help us with the following experiments:

1) Please check whether you can reproduce this issue with the reference [1] model, without your customizations.

2) Please check whether this issue also occurs in CPU-only mode.

Thanks.

Hi AastaLLL,
I did some follow-up experiments.
Experiment 1. I forked [1] into my own repo [5] and used simpler scripts to test: "ssd512_inference_gpu.py", "ssd512_inference_gpu_tx2.py" and "ssd512_inference_cpu.py" for the GPU-server, GPU-TX2 and CPU modes, respectively. Here are my observations:
(1) "ssd512_inference_cpu.py" produces consistent results across 3 runs on each device (NVIDIA TX2, NVIDIA K20c GPU, NVIDIA P100 GPU). The results are also consistent across the three devices. – Good news!
(2) "ssd512_inference_gpu.py" produces consistent results across 3 runs on each server GPU. The results are also consistent across the two servers.
(3) "ssd512_inference_gpu_tx2.py" fails on the TX2. The error information is as follows:

Traceback (most recent call last):
  File "ssd512_inference_gpu_tx2.py", line 100, in <module>
    y_pred = model.predict(input_images)
  File "/home/nvidia/DroneNavi/PretrainedModel/env3_tx2/lib/python3.5/site-packages/keras/engine/training.py", line 1172, in predict
    steps=steps)
  File "/home/nvidia/DroneNavi/PretrainedModel/env3_tx2/lib/python3.5/site-packages/keras/engine/training_arrays.py", line 297, in predict_loop
    batch_outs = f(ins_batch)
  File "/home/nvidia/DroneNavi/PretrainedModel/env3_tx2/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2661, in __call__
    return self._call(inputs)
  File "/home/nvidia/DroneNavi/PretrainedModel/env3_tx2/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2631, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1454, in __call__
    self._session._session, self._handle, args, status, None)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true / nonzero indices.  temp_storage_bytes: 767, status: too many resources requested for launch
	 [[Node: decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Where = Where[T=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Reshape_1)]]
	 [[Node: decoded_predictions/loop_over_batch/while/loop_over_classes/while/cond/switch_f/_580 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1021_...d/switch_f", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoded_predictions/loop_over_batch/while/loop_over_classes/while/TensorArrayReadV3/_490)]]

The warnings and stdout output before the error occurs are as follows:

2019-03-12 22:35:00.241118: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2019-03-12 22:35:00.241274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 3.30GiB
2019-03-12 22:35:00.241321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2019-03-12 22:35:01.068244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-12 22:35:01.068349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2019-03-12 22:35:01.068378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2019-03-12 22:35:01.068578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2882 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Using TensorFlow backend.
2019-03-12 22:35:17.144352: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.36GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-12 22:35:18.538466: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.69GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-12 22:35:18.752129: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.25GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-12 22:35:21.512948: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-12 22:35:21.929950: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at where_op.cc:287 : Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true / nonzero indices.  temp_storage_bytes: 767, status: too many resources requested for launch

Experiment 2. I ran my project code in CPU-only mode (see the sketch after this list for how CPU-only mode can be forced). Here are my observations:
(1) The code produces consistent results across 3 runs on each device (NVIDIA TX2, NVIDIA K20c GPU, NVIDIA P100 GPU). – Good news!
(2) The results are different between any two of the three devices. – I cannot understand this.
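
For reference, CPU-only mode can be forced by hiding the GPU from TensorFlow before it is imported; the snippet below is a minimal standalone illustration of that approach, not an excerpt from the scripts in [5]:

# Hide the GPU so every op falls back to /device:CPU:0.
# This must run before TensorFlow is imported for the first time.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import tensorflow as tf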

To summarize:
(1) With the open-source code [5] in CPU-only mode, all devices produce consistent outputs. This is a good result to build on.
(2) With the open-source code in GPU mode, it fails on the TX2; we first need to work out how to run it successfully.
(3) With my project code in CPU-only mode, the three devices produce different outputs, although each device is self-consistent across multiple runs. This result contradicts what I got in (1) and I cannot find the reason. (A sketch of the kind of per-layer comparison behind observations (4)-(6) in my first post follows after this list.)
(4) With my project code in GPU mode, the results are even more inconsistent, as described in post #1, and I would like to hear your thoughts on my two follow-up experiments.
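
For context, comparing outputs between devices comes down to dumping every layer's activation for a fixed input and diffing the files offline; the sketch below is a simplified standalone illustration of that idea, and dump_activations is a made-up helper name rather than code from [1] or [5]:

# Dump every layer's activation for a fixed input on one device, then diff the
# saved files offline with numpy to find the first layer where results diverge.
import numpy as np
from keras import backend as K

def dump_activations(model, x, path):
    layers = model.layers[1:]  # skip the InputLayer
    fetch = K.function([model.input, K.learning_phase()],
                       [l.output for l in layers])
    outs = fetch([x, 0])       # 0 = test phase
    np.savez(path, **{l.name: o for l, o in zip(layers, outs)})

# On each device:  dump_activations(model, input_images, 'acts_tx2.npz')
# Offline:         a = np.load('acts_tx2.npz'); b = np.load('acts_p100.npz')
#                  for name in a.files: print(name, np.abs(a[name] - b[name]).max())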

p.s. To debug the code on the TX2's GPU in Experiment 1, I am tracking some related previous topics in [6], [7] and learning some low-level CUDA basics from [8], [9]. Hopefully I can make it work, since my modified project code does run on the TX2's GPU. Meanwhile, your feedback is always welcome.

References:
[1] https://github.com/pierluigiferrari/ssd_keras/blob/master/models/keras_ssd512.py
[5] https://github.com/StarsThu2016/ssd_keras
[6] https://devtalk.nvidia.com/default/topic/1013464/jetson-tx2/gpu-out-of-memory-when-the-total-ram-usage-is-2-8g/2
[7] https://devtalk.nvidia.com/default/topic/1033209/jetson-tx2/general-question-about-jetsons-gpu-cpu-shared-memory-usage/
[8] https://devblogs.nvidia.com/unified-memory-cuda-beginners/
[9] https://devtalk.nvidia.com/default/topic/1039179/cuda-programming-and-performance/cudamemgetinfo-free-mem-value-is-not-correct/

Hi,

Is there any possibility that your model/app has some randomness?

If I understand correctly, the customized app yields differences both across platforms and across processes.
So this issue may not be related to the system or GPU, but rather comes from the implementation or the model.

Thanks.

Hi AastaLLL,

Thanks very much for the analysis. We finally found that the large discrepancy between the server execution and the TX2 execution originates from an atrous convolution layer (line 299 in [1]). After changing this unusual convolution layer to a normal convolution layer, the difference between the server output and the TX2 output becomes negligible. We do not know the root cause of why this atrous convolution layer produces such a difference in execution; it may lie in Keras, TensorFlow, an embedded computation library, cuDNN, CUDA, or the hardware. The software stack is too deep for us to locate the root cause.
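
For anyone hitting the same issue, the layer can also be tested in isolation. The sketch below is our own hypothetical standalone test (not code from [1]): it runs a single dilated convolution with fixed weights and a fixed input, so the saved output can be diffed across devices.

# Run one dilated (atrous) convolution with deterministic weights and input,
# then save the output; run the same script on the server and the TX2 and
# compare the two .npy files to confirm the drift comes from this layer alone.
import numpy as np
from keras.layers import Conv2D, Input
from keras.models import Model

rng = np.random.RandomState(0)
x = rng.rand(1, 64, 64, 8).astype('float32')

inp = Input(shape=(64, 64, 8))
out = Conv2D(16, (3, 3), dilation_rate=(6, 6), padding='same',
             use_bias=False, name='atrous_test')(inp)
model = Model(inp, out)
model.get_layer('atrous_test').set_weights(
    [rng.rand(3, 3, 8, 16).astype('float32')])

np.save('atrous_out.npy', model.predict(x))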

Reference:
[1] https://github.com/pierluigiferrari/ssd_keras/blob/master/models/keras_ssd512.py

Many thanks.