Hi AastaLLL,
I did some follow-up experiments.
Experiment 1: I forked [1] into my own repo [5] and used simpler scripts to test: “ssd512_inference_gpu.py”, “ssd512_inference_gpu_tx2.py” and “ssd512_inference_cpu.py” for the gpu-server, gpu-tx2 and cpu modes, respectively. Here are my observations,
(1) “ssd512_inference_cpu.py” produces consistent results across 3 runs on each device (NVIDIA TX2, NVIDIA K20c GPU, NVIDIA P100 GPU). The results are also consistent across the three devices. – Good news!
(2) “ssd512_inference_gpu.py” produces consistent results across 3 runs on each server GPU (K20c and P100). The results are also consistent across the two servers.
(3) “ssd512_inference_gpu_tx2.py” fails on TX2. The error information is as follows,
Traceback (most recent call last):
File "ssd512_inference_gpu_tx2.py", line 100, in <module>
y_pred = model.predict(input_images)
File "/home/nvidia/DroneNavi/PretrainedModel/env3_tx2/lib/python3.5/site-packages/keras/engine/training.py", line 1172, in predict
steps=steps)
File "/home/nvidia/DroneNavi/PretrainedModel/env3_tx2/lib/python3.5/site-packages/keras/engine/training_arrays.py", line 297, in predict_loop
batch_outs = f(ins_batch)
File "/home/nvidia/DroneNavi/PretrainedModel/env3_tx2/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2661, in __call__
return self._call(inputs)
File "/home/nvidia/DroneNavi/PretrainedModel/env3_tx2/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2631, in _call
fetched = self._callable_fn(*array_vals)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1454, in __call__
self._session._session, self._handle, args, status, None)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true / nonzero indices. temp_storage_bytes: 767, status: too many resources requested for launch
[[Node: decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Where = Where[T=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoded_predictions/loop_over_batch/while/loop_over_classes/while/boolean_mask/Reshape_1)]]
[[Node: decoded_predictions/loop_over_batch/while/loop_over_classes/while/cond/switch_f/_580 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1021_...d/switch_f", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoded_predictions/loop_over_batch/while/loop_over_classes/while/TensorArrayReadV3/_490)]]
The warnings and stdout printed before the error occurs are as follows,
2019-03-12 22:35:00.241118: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2019-03-12 22:35:00.241274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 3.30GiB
2019-03-12 22:35:00.241321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2019-03-12 22:35:01.068244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-12 22:35:01.068349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2019-03-12 22:35:01.068378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2019-03-12 22:35:01.068578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2882 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Using TensorFlow backend.
2019-03-12 22:35:17.144352: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.36GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-12 22:35:18.538466: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.69GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-12 22:35:18.752129: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.25GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-12 22:35:21.512948: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-03-12 22:35:21.929950: W tensorflow/core/framework/op_kernel.cc:1318] OP_REQUIRES failed at where_op.cc:287 : Internal: WhereOp: Could not launch cub::DeviceReduce::Sum to count number of true / nonzero indices. temp_storage_bytes: 767, status: too many resources requested for launch
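Regarding the allocator warnings and the WhereOp launch failure above: since the TX2’s GPU shares its physical RAM with the CPU, my first guess (not yet verified) is to stop TensorFlow from reserving a large block up front (~2.8 GB in the log above) and let it grow on demand instead. A minimal sketch of the session setup I plan to try in “ssd512_inference_gpu_tx2.py” (standard TF 1.x / Keras 2.x APIs only, nothing specific to the repo):

import tensorflow as tf
from keras import backend as K

# Let TensorFlow allocate GPU memory on demand and cap its share of the
# CPU/GPU-shared RAM, instead of grabbing a large block at startup.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # initial guess, to be tuned
K.set_session(tf.Session(config=config))

# ... then build the SSD512 model and call model.predict() as before.

If the Where op still fails to launch after that, my fallback idea is to keep the decoding stage off the GPU entirely (run the raw network on the GPU and decode the predictions on the CPU), but I have not tried that yet.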
Experiment 2: I ran my own project code in CPU-only mode. Here are my observations,
(1) The code produces consistent results across 3 runs on each device (NVIDIA TX2, NVIDIA K20c GPU, NVIDIA P100 GPU). – Good news!
(2) The results differ between any two of the three devices (see the layer-by-layer comparison sketch below). – I cannot understand this.
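To localize where this CPU-only divergence starts, my plan is to dump every layer’s output on each device and compare them offline. A rough sketch of the probe I intend to use (my own debugging code, not part of [5]; it assumes the standard Keras functional model returned by the repo):

import numpy as np
from keras.models import Model

def dump_layer_outputs(model, input_images, path):
    # Expose every intermediate layer output of the already-built SSD model
    # and save them to disk for an offline, device-to-device comparison.
    probe = Model(inputs=model.input,
                  outputs=[layer.output for layer in model.layers[1:]])
    outputs = probe.predict(input_images)
    np.savez(path, **{layer.name: out
                      for layer, out in zip(model.layers[1:], outputs)})

# Run dump_layer_outputs(model, input_images, 'tx2_cpu.npz') on each device,
# then compare, e.g., np.max(np.abs(a[name] - b[name])) layer by layer to see
# where the devices first disagree.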
To summarize,
(1) With the open-source code in CPU-only mode [5], all devices produce consistent outputs. This is a good baseline to build on.
(2) With the open-source code in GPU mode, it fails on the TX2, so I first need to work out how to run it successfully there.
(3) With my project code in CPU-only mode, the three devices produce different outputs, although each device is self-consistent across multiple runs. This contradicts what I got in (1) and I cannot find the reason.
(4) With my project code in GPU mode, the results are even more inconsistent, as described in post #1. I would like to hear your thoughts on these two follow-up experiments.
P.S. To debug the code on the TX2’s GPU in experiment 1, I am reading some related previous topics in [6], [7] and learning some basic low-level CUDA concepts in [8], [9]. Hopefully, I can make it work, since my modified code does run on the TX2’s GPU. Meanwhile, your feedback is always welcome.
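On that note, because the GPU and CPU share the same physical RAM on the TX2 (the point of [7]), I am also checking how much memory is actually left right before inference. A tiny helper along these lines (my own snippet, not from the repo):

def available_mem_gib(meminfo='/proc/meminfo'):
    # On Tegra the GPU has no dedicated VRAM, so the system's MemAvailable
    # is a reasonable proxy for how much the GPU can still allocate.
    with open(meminfo) as f:
        for line in f:
            if line.startswith('MemAvailable:'):
                return int(line.split()[1]) / (1024.0 ** 2)  # kB -> GiB
    return None

print('Available shared RAM: %.2f GiB' % available_mem_gib())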
References:
[1] https://github.com/pierluigiferrari/ssd_keras/blob/master/models/keras_ssd512.py
[5] https://github.com/StarsThu2016/ssd_keras
[6] https://devtalk.nvidia.com/default/topic/1013464/jetson-tx2/gpu-out-of-memory-when-the-total-ram-usage-is-2-8g/2
[7] https://devtalk.nvidia.com/default/topic/1033209/jetson-tx2/general-question-about-jetsons-gpu-cpu-shared-memory-usage/
[8] Unified Memory for CUDA Beginners | NVIDIA Technical Blog
[9] https://devtalk.nvidia.com/default/topic/1039179/cuda-programming-and-performance/cudamemgetinfo-free-mem-value-is-not-correct/