Does anyone have experience with TensorFlow on the Nano? How is the performance? I just tried a simple script and it ran about 1.7x faster on the CPU than on CUDA.
I followed the steps to install TensorFlow: https://docs.nvidia.com/deeplearning/dgx/install-tf-xavier/index.html (similar to the steps mentioned in another topic here)
Then I tried to run this sample, which is modified from the "TensorFlow for Deep Learning" companion code:
https://github.com/kitsook/dlwithtf/blob/fix-lost-function/ch3/linear_regression_tf.py
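For context, the script is basically the book's toy linear regression in TF 1.x graph mode. A rough sketch of what it does (the constants, step count, and timing are my approximation, not the exact code from the repo):

import time
import numpy as np
import tensorflow as tf

# Synthetic data for y = W*x + b with a bit of noise (true W=5, b=2).
N = 100
x_np = np.random.rand(N, 1)
y_np = 5 * x_np + 2 + np.random.normal(scale=0.1, size=(N, 1))

x = tf.placeholder(tf.float32, (N, 1))
y = tf.placeholder(tf.float32, (N, 1))
W = tf.Variable(tf.random_normal((1, 1)))
b = tf.Variable(tf.random_normal((1,)))
loss = tf.reduce_sum((y - (tf.matmul(x, W) + b)) ** 2)
train_op = tf.train.GradientDescentOptimizer(0.001).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    start = time.time()
    for _ in range(8000):  # thousands of tiny steps, each one a separate session.run call
        sess.run(train_op, feed_dict={x: x_np, y: y_np})
    print("Time taken for learning:", time.time() - start)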
I switched Ubuntu to runlevel 3 (multi-user target) to free up some memory before the tests:
sudo systemctl isolate multi-user.target
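To see how much memory that actually frees, a quick check of /proc/meminfo before and after works (just a sketch):

# Print total and available memory as reported by the kernel.
with open("/proc/meminfo") as f:
    for line in f:
        if line.startswith(("MemTotal", "MemAvailable")):
            print(line.strip())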
Running the code normally on CUDA, the training part took 43 seconds:
(linear_regression_tf.py:8106): Gdk-CRITICAL **: 21:21:38.160: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
(linear_regression_tf.py:8106): Gdk-CRITICAL **: 21:21:38.164: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-03-27 21:21:41.201561: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2019-03-27 21:21:41.202159: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x3f3f6740 executing computations on platform Host. Devices:
2019-03-27 21:21:41.202218: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): <undefined>, <undefined>
2019-03-27 21:21:41.289107: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:965] ARM64 does not support NUMA - returning NUMA node zero
2019-03-27 21:21:41.289392: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x3d4e3d50 executing computations on platform CUDA. Devices:
2019-03-27 21:21:41.289441: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): NVIDIA Tegra X1, Compute Capability 5.3
2019-03-27 21:21:41.289806: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
totalMemory: 3.86GiB freeMemory: 1.94GiB
2019-03-27 21:21:41.289863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-27 21:21:42.374481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-27 21:21:42.374562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-03-27 21:21:42.374592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-03-27 21:21:42.374771: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1518 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
2019-03-27 21:21:43.782819: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10.0 locally
Training reulst: W=4.765114, b=2.123778
Time taken for learning: 43.052430391311646
Pearson R^2: 0.994371
RMS: 0.119845
I then ran the same script on the CPU by setting an environment variable to hide CUDA:
CUDA_VISIBLE_DEVICES="" python3 linear_regression_tf.py
It took about 26 seconds to complete the training, roughly 40% less time than the CUDA run:
(linear_regression_tf.py:8295): Gdk-CRITICAL **: 21:22:35.522: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
(linear_regression_tf.py:8295): Gdk-CRITICAL **: 21:22:35.526: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-03-27 21:22:38.583032: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2019-03-27 21:22:38.584033: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x9774bd0 executing computations on platform Host. Devices:
2019-03-27 21:22:38.584096: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): <undefined>, <undefined>
2019-03-27 21:22:38.613701: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-03-27 21:22:38.613800: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:148] kernel driver does not appear to be running on this host (clarence-jetson-nano): /proc/driver/nvidia/version does not exist
Training reulst: W=4.765114, b=2.123778
Time taken for learning: 25.613048553466797
Pearson R^2: 0.994371
RMS: 0.119845
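To double-check that the environment variable really hides the GPU (and that the first run really uses it), TensorFlow can report the devices it sees; something along these lines (sketch, using the TF 1.13 APIs from the wheel above):

import tensorflow as tf
from tensorflow.python.client import device_lib

# True for the CUDA run, False when CUDA_VISIBLE_DEVICES="" hides the Tegra GPU.
print("GPU available:", tf.test.is_gpu_available())
print("Devices:", [d.name for d in device_lib.list_local_devices()])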
Maybe the platform is not meant to be used for training? Or is this toy model just too small for the GPU to pay off, with per-step overhead (kernel launches and host-to-device copies) dominating the run time?
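To test that suspicion, I plan to time a workload that is actually big enough to keep the GPU busy, e.g. repeated large matmuls (sketch; the 4096 size and 10 iterations are arbitrary):

import time
import tensorflow as tf

a = tf.random_normal((4096, 4096))
b = tf.random_normal((4096, 4096))
c = tf.reduce_sum(tf.matmul(a, b))

# log_device_placement shows whether the matmul lands on GPU:0 or the CPU.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(c)  # warm-up (graph setup, CUDA init)
    start = time.time()
    for _ in range(10):
        sess.run(c)
    print("10 large matmuls took", time.time() - start, "seconds")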