TensorFlow performance

Does anyone have experience with TensorFlow on the Nano? How is the performance? I just tried a simple script and it runs nearly twice as fast on the CPU as on CUDA.

I followed the steps to install TensorFlow: https://docs.nvidia.com/deeplearning/dgx/install-tf-xavier/index.html (similar to the steps mentioned in another topic here)

Then I tried to run this sample, which is modified from the companion code of the book “TensorFlow for Deep Learning”:
https://github.com/kitsook/dlwithtf/blob/fix-lost-function/ch3/linear_regression_tf.py
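
For reference, the script follows the book’s TF 1.x pattern: generate noisy linear data, fit W and b with a placeholder-fed training loop, and time the loop. A rough sketch of that shape (hyperparameters and names here are my guesses, not copied from the repo):

import time
import numpy as np
import tensorflow as tf

# Synthetic data: y = 5x + 2 plus Gaussian noise (illustrative values).
N = 100
x_np = np.random.rand(N, 1)
y_np = 5.0 * x_np + 2.0 + np.random.normal(scale=0.1, size=(N, 1))

x = tf.placeholder(tf.float32, (N, 1))
y = tf.placeholder(tf.float32, (N, 1))
W = tf.Variable(tf.random_normal((1, 1)))
b = tf.Variable(tf.random_normal((1,)))
y_pred = tf.matmul(x, W) + b
loss = tf.reduce_sum((y - y_pred) ** 2)
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    start = time.time()
    for _ in range(8000):  # step count is a guess
        sess.run(train_op, feed_dict={x: x_np, y: y_np})
    print("Time taken for learning:", time.time() - start)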

I switched Ubuntu to runlevel 3 to free up some memory before the tests:

sudo systemctl isolate multi-user.target
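
On the Nano the GPU has no dedicated VRAM and shares system RAM with the CPU (which is why the log below reports totalMemory: 3.86GiB for the GPU), so dropping the desktop frees memory for both. A quick sanity check from Python:

# Print total and available memory after switching targets.
with open("/proc/meminfo") as f:
    for line in f:
        if line.startswith(("MemTotal", "MemAvailable")):
            print(line.strip())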

Running the code normally on CUDA, the training part took 43s:

(linear_regression_tf.py:8106): Gdk-CRITICAL **: 21:21:38.160: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed

(linear_regression_tf.py:8106): Gdk-CRITICAL **: 21:21:38.164: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-03-27 21:21:41.201561: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2019-03-27 21:21:41.202159: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x3f3f6740 executing computations on platform Host. Devices:
2019-03-27 21:21:41.202218: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): <undefined>, <undefined>
2019-03-27 21:21:41.289107: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:965] ARM64 does not support NUMA - returning NUMA node zero
2019-03-27 21:21:41.289392: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x3d4e3d50 executing computations on platform CUDA. Devices:
2019-03-27 21:21:41.289441: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): NVIDIA Tegra X1, Compute Capability 5.3
2019-03-27 21:21:41.289806: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: NVIDIA Tegra X1 major: 5 minor: 3 memoryClockRate(GHz): 0.9216
pciBusID: 0000:00:00.0
totalMemory: 3.86GiB freeMemory: 1.94GiB
2019-03-27 21:21:41.289863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-27 21:21:42.374481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-27 21:21:42.374562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-03-27 21:21:42.374592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-03-27 21:21:42.374771: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1518 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
2019-03-27 21:21:43.782819: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10.0 locally
Training reulst: W=4.765114, b=2.123778
Time taken for learning: 43.052430391311646
Pearson R^2: 0.994371
RMS: 0.119845
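
To rule out a silent fallback to the CPU, it is worth confirming where the ops actually land; TF 1.x can log per-op placement at session creation. A minimal check (my own snippet, not from the benchmarked script):

import tensorflow as tf

# Each kernel prints its assigned device at session creation, e.g.
# /job:localhost/replica:0/task:0/device:GPU:0
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([[1.0, 2.0]])
    b = tf.constant([[3.0], [4.0]])
    print(sess.run(tf.matmul(a, b)))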

Running the same script on the CPU by setting an environment variable to hide the CUDA device:

CUDA_VISIBLE_DEVICES="" python3 linear_regression_tf.py
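
The same effect can be had from inside the script, provided the variable is set before TensorFlow is first imported; a sketch:

import os

# Must run before the first TensorFlow import, or it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import tensorflow as tf  # now only sees the CPU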

This CPU-only run took 25s, nearly half the GPU time, to complete the training.

(linear_regression_tf.py:8295): Gdk-CRITICAL **: 21:22:35.522: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed

(linear_regression_tf.py:8295): Gdk-CRITICAL **: 21:22:35.526: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-03-27 21:22:38.583032: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2019-03-27 21:22:38.584033: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x9774bd0 executing computations on platform Host. Devices:
2019-03-27 21:22:38.584096: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): <undefined>, <undefined>
2019-03-27 21:22:38.613701: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-03-27 21:22:38.613800: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:148] kernel driver does not appear to be running on this host (clarence-jetson-nano): /proc/driver/nvidia/version does not exist
Training reulst: W=4.765114, b=2.123778
Time taken for learning: 25.613048553466797
Pearson R^2: 0.994371
RMS: 0.119845

Maybe the platform is not meant to be used for training?
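
My own guess (not confirmed): with only a scalar weight and bias, each training step does almost no math, so the run is dominated by per-step kernel-launch and host-device copy overhead, which the CPU does not pay. A micro-benchmark sketch to test that, with made-up sizes; the GPU should only pull ahead at the larger matrices:

import time
import tensorflow as tf

def time_matmul(device, n, iters=50):
    # Time an n x n matmul pinned to the given device.
    with tf.Graph().as_default():
        with tf.device(device):
            a = tf.random_normal((n, n))
            op = tf.matmul(a, a)
        with tf.Session() as sess:
            sess.run(op)  # warm-up
            start = time.time()
            for _ in range(iters):
                sess.run(op)
            return (time.time() - start) / iters

for n in (32, 256, 2048):
    print(n, time_matmul("/cpu:0", n), time_matmul("/device:GPU:0", n))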

Hi,

The Jetson platform is designed for fast inference, so it is not recommended for training.
If you are looking for an AI benchmark for Nano, please check this blog:
https://devblogs.nvidia.com/jetson-nano-ai-computing/

Thanks.