Surprised at how slow Xavier is at training a small regression model compared to an x86 with no GPU. Maybe something is wrong?

Hi, I just got my Xavier up and running and wanted to test it with a very simple, small regression problem using Keras and TensorFlow. It trains on just 264 data points with 4-fold cross-validation. The model has only three 64-node ReLU layers and an output layer.
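
For reference, here is a minimal sketch of how a 4-fold split over 264 samples partitions the data. It assumes contiguous, unshuffled folds; the actual Test_Tensorflow.py is not shown in this thread, so the function name `k_fold_indices` and the splitting scheme are hypothetical and purely illustrative.

```python
def k_fold_indices(num_samples, k):
    """Yield (fold, train_idx, val_idx) for contiguous, unshuffled folds.

    Hypothetical helper: the real script may shuffle or split differently.
    """
    fold_size = num_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder so every sample is validated once.
        stop = num_samples if fold == k - 1 else start + fold_size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, num_samples))
        yield fold, train_idx, val_idx


if __name__ == "__main__":
    for fold, train_idx, val_idx in k_fold_indices(264, 4):
        print("processing k-fold #", fold,
              "train:", len(train_idx), "val:", len(val_idx))
```

Under these assumptions each fold trains on 198 samples and validates on 66, so every individual training run is a very small workload.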

I got the Xavier because I thought its GPU would be much faster than a simple x86 processor with no GPU. However, the modeling time on the x86 for this problem is 8.06 s, whereas the Xavier takes 32.17 s for the exact same operation. Does this sound odd? I’m running the Xavier in nvpmodel -m 0 mode.

In addition, TensorFlow is generating some console output on the Xavier that I have not seen before; maybe this is a clue as to what is happening.

Any ideas or similar experience from anyone?

Thanks!

Scott

output:
nvidia@jetson-0423018054460:~/Desktop/TestTensorflow$ python3 Test_Tensorflow.py
Using TensorFlow backend.
processing k-fold # 0
2019-01-03 20:36:25.755075: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] ARM64 does not support NUMA - returning NUMA node zero
2019-01-03 20:36:25.755304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Xavier major: 7 minor: 2 memoryClockRate(GHz): 1.5
pciBusID: 0000:00:00.0
totalMemory: 15.45GiB freeMemory: 10.41GiB
2019-01-03 20:36:25.755360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-03 20:36:26.431348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-03 20:36:26.431494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-03 20:36:26.431539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-03 20:36:26.431793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9895 MB memory) -> physical GPU (device: 0, name: Xavier, pci bus id: 0000:00:00.0, compute capability: 7.2)
processing k-fold # 1
processing k-fold # 2
processing k-fold # 3
modeling duration = 32.177246 s

Hi sidener2002, I’m not familiar with this code in particular, so maybe another poster has input on the TensorFlow side. However, here are some general observations: