ValueError: this machine only has: ['/cpu:0', '/gpu:0']

Hi, everyone!
I want to use Keras with multiple GPUs to train a network on the PX2, but I get the following ValueError:

ValueError: To call `multi_gpu_model` with `gpus=2`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1']. However this machine only has: ['/cpu:0', '/gpu:0']. Try reducing `gpus`.
Using TensorFlow backend.
nvrm_gpu: Bug 200215060 workaround enabled.
2019-03-15 08:06:43.803666: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2019-03-15 08:06:43.804020: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties: 
name: Graphics Device major: 6 minor: 1 memoryClockRate(GHz): 1.29
pciBusID: 0000:04:00.0
totalMemory: 3.75GiB freeMemory: 3.68GiB
2019-03-15 08:06:43.901551: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:865] ARM64 does not support NUMA - returning NUMA node zero
2019-03-15 08:06:43.901759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 1 with properties: 
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.51GiB freeMemory: 3.52GiB
2019-03-15 08:06:43.901883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1227] Device peer to peer matrix
2019-03-15 08:06:43.903048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1233] DMA: 0 1 
2019-03-15 08:06:43.903098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 0:   Y N 
2019-03-15 08:06:43.903141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1243] 1:   N Y 
2019-03-15 08:06:43.903226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1297] Ignoring visible gpu device (device: 1, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) with Cuda multiprocessor count: 2. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
2019-03-15 08:06:43.903274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2019-03-15 08:06:46.437915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3411 MB memory) -> physical GPU (device: 0, name: Graphics Device, pci bus id: 0000:04:00.0, compute capability: 6.1)
Create YOLOv3 model with 9 anchors and 13 classes.
Traceback (most recent call last):
  File "train.py", line 199, in <module>
    _main()
  File "train.py", line 36, in _main
    parallel_model=multi_gpu_model(model,gpus=2)
  File "/usr/local/lib/python3.5/dist-packages/keras/utils/training_utils.py", line 138, in multi_gpu_model
    available_devices))
ValueError: To call `multi_gpu_model` with `gpus=2`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1']. However this machine only has: ['/cpu:0', '/gpu:0']. Try reducing `gpus`.
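One way to avoid this crash is to count the GPUs TensorFlow actually exposes before calling `multi_gpu_model`, and only wrap the model when two or more are visible. A minimal sketch (the `count_visible_gpus` helper is an illustration, not Keras API; the device-name format is taken from the error message above, and `model` is the one built in train.py):

```python
def count_visible_gpus(available_devices):
    """Count GPU entries in a device list like ['/cpu:0', '/gpu:0'].

    This mirrors the list shown in the ValueError above; newer TensorFlow
    versions may report names such as '/device:GPU:0' instead, which this
    substring check also matches.
    """
    return sum(1 for d in available_devices if 'gpu:' in d.lower())

# On this PX 2, TensorFlow exposes only one usable GPU:
n_gpus = count_visible_gpus(['/cpu:0', '/gpu:0'])

# Only wrap the model when two or more GPUs are visible:
# parallel_model = multi_gpu_model(model, gpus=n_gpus) if n_gpus > 1 else model
```

Note that the log above also says the Tegra iGPU was ignored because it has only 2 multiprocessors (minimum 8), and that `TF_MIN_GPU_MULTIPROCESSOR_COUNT` can lower that threshold, e.g. `TF_MIN_GPU_MULTIPROCESSOR_COUNT=2 python train.py`. Even then, splitting a batch across two GPUs this mismatched is unlikely to speed up training.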

Dear limengyang1995,

DRIVE PX 2 is a good platform for DL inference, not for DL training.
Could you please check the iGPU/dGPU status with the deviceQuery command?

nvidia@tegra-ubuntu:~/NVIDIA_CUDA-9.2_Samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

<b>Detected 2 CUDA Capable device(s)</b>

<b>Device 0: "DRIVE PX 2 AutoChauffeur"</b>
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 3840 MBytes (4026466304 bytes)
  ( 9) Multiprocessors, (128) CUDA Cores/MP:     1152 CUDA Cores
  GPU Max Clock rate:                            1290 MHz (1.29 GHz)
  Memory Clock rate:                             3003 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

<b>Device 1: "NVIDIA Tegra X2"</b>
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    6.2
  Total amount of global memory:                 6402 MBytes (6712545280 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP:     256 CUDA Cores
  GPU Max Clock rate:                            1275 MHz (1.27 GHz)
  Memory Clock rate:                             1600 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
<b>  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from DRIVE PX 2 AutoChauffeur (GPU0) -> NVIDIA Tegra X2 (GPU1) : No
> Peer access from NVIDIA Tegra X2 (GPU1) -> DRIVE PX 2 AutoChauffeur (GPU0) : No</b>

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 2
Result = PASS

Dear SteveNV,

Thanks for your reply!

There are some other questions that confuse me:

(1) How can I know the usage rate of the iGPU/dGPU?

(2) Can I run TensorRT with Python on the PX2? I cannot find a TensorRT Python package that supports arm64.

Dear limengyang1995,

Please refer to the following links for your questions.

  1. https://devtalk.nvidia.com/default/topic/1046662/general/tensorflow-not-using-gpu-on-drive-px2/

  2. https://devtalk.nvidia.com/default/topic/1048162/general/how-can-i-create-a-custom-tf-split-tensorrt-layer-to-run-yolo-v3-tiny-on-px2-with-driveworks-/

Dear limengyang1995,
Adding to Steve's comments,

  1. How can I know the usage rate of igpu/dgpu?

We have the tegrastats tool to show the usage of the iGPU and dGPU. On DRIVE PX 2 it has an issue where it only shows iGPU utilization; the fix was released in our latest Drive release (Drive 8.0), which is targeted at the Drive AGX platform. So, on DRIVE PX 2, tegrastats does not report dGPU utilization correctly. Please follow this thread: https://devtalk.nvidia.com/default/topic/1036238/general/can-t-detect-the-dgpu-utilization-through-tegrastat/
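As a workaround for reading iGPU load directly, Tegra-based boards often expose a load counter via sysfs. This is a hypothetical sketch: the `/sys/devices/gpu.0/load` path and its tenths-of-a-percent encoding are assumptions based on other Tegra platforms and may not exist or may differ on DRIVE PX 2; the function simply returns None when the file is absent.

```python
from pathlib import Path

def igpu_utilization(load_file="/sys/devices/gpu.0/load"):
    """Return iGPU utilization in percent, or None if the sysfs node
    is unavailable on this platform.

    Assumes the file reports load in tenths of a percent
    (e.g. "450" -> 45.0%), as on some Tegra boards.
    """
    p = Path(load_file)
    if not p.exists():
        return None
    return int(p.read_text().strip()) / 10.0
```

Polling this in a loop (e.g. once per second) gives a rough utilization trace; it says nothing about the dGPU, which on PX 2 still needs the tegrastats fix discussed above.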

  2. Can I run TensorRT with Python on the PX2?

No. On DRIVE PX 2, only the TensorRT C++ API is supported.