I want to profile the training of a VGG model on 2 GPUs.
I used the suggested command, and both GPUs were indeed used:
root@5327232ca894:/data/learning/tf/nv-cnn# mpiexec --allow-run-as-root --bind-to socket -np 2 -x CUDA_VISIBLE_DEVICES=0,1 numactl -N 0 -m 0 python vgg.py --layers 16 -b 32 -u batch -i 100 --data_dir=/data/learning/tf/models/research/inception/inception/data/ILSVRC2012/ --log_dir=/data/learning/tmp/
...
2019-07-07 15:39:08.978308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 1
2019-07-07 15:39:09.569105: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-07 15:39:09.569174: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-07-07 15:39:09.569185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-07-07 15:39:09.569444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11396 MB memory) -> physical GPU (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:00:07.0, compute capability: 6.0)
2019-07-07 15:39:09.572354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-07 15:39:09.572414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 1
2019-07-07 15:39:09.572425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: N
2019-07-07 15:39:09.572660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11396 MB memory) -> physical GPU (device: 1, name: Tesla P100-SXM2-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
[5327232ca894:22146] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[5327232ca894:22146] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
...
However, I didn't find any trace data for the second GPU. Is this because of the MPI launch method? How can I profile this setup correctly?
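
One possible explanation, offered as an assumption rather than a confirmed diagnosis: mpiexec starts one Python process per rank, and in the log above each process registers its own physical GPU as device:GPU:0. If only one process is traced, or both write to the same trace file, the second GPU would be missing from the result. Below is a minimal sketch of giving each rank its own trace file. It assumes the job is launched with Open MPI, which exports OMPI_COMM_WORLD_RANK to every process; the helper name and file name pattern are hypothetical and not part of vgg.py.

```python
import os


def per_rank_trace_path(log_dir, prefix="timeline"):
    """Return a trace file path that is unique to the current MPI rank.

    Assumption: the process was started by Open MPI's mpiexec, which sets
    OMPI_COMM_WORLD_RANK. Without that variable we fall back to rank 0.
    """
    rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
    # Hypothetical naming scheme: <prefix>_rank<N>.json inside log_dir.
    return os.path.join(log_dir, "{}_rank{}.json".format(prefix, rank))


if __name__ == "__main__":
    # Under "mpiexec -np 2", each rank prints a different path.
    print(per_rank_trace_path("/data/learning/tmp/"))
```

With a path like this, whatever writes the trace inside the training script would produce one file per process, so the two GPUs could be inspected separately.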