Error in the Jupyter tutorial for SSD object detection using Transfer Learning Toolkit


I tried to run transfer learning.

I set up the environment using the container at the address below.

[ Transfer Learning Toolkit for Video Streaming Analytics | NVIDIA NGC ]

While running this example notebook:

[ /workspace/examples/ssd/ssd.ipynb ]

an error occurred at this step:

[ step 6. Retrain pruned models ]

Using TensorFlow backend.
Using TensorFlow backend.
2020-04-07 03:43:22.042605: I tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-07 03:43:22.042605: I tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
  File "./common/", line 32, in main
  File "./ssd/scripts/", line 36, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/", line 1551, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/", line 676, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1
2020-04-07 03:43:22.129983: I tensorflow/stream_executor/cuda/] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-07 03:43:22.131279: I tensorflow/compiler/xla/service/] XLA service 0x6f14570 executing computations on platform CUDA. Devices:
2020-04-07 03:43:22.131300: I tensorflow/compiler/xla/service/]   StreamExecutor device (0): TITAN Xp, Compute Capability 6.1
2020-04-07 03:43:22.132997: I tensorflow/core/platform/profile_utils/] CPU Frequency: 3600000000 Hz
2020-04-07 03:43:22.133303: I tensorflow/compiler/xla/service/] XLA service 0x702fbe0 executing computations on platform Host. Devices:
2020-04-07 03:43:22.133323: I tensorflow/compiler/xla/service/]   StreamExecutor device (0): <undefined>, <undefined>
2020-04-07 03:43:22.133440: I tensorflow/core/common_runtime/gpu/] Found device 0 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
totalMemory: 11.91GiB freeMemory: 11.55GiB
2020-04-07 03:43:22.133457: I tensorflow/core/common_runtime/gpu/] Adding visible gpu devices: 0
2020-04-07 03:43:22.133969: I tensorflow/core/common_runtime/gpu/] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-07 03:43:22.133982: I tensorflow/core/common_runtime/gpu/]      0 
2020-04-07 03:43:22.133990: I tensorflow/core/common_runtime/gpu/] 0:   N 
2020-04-07 03:43:22.134047: I tensorflow/core/common_runtime/gpu/] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11235 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:01:00.0, compute capability: 6.1)
2020-04-07 03:43:22,135 [INFO] iva.ssd.scripts.train: Loading experiment spec at /workspace/examples/ssd/specs/ssd_retrain_resnet18_kitti.txt.
2020-04-07 03:43:22,136 [INFO] /usr/local/lib/python2.7/dist-packages/iva/ssd/utils/spec_loader.pyc: Merging specification from /workspace/examples/ssd/specs/ssd_retrain_resnet18_kitti.txt
WARNING:tensorflow:From ./detectnet_v2/dataloader/ tf_record_iterator (from is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
2020-04-07 03:43:22,141 [WARNING] tensorflow: From ./detectnet_v2/dataloader/ tf_record_iterator (from is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/data/ops/ colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-04-07 03:43:22,193 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/data/ops/ colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[23593,1],1]
  Exit code:    1

I didn’t modify anything.

What else do I need to do?


How many GPUs are in your PC?

Only one (TITAN Xp):

| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  TITAN Xp            Off  | 00000000:01:00.0 Off |                  N/A |
| 23%   30C    P8     9W / 250W |    210MiB / 12192MiB |      1%      Default |
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0      1156      G   /usr/lib/xorg/Xorg                            84MiB |
|    0      1701      G   /usr/bin/gnome-shell                         122MiB |

After changing the number of GPUs in the training command to match my single GPU, it works normally.

Thank you
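
For anyone who hits the same `'visible_device_list' listed an invalid GPU id '1' but visible device count is 1` message: it appears when more GPUs are requested than the machine exposes. Below is a minimal sketch of a check that could be run before training. It assumes `nvidia-smi -L` prints one `GPU <n>: ...` line per device; the function names are hypothetical helpers and not part of TLT, and the requested count of 2 is only inferred from the error above.

```python
import re
import subprocess


def count_gpus(listing):
    """Count devices in the text printed by `nvidia-smi -L`.

    Assumes one line per device of the form 'GPU <index>: <name> (UUID: ...)'.
    """
    return sum(1 for line in listing.splitlines()
               if re.match(r"\s*GPU \d+:", line))


def usable_gpu_count(requested, available):
    """Return a GPU count that does not exceed what the machine exposes."""
    if available < 1:
        raise RuntimeError("no GPU is visible on this machine")
    return min(requested, available)


def query_gpu_listing():
    """Run `nvidia-smi -L` and return its output as text (needs the NVIDIA driver)."""
    return subprocess.check_output(["nvidia-smi", "-L"]).decode()
```

On a machine with the driver installed, `usable_gpu_count(2, count_gpus(query_gpu_listing()))` gives a value that can be used wherever the notebook sets the number of GPUs for training.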