Hello,
I am trying to do transfer learning.
I set up the environment using the address below.
[ Transfer Learning Toolkit for Video Streaming Analytics | NVIDIA NGC ]
While running this example notebook,
[ /workspace/examples/ssd/ssd.ipynb ]
an error occurred at this step:
[ step 6. Retrain pruned models ]
Using TensorFlow backend.
Using TensorFlow backend.
2020-04-07 03:43:22.042605: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-07 03:43:22.042605: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Traceback (most recent call last):
File "/usr/local/bin/tlt-train-g1", line 8, in <module>
sys.exit(main())
File "./common/magnet_train.py", line 32, in main
File "./ssd/scripts/train.py", line 36, in <module>
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1551, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 676, in __init__
self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'visible_device_list' listed an invalid GPU id '1' but visible device count is 1
2020-04-07 03:43:22.129983: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-07 03:43:22.131279: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x6f14570 executing computations on platform CUDA. Devices:
2020-04-07 03:43:22.131300: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): TITAN Xp, Compute Capability 6.1
2020-04-07 03:43:22.132997: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2020-04-07 03:43:22.133303: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x702fbe0 executing computations on platform Host. Devices:
2020-04-07 03:43:22.133323: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2020-04-07 03:43:22.133440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
totalMemory: 11.91GiB freeMemory: 11.55GiB
2020-04-07 03:43:22.133457: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-04-07 03:43:22.133969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-07 03:43:22.133982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-04-07 03:43:22.133990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-04-07 03:43:22.134047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11235 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:01:00.0, compute capability: 6.1)
2020-04-07 03:43:22,135 [INFO] iva.ssd.scripts.train: Loading experiment spec at /workspace/examples/ssd/specs/ssd_retrain_resnet18_kitti.txt.
2020-04-07 03:43:22,136 [INFO] /usr/local/lib/python2.7/dist-packages/iva/ssd/utils/spec_loader.pyc: Merging specification from /workspace/examples/ssd/specs/ssd_retrain_resnet18_kitti.txt
WARNING:tensorflow:From ./detectnet_v2/dataloader/utilities.py:114: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and:
`tf.data.TFRecordDataset(path)`
2020-04-07 03:43:22,141 [WARNING] tensorflow: From ./detectnet_v2/dataloader/utilities.py:114: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and:
`tf.data.TFRecordDataset(path)`
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/data/ops/iterator_ops.py:358: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-04-07 03:43:22,193 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/data/ops/iterator_ops.py:358: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[23593,1],1]
Exit code: 1
--------------------------------------------------------------------------
I didn’t modify anything.
What else should I do to fix this?
Thanks