I built a DIGITS server.
Specifications: 2 GPUs (NVIDIA GeForce 1080 Ti), an Intel i7-7700, and 32 GB of memory.
I followed three examples.
The example that trains a model after adding the MNIST dataset works normally.
However, two other examples fail. One is the example that trains DetectNet on COCO data:
https://github.com/dusty-nv/jetson-inference/blob/master/docs/detectnet-training.md#detection-data-formatting-in-digits
The other is the example that trains DetectNet on KITTI data:
https://github.com/NVIDIA/DIGITS/blob/master/examples/object-detection/README.md
With both, I encountered the following error during training:
####################
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 534, in run
    result = self._run(*self.args, **self.kwargs)
  File "/root/digits/digits/model/tasks/train.py", line 219, in hw_socketio_updater
    nvml_info = device_query.get_nvml_info(index)
  File "/root/digits/digits/device_query.py", line 252, in get_nvml_info
    raise RuntimeError('nvmlInit() failed with error #%s' % rc)
RuntimeError: nvmlInit() failed with error #15
<Greenlet at 0x7f8a715f5410: <bound method CaffeTrainTask.hw_socketio_updater of <digits.model.tasks.caffe_train.CaffeTrainTask object at 0x7f8a716c2790>>(['0', '1']) failed with RuntimeError
#######################
What should I do?
I tried with one GPU and with two, but got the same error.
Does this mean the GPU is lost? Do you know why?
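
In case it helps anyone reproduce this outside DIGITS, here is a minimal sketch (not DIGITS' own code) that calls nvmlInit() directly through ctypes and names the return code. The helper names describe_nvml_rc and try_nvml_init and the library name libnvidia-ml.so.1 are my assumptions, and the table is only a subset of the return codes as I understand them from nvml.h, where 15 corresponds to NVML_ERROR_GPU_IS_LOST.

```python
import ctypes

# Subset of nvmlReturn_t values, as I understand them from nvml.h (assumption).
NVML_RETURN_NAMES = {
    0: "NVML_SUCCESS",
    1: "NVML_ERROR_UNINITIALIZED",
    9: "NVML_ERROR_DRIVER_NOT_LOADED",
    12: "NVML_ERROR_LIBRARY_NOT_FOUND",
    15: "NVML_ERROR_GPU_IS_LOST",
    16: "NVML_ERROR_RESET_REQUIRED",
    18: "NVML_ERROR_LIB_RM_VERSION_MISMATCH",
    999: "NVML_ERROR_UNKNOWN",
}


def describe_nvml_rc(rc):
    """Hypothetical helper: turn an nvmlInit() return code into a readable name."""
    if rc is None:
        return "NVML library could not be loaded"
    return NVML_RETURN_NAMES.get(rc, "unlisted NVML return code %d" % rc)


def try_nvml_init(libname="libnvidia-ml.so.1"):
    """Hypothetical helper: call nvmlInit() and return its code.

    Returns None if the shared library cannot be loaded at all.
    The default library name is an assumption for a Linux driver install.
    """
    try:
        nvml = ctypes.cdll.LoadLibrary(libname)
    except OSError:
        return None
    rc = nvml.nvmlInit()
    if rc == 0:
        # Only shut down if initialization actually succeeded.
        nvml.nvmlShutdown()
    return rc
```

Running describe_nvml_rc(try_nvml_init()) on the server should show whether the same code 15 appears without DIGITS involved. A None result would instead mean the NVML library itself could not be loaded.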