Utilizing DIGITS, KITTI & COCO data with DetectNet

I built a DIGITS server.
Specifications: 2 GPUs (NVIDIA GeForce 1080 Ti), Intel i7-7700, 32 GB memory.

I followed three examples.
Training on the MNIST dataset works normally.
However, the other two examples fail.

Training DetectNet with COCO data:
https://github.com/dusty-nv/jetson-inference/blob/master/docs/detectnet-training.md#detection-data-formatting-in-digits

Training DetectNet with KITTI data:
https://github.com/NVIDIA/DIGITS/blob/master/examples/object-detection/README.md

In both cases I encountered the following error during training.
####################
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 534, in run
    result = self._run(*self.args, **self.kwargs)
  File "/root/digits/digits/model/tasks/train.py", line 219, in hw_socketio_updater
    nvml_info = device_query.get_nvml_info(index)
  File "/root/digits/digits/device_query.py", line 252, in get_nvml_info
    raise RuntimeError('nvmlInit() failed with error #%s' % rc)
RuntimeError: nvmlInit() failed with error #15
<Greenlet at 0x7f8a715f5410: <bound method CaffeTrainTask.hw_socketio_updater of <digits.model.tasks.caffe_train.CaffeTrainTask object at 0x7f8a716c2790>>(['0', '1']) failed with RuntimeError
#######################

What should I do?

I tried with one GPU and with two, but got the same error.
Does this mean the GPU is lost? Do you know why?

Hi,

The "nvmlInit() failed" message means your driver or CUDA installation may not be set up properly. If you are running the DIGITS container, can you reply with the container name (and its tag)? Also, please run nvidia-smi in your host environment and tell us the result. Because the error actually comes from NVML, not from DIGITS itself, we need to dig deeper to get more information.
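
To see what the numeric code in the traceback stands for, something like the following may help. It is only a rough sketch: the table is a partial copy of the nvmlReturn_t values from nvml.h, and the helper names describe_nvml_code and try_nvml_init are made up for this example and are not part of DIGITS. It assumes Linux, where the NVML library is usually installed as libnvidia-ml.so.1.

```python
import ctypes

# Partial copy of the nvmlReturn_t enum from nvml.h (not the full list).
NVML_RETURN_CODES = {
    0: "NVML_SUCCESS",
    1: "NVML_ERROR_UNINITIALIZED",
    9: "NVML_ERROR_DRIVER_NOT_LOADED",
    12: "NVML_ERROR_LIBRARY_NOT_FOUND",
    15: "NVML_ERROR_GPU_IS_LOST",
    16: "NVML_ERROR_RESET_REQUIRED",
    999: "NVML_ERROR_UNKNOWN",
}


def describe_nvml_code(rc):
    """Return the symbolic name of an NVML return code if it is in the table."""
    return NVML_RETURN_CODES.get(rc, "code %d not in this table, see nvml.h" % rc)


def try_nvml_init(lib_name="libnvidia-ml.so.1"):
    """Call nvmlInit_v2() directly.

    Returns the NVML return code, or None if the library cannot be loaded
    (for example when no NVIDIA driver is installed). The library name is an
    assumption for a typical Linux install.
    """
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None
    rc = lib.nvmlInit_v2()
    if rc == 0:
        # Only shut down if initialization actually succeeded.
        lib.nvmlShutdown()
    return rc


if __name__ == "__main__":
    rc = try_nvml_init()
    if rc is None:
        print("NVML library not found; is the driver installed?")
    else:
        print("nvmlInit returned %d (%s)" % (rc, describe_nvml_code(rc)))
```

With this table, error #15 from the traceback maps to NVML_ERROR_GPU_IS_LOST, which usually points at the GPU dropping off the bus rather than at DIGITS.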

Hi CALL-151,

Thank you for the reply.
I found the cause.
It was a hardware problem: there was bubble wrap around the GPU.
I think the GPU was dropping out because it overheated.
After removing the bubble wrap, everything works normally.
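
In case it helps someone with a similar overheating problem, here is a rough sketch for watching GPU temperatures during training. It calls nvidia-smi with its --query-gpu and --format options and parses the CSV output. The function names, the 85 °C limit, and the sample text at the bottom are assumptions made up for illustration, not values from my machine. It needs Python 3 because of subprocess.run.

```python
import shutil
import subprocess

# nvidia-smi query that prints one CSV line per GPU: index, name, temperature in C.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_temps(csv_text):
    """Parse 'index, name, temperature' lines into (int, str, int or None) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Skip anything that is not a three-field data row (e.g. error messages).
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        index, name, temp = parts
        try:
            temp_c = int(temp)
        except ValueError:
            temp_c = None  # temperature not reported for this GPU
        rows.append((int(index), name, temp_c))
    return rows


def hot_gpus(rows, limit_c=85):
    """Return indices of GPUs at or above limit_c (the limit is an assumed value)."""
    return [idx for idx, _name, temp in rows if temp is not None and temp >= limit_c]


def read_gpu_temps():
    """Run nvidia-smi and parse its output; returns [] if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        QUERY, stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True
    )
    if result.returncode != 0:
        return []
    return parse_gpu_temps(result.stdout)


if __name__ == "__main__":
    # Made-up sample text showing the expected CSV shape.
    sample = "0, GeForce GTX 1080 Ti, 62\n1, GeForce GTX 1080 Ti, 91\n"
    print(parse_gpu_temps(sample))
    print("GPUs over limit:", hot_gpus(read_gpu_temps()))
```

If a GPU has already dropped out, nvidia-smi itself may fail, in which case read_gpu_temps simply returns an empty list.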
