Failed to attach MIG instance to container
- MIG instances attach successfully on 3 of the GPU servers; however, 1 GPU server fails to attach.
OS: Ubuntu 18.04.3 LTS
GPU Driver Version: 460.106.00
Container runtime: containerd
Output of nvidia-container-runtime -v:
runc version 1.0.0-rc95
spec: 1.0.2-dev
go: go1.14.15
libseccomp: 2.5.1
Error
ctr: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: 0:0: unknown device: unknown
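The `0:0` in the error is the `<gpu>:<mig-index>` device selector, which suggests the MIG device was requested by index rather than UUID. A minimal repro sketch, assuming containerd's default runtime binary has been pointed at nvidia-container-runtime in /etc/containerd/config.toml; the image tag and container ID below are placeholders:

```shell
# Hypothetical repro: request MIG device 0 on GPU 0 by index.
# The MIG UUID from `nvidia-smi -L` can be used in place of "0:0".
ctr image pull docker.io/nvidia/cuda:11.2.2-base-ubuntu18.04
ctr run --rm \
    --env NVIDIA_VISIBLE_DEVICES=0:0 \
    docker.io/nvidia/cuda:11.2.2-base-ubuntu18.04 mig-test nvidia-smi -L
```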
MIG devices are recognized correctly at the OS level via nvidia-smi, as shown below:
GPU 0: A100-SXM-80GB (UUID: GPU-a8ca5655-0d82-ecda-644b-99b125e184d5)
MIG 1g.10gb Device 0: (UUID: MIG-GPU-a8ca5655-0d82-ecda-644b-99b125e184d5/7/0)
MIG 1g.10gb Device 1: (UUID: MIG-GPU-a8ca5655-0d82-ecda-644b-99b125e184d5/8/0)
MIG 1g.10gb Device 2: (UUID: MIG-GPU-a8ca5655-0d82-ecda-644b-99b125e184d5/9/0)
MIG 1g.10gb Device 3: (UUID: MIG-GPU-a8ca5655-0d82-ecda-644b-99b125e184d5/11/0)
MIG 1g.10gb Device 4: (UUID: MIG-GPU-a8ca5655-0d82-ecda-644b-99b125e184d5/12/0)
MIG 1g.10gb Device 5: (UUID: MIG-GPU-a8ca5655-0d82-ecda-644b-99b125e184d5/13/0)
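(For context, a sketch of how 1g.10gb instances like these are typically created with the standard nvidia-smi mig commands; run on the GPU node, and note that enabling MIG mode may require draining workloads and resetting the GPU:)

```shell
# Enable MIG mode on GPU 0, create 1g.10gb GPU instances (each with its
# default compute instance via -C), then list the resulting instances.
sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -cgi 1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb,1g.10gb -C
sudo nvidia-smi mig -lgi
```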
Output of nvidia-container-cli list on the failing server (a diagnostic check follows the listing):
root@ske-cl-mlops-g-6b9d88bdc9-87tlp:~# nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/dev/nvidia1
/dev/nvidia2
/dev/nvidia3
/dev/nvidia4
/dev/nvidia5
/dev/nvidia6
/dev/nvidia7
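The failing server's list stops at the plain /dev/nvidia* nodes: the MIG capability entries that appear on the working server below (/proc/driver/nvidia/capabilities/... and /dev/nvidia-caps/...) are missing entirely. A diagnostic sketch for the failing node, using the paths taken from the working server's output:

```shell
# If either listing is missing or empty here, the MIG capability device
# nodes were never created on this node, which matches the
# "unknown device" error from nvidia-container-cli.
ls -l /proc/driver/nvidia/capabilities/gpu0/mig/
ls -l /dev/nvidia-caps/
```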
Output of nvidia-container-cli list on a server where MIG attached successfully:
root@ske-cl-a100-dev-7d47d6dd88-9ck6x:~# nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/proc/driver/nvidia/capabilities/gpu0/mig/gi7/access
/proc/driver/nvidia/capabilities/gpu0/mig/gi7/ci0/access
/dev/nvidia-caps/nvidia-cap66
/dev/nvidia-caps/nvidia-cap67
/proc/driver/nvidia/capabilities/gpu0/mig/gi8/access
/proc/driver/nvidia/capabilities/gpu0/mig/gi8/ci0/access
/dev/nvidia-caps/nvidia-cap75
/dev/nvidia-caps/nvidia-cap76
/proc/driver/nvidia/capabilities/gpu0/mig/gi9/access
/proc/driver/nvidia/capabilities/gpu0/mig/gi9/ci0/access
/dev/nvidia-caps/nvidia-cap84
/dev/nvidia-caps/nvidia-cap85
/proc/driver/nvidia/capabilities/gpu0/mig/gi10/access
/proc/driver/nvidia/capabilities/gpu0/mig/gi10/ci0/access
/dev/nvidia-caps/nvidia-cap93
/dev/nvidia-caps/nvidia-cap94
/proc/driver/nvidia/capabilities/gpu0/mig/gi11/access
/proc/driver/nvidia/capabilities/gpu0/mig/gi11/ci0/access
/dev/nvidia-caps/nvidia-cap102
/dev/nvidia-caps/nvidia-cap103
/proc/driver/nvidia/capabilities/gpu0/mig/gi12/access
/proc/driver/nvidia/capabilities/gpu0/mig/gi12/ci0/access
/dev/nvidia-caps/nvidia-cap111
/dev/nvidia-caps/nvidia-cap112
/proc/driver/nvidia/capabilities/gpu0/mig/gi13/access
/proc/driver/nvidia/capabilities/gpu0/mig/gi13/ci0/access
/dev/nvidia-caps/nvidia-cap120
/dev/nvidia-caps/nvidia-cap121
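Comparing the two listings: the working node exposes a gi*/access and gi*/ci0/access pair plus two matching /dev/nvidia-caps/nvidia-cap* nodes per MIG device, all of which are absent from the failing node. One more hedged check, assuming the driver publishes its MIG capability minor numbers at the path below (as described in NVIDIA's MIG documentation):

```shell
# Maps each MIG capability (e.g. gpu0/gi7/access) to the minor number of its
# /dev/nvidia-caps node. If this table exists but the /dev/nvidia-caps nodes
# do not, the nodes were not created (normally done by udev or nvidia-modprobe).
cat /proc/driver/nvidia-caps/mig-minors
```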