MIG: Failed to attach MIG instance to a container on one specific 8x A100 GPU server on Ubuntu

Failed to attach a MIG instance to a container.

  • MIG instances attach successfully on 3 of the GPU servers; only 1 GPU server fails to attach.

OS: Ubuntu 18.04.3 LTS
GPU Driver Version: 460.106.00
Container runtime: containerd

Output of nvidia-container-runtime -v:
runc version 1.0.0-rc95
spec: 1.0.2-dev
go: go1.14.15
libseccomp: 2.5.1

Error
ctr: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: 0:0: unknown device: unknown
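
For context, containers with a MIG device are launched through ctr with nvidia-container-runtime roughly along the lines below. This is a sketch of the usual containerd + nvidia-container-runtime invocation, not the exact command used here; the image tag and container name are placeholders, and the device value is copied from the nvidia-smi listing further down. NVIDIA_VISIBLE_DEVICES also accepts the <GPU index>:<MIG index> form, which would match the "0:0" in the error above.

# start a throwaway test container, using nvidia-container-runtime as the runc binary
sudo ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=MIG-GPU-a8ca5655-0d82-ecda-644b-99b125e184d5/7/0 \
    docker.io/nvidia/cuda:11.0-base mig-test \
    nvidia-smi -L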

The MIG instances are recognized correctly at the OS level via nvidia-smi, as shown below:
GPU 0: A100-SXM-80GB (UUID: GPU-a8ca5655-0d82-ecda-644b-99b125e184d5)
MIG 1g.10gb Device 0: (UUID: MIG-GPU-a8ca5655-0d82-ecda-644b-99b125e184d5/7/0)
MIG 1g.10gb Device 1: (UUID: MIG-GPU-a8ca5655-0d82-ecda-644b-99b125e184d5/8/0)
MIG 1g.10gb Device 2: (UUID: MIG-GPU-a8ca5655-0d82-ecda-644b-99b125e184d5/9/0)
MIG 1g.10gb Device 3: (UUID: MIG-GPU-a8ca5655-0d82-ecda-644b-99b125e184d5/11/0)
MIG 1g.10gb Device 4: (UUID: MIG-GPU-a8ca5655-0d82-ecda-644b-99b125e184d5/12/0)
MIG 1g.10gb Device 5: (UUID: MIG-GPU-a8ca5655-0d82-ecda-644b-99b125e184d5/13/0)
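
(The listing above matches the nvidia-smi -L format.) As an additional cross-check on the failing host, the MIG layout can also be listed with the nvidia-smi MIG subcommands available with this driver version:

# list the GPU instances and compute instances the driver knows about
nvidia-smi mig -lgi
nvidia-smi mig -lci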

nvidia-container-cli list output on the failing server:

root@ske-cl-mlops-g-6b9d88bdc9-87tlp:~# nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/dev/nvidia1
/dev/nvidia2
/dev/nvidia3
/dev/nvidia4
/dev/nvidia5
/dev/nvidia6
/dev/nvidia7
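
Note that only the plain /dev/nvidiaN device nodes appear here: none of the /dev/nvidia-caps* nodes or /proc/driver/nvidia/capabilities entries reported by the working server (see the next listing) are present. A quick way to check this directly on the failing host, using the paths visible in the working server's output:

# are the MIG capability nodes and /proc entries present on the failing host?
ls -l /dev/nvidia-caps
ls /proc/driver/nvidia/capabilities/gpu0/mig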

nvidia-container-cli list output on a GPU server where MIG attaches successfully:

root@ske-cl-a100-dev-7d47d6dd88-9ck6x:~# nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/proc/driver/nvidia/capabilities/gpu0/mig/gi7/access
/proc/driver/nvidia/capabilities/gpu0/mig/gi7/ci0/access
/dev/nvidia-caps/nvidia-cap66
/dev/nvidia-caps/nvidia-cap67
/proc/driver/nvidia/capabilities/gpu0/mig/gi8/access
/proc/driver/nvidia/capabilities/gpu0/mig/gi8/ci0/access
/dev/nvidia-caps/nvidia-cap75
/dev/nvidia-caps/nvidia-cap76
/proc/driver/nvidia/capabilities/gpu0/mig/gi9/access
/proc/driver/nvidia/capabilities/gpu0/mig/gi9/ci0/access
/dev/nvidia-caps/nvidia-cap84
/dev/nvidia-caps/nvidia-cap85
/proc/driver/nvidia/capabilities/gpu0/mig/gi10/access
/proc/driver/nvidia/capabilities/gpu0/mig/gi10/ci0/access
/dev/nvidia-caps/nvidia-cap93
/dev/nvidia-caps/nvidia-cap94
/proc/driver/nvidia/capabilities/gpu0/mig/gi11/access
/proc/driver/nvidia/capabilities/gpu0/mig/gi11/ci0/access
/dev/nvidia-caps/nvidia-cap102
/dev/nvidia-caps/nvidia-cap103
/proc/driver/nvidia/capabilities/gpu0/mig/gi12/access
/proc/driver/nvidia/capabilities/gpu0/mig/gi12/ci0/access
/dev/nvidia-caps/nvidia-cap111
/dev/nvidia-caps/nvidia-cap112
/proc/driver/nvidia/capabilities/gpu0/mig/gi13/access
/proc/driver/nvidia/capabilities/gpu0/mig/gi13/ci0/access
/dev/nvidia-caps/nvidia-cap120
/dev/nvidia-caps/nvidia-cap121
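
If more detail is needed from the failing server, nvidia-container-cli can be rerun with kernel-module loading and debug output enabled (its standard -k/--load-kmods and -d/--debug options); this is only a suggested diagnostic, with /dev/tty as an example debug target:

# re-run device enumeration on the failing host with debug logging
sudo nvidia-container-cli -k -d /dev/tty info
sudo nvidia-container-cli -k -d /dev/tty list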

Hi,

This forum mainly covers updates and issues related to TensorRT.
Your issue doesn't look TensorRT related.
We recommend you reach out to the relevant forum to get better help.

Thank you.