GPUs hang when executing NIM docker container on a 4xA100

Description

When running the NIM Docker container with the script below, it works normally if only one GPU is available or a single GPU is specified. If multiple GPUs are available, the container hangs and GPU utilization stays pinned at 100%.

# Choose a container name for bookkeeping
export CONTAINER_NAME=llama3-8b-instruct

# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/meta/${CONTAINER_NAME}:1.0.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
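Since the report says the container runs normally when a single GPU is specified, a possible interim workaround is to pin the container to one device instead of using `--gpus all`. This is only a sketch of that change; the device index `0` is an illustrative choice, not taken from the original report:

```shell
# Workaround sketch: restrict the container to a single GPU.
# Docker's --gpus flag accepts a device selector; the extra quoting
# protects the inner double quotes from the shell.
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus '"device=0"' \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
```

This only sidesteps the hang by avoiding multi-GPU execution; it does not diagnose the underlying multi-GPU issue.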

Environment

TensorRT Version:
GPU Type: A100
Nvidia Driver Version: 535.161.07
CUDA Version: 12.2
CUDNN Version:
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
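When filling in the environment details above, the GPU interconnect topology and NCCL logs are often useful for multi-GPU hangs. A hedged sketch of what could be collected (this assumes `nvidia-smi` is on the PATH; `NCCL_DEBUG` is a standard NCCL environment variable, and enabling it here is a suggestion, not part of the original report):

```shell
# Show the interconnect topology between the four A100s (NVLink vs PCIe paths)
nvidia-smi topo -m

# Re-run the container with NCCL debug logging enabled to see where it stalls
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -e NCCL_DEBUG=INFO \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
```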

Is your error similar to this one?
https://github.com/NVIDIA/nim-deploy/issues/18

Hi @yirunw ,
Please raise your concern on the NVIDIA/nim-deploy GitHub issue tracker: https://github.com/NVIDIA/nim-deploy/issues

Thanks