Triton Server can't run with GPU

Hello,

I am trying to deploy the models using the Triton Inference Server.

docker run --gpus=1 --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:22.09-py3 tritonserver --model-repository=/models

When I run this command from the Triton Server GitHub instructions to launch the Triton container, I get the following error:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as ‘legacy’
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
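This error means the NVIDIA hook that injects the driver libraries into the container could not load libnvidia-ml.so.1. A quick host-side sanity check can narrow down which layer is broken (a sketch assuming an Ubuntu host; each step falls back to a message if a component is missing, so the script always runs to completion):

```shell
# Is the driver library visible to the dynamic linker on the host?
ldconfig -p 2>/dev/null | grep libnvidia-ml \
  || echo "libnvidia-ml.so.1 not in the linker cache"
# Do the driver userspace tools work at all?
command -v nvidia-smi >/dev/null && nvidia-smi -L \
  || echo "nvidia-smi not found or no GPU visible"
# Does the container CLI that the failing hook calls work outside Docker?
command -v nvidia-container-cli >/dev/null && nvidia-container-cli info \
  || echo "nvidia-container-cli not found (part of the NVIDIA container toolkit)"
```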

If I run it without the GPU option, it works.

What could be the reason for this problem?

Thanks!

Environment:
• Hardware Platform (Jetson / GPU)
GPU
• Triton Server Image
22.09-py3
• CUDA Version
12.0
• Docker Version
24.0.5

Please try:
$ sudo apt-get install -y nvidia-docker2
$ sudo apt install nvidia-driver-525
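If those are already installed, a quick way to confirm is to query the package database (a sketch for Debian/Ubuntu; `check_pkg` is a hypothetical helper, and the package names are the ones from the commands above):

```shell
# Hypothetical helper: print a package's installed version, or a warning.
check_pkg() {
  dpkg -s "$1" 2>/dev/null | grep '^Version' || echo "$1 not installed"
}
check_pkg nvidia-docker2
check_pkg nvidia-driver-525
```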

Hi,

thank you for your answer.

I had already installed both of them; the nvidia-docker2 version is 2.13.0-1 and the driver version is 525.125.06.

Can you find the lib?
$ sudo find / -name libnvidia-ml.so.1

Please add --runtime=nvidia in the docker run command.

Yes, it can be found. Here is the output:

/usr/lib/i386-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
find: ‘/run/user/1000/doc’: Permission denied
find: ‘/run/user/1000/gvfs’: Permission denied
/var/snap/docker/common/var-lib-docker/overlay2/20cdacd0d96be0cd178f108fd419b3e05a943f4956de496986539a41e57d2cf3/diff/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1

When I add --runtime=nvidia, I get another error:

docker: Error response from daemon: Unknown runtime specified nvidia.
See ‘docker run --help’.
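"Unknown runtime specified nvidia" means the Docker daemon has no runtime registered under that name. You can list the runtimes the daemon actually knows about (a sketch; guarded so it degrades gracefully on a machine without Docker):

```shell
# List registered Docker runtimes; "nvidia" should appear here
# once nvidia-docker2 is installed and the daemon has been restarted.
if command -v docker >/dev/null; then
  docker info --format '{{json .Runtimes}}' 2>/dev/null \
    || echo "docker CLI found but the daemon is not reachable"
else
  echo "docker CLI not found"
fi
```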

Could you please double-check whether nvidia-docker is installed on the machine?

Sure.

Please use the commands mentioned in Error while running action recognition net - #9 by Morganh and retry.

Hi Morganh,

I’ve retried these commands; they had actually been run before, and the issue still exists.

Can you try the commands below and share the result?
$ docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

and
$ sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

The results are the same as before.

docker: Error response from daemon: Unknown runtime specified nvidia.
See ‘docker run --help’.

Please try to
sudo apt install -y nvidia-docker2
sudo systemctl daemon-reload
sudo systemctl restart docker

Refer to https://github.com/NVIDIA/nvidia-docker/issues/838
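For reference, installing nvidia-docker2 is what registers the runtime: the package ships an /etc/docker/daemon.json like the one below, which dockerd reads when it restarts (this is the stock file the package installs, shown here for comparison, not something to hand-edit blindly):

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

One hedged observation: the earlier find output includes a path under /var/snap/docker, which suggests Docker may be installed as a snap. A snap-packaged dockerd reads its own config under /var/snap/docker rather than /etc/docker/daemon.json, which would explain why the nvidia runtime never registers despite nvidia-docker2 being installed.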

It still doesn’t work. Whenever I run it with --runtime=nvidia, I get this error:

docker: Error response from daemon: Unknown runtime specified nvidia.
See ‘docker run --help’.

Without --runtime but with --gpus all, I get this:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as ‘legacy’
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

Can you share /etc/nvidia-container-runtime/config.toml?
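For comparison, the default /etc/nvidia-container-runtime/config.toml on Ubuntu typically looks roughly like this (a sketch only; the exact contents vary by toolkit version):

```toml
disable-require = false

[nvidia-container-cli]
environment = []
load-kmods = true
ldconfig = "@/sbin/ldconfig.real"
```

The @ prefix on the ldconfig path tells the container CLI to run the host's ldconfig; a wrong path on this line is a known cause of "load library failed" errors like the one above.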

Please try to reinstall the NVIDIA driver.

Uninstall:
sudo apt purge nvidia-driver-525
sudo apt autoremove
sudo apt autoclean

Install: sudo apt install nvidia-driver-525

Thank you for the answers, but nothing seems to have changed after the reinstall.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

Please try the steps in New computer install GPU Docker error - #6 by david9xqqb, especially sudo systemctl restart docker.service.