Run TAO training problem

I tried to train the model using unet_isbi.ipynb but hit an initialization error: it could not open the shared object file libnvidia-ml.so.1. Is there something I might have missed downloading?

By the way, I am using cs_samples_v1.4.1.
Which container should I use?

May I know which device you are running on?
A Jetson device? A dGPU machine? Cloud? WSL?

I am running on a dGPU machine (nvidia v2), and my OS is Ubuntu 22.04.

Could you share the result of
$ nvidia-smi

This notebook is old. Could you use the latest notebook?
See TAO Toolkit Quick Start Guide - NVIDIA Docs.

And please check the requirement installation: TAO Toolkit Quick Start Guide - NVIDIA Docs
and
TAO Toolkit Quick Start Guide - NVIDIA Docs

Ubuntu 22.04 should also work.
The error you mentioned is most likely related to nvidia-container-toolkit.
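
A rough way to check both sides of this on the host is sketched below. It assumes a Linux host where ldconfig is available; have_cmd and lists_nvml are hypothetical helper names used only for illustration. nvidia-container-cli is the command-line tool that is normally installed alongside nvidia-container-toolkit, so its absence is only a hint, not proof, that the toolkit is missing.

```shell
# Hypothetical helper: succeeds if the given command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: reads `ldconfig -p`-style text on stdin and
# succeeds if it lists libnvidia-ml.so.1.
lists_nvml() {
    grep -q 'libnvidia-ml\.so\.1'
}

# Is the NVML library visible to the host's dynamic linker?
if have_cmd ldconfig && ldconfig -p 2>/dev/null | lists_nvml; then
    echo "host: libnvidia-ml.so.1 found"
else
    echo "host: libnvidia-ml.so.1 not found (check the NVIDIA driver install)"
fi

# Is the container toolkit's CLI on PATH?
if have_cmd nvidia-container-cli; then
    echo "nvidia-container-cli: $(command -v nvidia-container-cli)"
else
    echo "nvidia-container-cli: not on PATH (nvidia-container-toolkit may be missing)"
fi
```

If the library is found on the host but the container still cannot open it, the toolkit and Docker runtime setup is the more likely place to look.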

I have already downloaded the latest version of the notebooks, named "getting_started_v5.3.0", and I also meet the requirements, including nvidia-container-toolkit. Unfortunately, the problem above still shows up.

By the way, I am using the container "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5" and running in a virtualenv.

To narrow this down, can you run the command below successfully?
Open a terminal, then run:
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash
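
Since --runtime=nvidia only works when Docker has a runtime registered under that name, it may also help to look at what `docker info` reports. The snippet below is a minimal sketch; has_nvidia_runtime is a hypothetical helper name, and it assumes that `docker info` prints a "Runtimes:" line listing the registered runtimes.

```shell
# Hypothetical helper: reads `docker info` text on stdin and succeeds
# if the Runtimes line mentions nvidia.
has_nvidia_runtime() {
    grep -q 'Runtimes:.*nvidia'
}

if command -v docker >/dev/null 2>&1; then
    if docker info 2>/dev/null | has_nvidia_runtime; then
        echo "docker: nvidia runtime is registered"
    else
        echo "docker: nvidia runtime not listed (or the daemon is not reachable)"
    fi
else
    echo "docker: command not found"
fi
```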

Is my container's version correct?

nvidia-docker2 is not installed.

Please run:

$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
$ sudo pkill -SIGHUP dockerd
$ sudo systemctl restart docker.service
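
After the restart, a quick sanity check can confirm whether the package actually landed. This is only a sketch: mentions_nvidia_runtime is a hypothetical helper name, and it assumes that the nvidia runtime is registered in /etc/docker/daemon.json with a path of nvidia-container-runtime, which may differ on a customized setup.

```shell
# Hypothetical helper: reads daemon.json text on stdin and succeeds
# if it references nvidia-container-runtime.
mentions_nvidia_runtime() {
    grep -q 'nvidia-container-runtime'
}

# Ask dpkg whether the package is fully installed.
if command -v dpkg-query >/dev/null 2>&1 &&
   dpkg-query -W -f='${Status}\n' nvidia-docker2 2>/dev/null | grep -q 'install ok installed'; then
    echo "nvidia-docker2: installed"
else
    echo "nvidia-docker2: not installed (or dpkg is unavailable)"
fi

# Look for the runtime entry in Docker's daemon configuration (assumed location).
if [ -r /etc/docker/daemon.json ] && mentions_nvidia_runtime < /etc/docker/daemon.json; then
    echo "daemon.json: nvidia runtime configured"
else
    echo "daemon.json: nvidia runtime entry not found"
fi
```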

But when I tried to run nvidia-docker --version, it still shows "command not found". Any suggestions?

Can you uninstall it and install again?

I ran sudo apt-get remove --purge nvidia-docker2
and sudo apt-get autoremove to uninstall nvidia-docker2, then installed it again, but the same issue appears.


Please create a new launcher and retry. Refer to TAO Toolkit Quick Start Guide - NVIDIA Docs.

I am confused. Is this message correct?

I retried all of the instructions in the link you provided, but the problem was not resolved.

Can you run the following?
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash


It still shows the same error.

Please install nvidia-docker2 again under this (base) launcher.

I tried to install nvidia-docker2 following the instructions above, but I still get the same result.