Error: Local version of NVML doesn't implement this function

I have a Jetson AGX Orin and I want to train an SSD model for object detection inside a Docker container. To do so, I have built a pytorch_nvidia Docker image for aarch64 Tegra that is compatible with the L4T version on the Orin, which was flashed with JetPack 5.0.2.
Host version info:

  • CUDA version: 11.4
  • L4T R35.1.0

This is the Dockerfile:
Dockerfile.tegra (1.4 KB)

To train the model I use the main.py script from the NVIDIA/DeepLearningExamples GitHub repository (PyTorch/Detection/SSD).
I get the following error:

dlopen libnvidia-ml.so failed!. Please install GPU dirver[/opt/dali/dali/util/nvml_wrap.cc:69] nvmlInitChecked failed: 
Traceback (most recent call last):
  File "src/train.py", line 286, in <module>
    train(train_loop_func, logger, args)
  File "src/train.py", line 148, in train
    train_loader = get_train_loader(args, args.seed - 2**31)
  File "/workspace/pytorch_nvidia/src/ssd/data.py", line 40, in get_train_loader
    train_pipe.build()
  File "/usr/local/lib/python3.8/dist-packages/nvidia/dali/pipeline.py", line 861, in build
    self._pipe.Build(self._generate_build_args())
RuntimeError: nvml error (13): Local version of NVML doesn't implement this function

I also tried changing the DALI installation to build it from source for CUDA 11.4, but the error persists. Any ideas?
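
Since the log starts with "dlopen libnvidia-ml.so failed!", one quick check is whether the NVML shared library can be loaded at all from inside the container. Below is a minimal sketch, assuming Python 3 is available in the container. The helper name `probe_nvml` is hypothetical, and the library names it tries are the usual sonames, not something confirmed for this image.

```python
import ctypes


def probe_nvml(names=("libnvidia-ml.so.1", "libnvidia-ml.so")):
    """Try to dlopen each candidate NVML library name.

    Returns a list of (name, loaded, detail) tuples. The default names
    are an assumption (common NVML sonames), not taken from the image.
    """
    results = []
    for name in names:
        try:
            # ctypes.CDLL goes through the dynamic loader, so this mirrors
            # the dlopen step that the log reports as failing.
            ctypes.CDLL(name)
        except OSError as exc:
            results.append((name, False, str(exc)))
        else:
            results.append((name, True, "loaded"))
    return results


if __name__ == "__main__":
    for name, loaded, detail in probe_nvml():
        status = "OK" if loaded else "FAILED"
        print(f"{name}: {status} ({detail})")
```

If every candidate fails to load, that would suggest the library is missing or not on the loader path inside the container, rather than a problem with how DALI was built.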

Hi,

Please try our container for Jetson below:

Thanks.

If you look at the Dockerfile I attached, you will see that I already use the image you mentioned as the base image, so this does not solve my problem. Any further ideas?

The problem remains unresolved. I have a project stuck because of this issue. Could you give me some ideas on how to fix it? Do you need any other information from me so that you can help me?

Hi all, I’m a co-worker of @maria.mercade. We started from scratch with another NVIDIA Orin to see whether the problem was caused by the initial installation on the first Orin. We get the same error.