I have a Jetson AGX Orin and want to train an SSD object-detection model inside a Docker container. To do so, I built a pytorch_nvidia Docker image for aarch64 Tegra that is compatible with the Orin's L4T version. The device was flashed with JetPack 5.0.2.
Host version info:
- CUDA version: 11.4
- L4T version: R35.1.0
This is the Dockerfile:
Dockerfile.tegra (1.4 KB)
To train the model I use the main.py script from the NVIDIA/DeepLearningExamples GitHub repository (PyTorch/Detection/SSD).
I get the following error:
dlopen libnvidia-ml.so failed!. Please install GPU dirver[/opt/dali/dali/util/nvml_wrap.cc:69] nvmlInitChecked failed:
Traceback (most recent call last):
  File "src/train.py", line 286, in <module>
    train(train_loop_func, logger, args)
  File "src/train.py", line 148, in train
    train_loader = get_train_loader(args, args.seed - 2**31)
  File "/workspace/pytorch_nvidia/src/ssd/data.py", line 40, in get_train_loader
    train_pipe.build()
  File "/usr/local/lib/python3.8/dist-packages/nvidia/dali/pipeline.py", line 861, in build
    self._pipe.Build(self._generate_build_args())
RuntimeError: nvml error (13): Local version of NVML doesn't implement this function
I also tried changing the DALI installation by compiling it from source for CUDA 11.4, but the error persists. Any ideas?
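In case it helps with diagnosis: since the first message says that dlopen of libnvidia-ml.so failed, one way to check whether the dynamic loader inside the container can find that library at all is a small Python script like the one below. This is only a minimal sketch using the standard ctypes module. The helper name can_load_shared_library is made up for illustration, and the libnvidia-ml.so.1 variant is an assumed alternative file name, not something taken from the log.

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open `name`, False otherwise."""
    try:
        # ctypes.CDLL calls dlopen() on Linux and raises OSError on failure.
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # "libnvidia-ml.so" is the name from the error message; the ".so.1"
    # variant is an assumption about how the library might be installed.
    for lib in ("libnvidia-ml.so", "libnvidia-ml.so.1"):
        print(lib, "loadable:", can_load_shared_library(lib))
```

Running this both on the host and inside the container should show whether the library is missing everywhere or only inside the container.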