libcurand.so.10 not found on JetPack 4.6.2 in Docker

I would like to use (or build) PyTorch with CUDA in a Docker container, but it seems the CUDA files are not mounted from the host.

$ docker run --gpus all --rm -it --network host nvcr.io/nvidia/l4t-pytorch:r32.7.1-pth1.9-py3
root@agx:/# python3 -c 'import torch'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 196, in <module>
    _load_global_deps()
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 149, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcurand.so.10: cannot open shared object file: No such file or directory
root@agx:/# ls /usr/local/cuda-10.2/targets/aarch64-linux/lib/
libcudadevrt.a  libcudart_static.a  stubs
root@agx:/#

On the host:

$ find /usr -name libcurand.so.*
/usr/local/cuda-10.2/doc/man/man7/libcurand.so.7
/usr/local/cuda-10.2/targets/aarch64-linux/lib/libcurand.so.10
/usr/local/cuda-10.2/targets/aarch64-linux/lib/libcurand.so.10.1.2.300

I also tried using nvcr.io/nvidia/l4t-base:r32.7.1 to build PyTorch, but it doesn’t have the CUDA libraries either.

# ls /usr/local/cuda-10.2/targets/aarch64-linux/lib/
libcudadevrt.a  libcudart_static.a  stubs
  • Is it correct to use the r32.7.1 images on JetPack 4.6.2 to use CUDA?
  • Is there any Docker base image with CUDA for JetPack 4.6.2, or should I reinstall a different version of JetPack to use CUDA from a Docker image?

Any other way to use or build any version of PyTorch in a Docker container would also be appreciated.

Hi @fujii5, I think you mean JetPack 4.6.1 (L4T R32.7.1), and yes, the r32.7.1 docker images are the right ones to use with JetPack 4.6.1.

On JetPack 4.x, CUDA/cuDNN/TensorRT/etc. are mounted from your device into the container when --runtime nvidia is used to start the container. On JetPack 5, CUDA/etc. are installed inside the container.

I noticed you were using the --gpus all flag, can you try running it like this instead:

$ sudo docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-pytorch:r32.7.1-pth1.9-py3
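Once you are inside that container, a quick check (just to confirm whether the mount happened; it is the same directory you listed above) is:

# ls -l /usr/local/cuda-10.2/targets/aarch64-linux/lib/ | grep libcurand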

If that still doesn’t work, can you check that you have these CSV files on your device?

$ ls -ll /etc/nvidia-container-runtime/host-files-for-container.d/
total 32
-rw-r--r-- 1 root root    26 May 23  2021 cuda.csv
-rw-r--r-- 1 root root  4250 Jul 13  2021 cudnn.csv
-rw-r--r-- 1 root root 12240 Feb  2 16:30 l4t.csv
-rw-r--r-- 1 root root  1590 Jan 14 04:44 tensorrt.csv
-rw-r--r-- 1 root root   325 Aug 11  2020 visionworks.csv

@dusty_nv , thanks for your response.

I’m using JetPack 4.6.2 (R32.7.2).
As there is no r32.7.2 tag for l4t-pytorch or l4t-base, I’m currently using r32.7.1 images instead.

$ cat /etc/nv_tegra_release
# R32 (release), REVISION: 7.2, GCID: 30192233, BOARD: t186ref, EABI: aarch64, DATE: Sun Apr 17 09:53:50 UTC 2022

Thanks, but it still doesn’t work:

$ sudo docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-pytorch:r32.7.1-pth1.9-py3
root@agx:/# python3 -c 'import torch'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 196, in <module>
    _load_global_deps()
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 149, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcurand.so.10: cannot open shared object file: No such file or directory
root@agx:/# 

CSV files exist on host:

$ ls -ll /etc/nvidia-container-runtime/host-files-for-container.d/
total 32
-rw-r--r-- 1 root root    26 May 24  2021 cuda.csv
-rw-r--r-- 1 root root  4250 Jul 13  2021 cudnn.csv
-rw-r--r-- 1 root root 12240 Apr 17 09:49 l4t.csv
-rw-r--r-- 1 root root  1590 Jan 14 09:44 tensorrt.csv
-rw-r--r-- 1 root root   325 Aug 11  2020 visionworks.csv

OK, the r32.7.1 images should still work on r32.7.2.

Can you check one more thing: if you run the following, does it work?

$ sudo docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-base:r32.7.1
# python3 -c 'import tensorrt'

If that doesn’t work either, it would seem there is something wrong with your NVIDIA Container Runtime, and you should either re-install that through apt, or you may just want to re-flash the device if you continue having problems with it.
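If you try the apt re-install first, something along these lines should pull the runtime packages back in (just a sketch; adjust the list to whatever nvidia-container packages apt reports as installed on your device):

$ sudo apt-get install --reinstall nvidia-docker2 nvidia-container-runtime nvidia-container-toolkit
$ sudo systemctl restart docker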


libcurand.so.10 comes from the libcurand-10-2 package. Do you have it installed on the device before you start the container?

$ apt list --installed | grep libcurand-10-2

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libcurand-10-2/stable,now 10.1.2.300-1 arm64 [installed]

It doesn’t work, even after I re-installed nvidia-container-runtime with sudo apt remove nvidia-container-runtime ; sudo apt install nvidia-container-runtime.
I’ll try re-flashing the device.

Yes, I have libcurand on the host:

$ apt list --installed | grep libcurand-10-2

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libcurand-10-2/stable,now 10.1.2.300-1 arm64 [installed,automatic]

Your host has the file /etc/nvidia-container-runtime/host-files-for-container.d/cuda.csv with the following content (please double-check yours):

dir, /usr/local/cuda-10.2

This means /usr/local/cuda-10.2 in the container should be mapped from the host and have exactly the same content.

Since your host has /usr/local/cuda-10.2/targets/aarch64-linux/lib/libcurand.so.10, your container should have it too.
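One way to confirm whether that mapping actually happens (just a sketch, using the l4t-base image) is to compare the host’s listing with the container’s:

$ ls /usr/local/cuda-10.2/ > /tmp/host-cuda.txt
$ sudo docker run --rm --runtime nvidia nvcr.io/nvidia/l4t-base:r32.7.1 ls /usr/local/cuda-10.2/ > /tmp/container-cuda.txt
$ diff /tmp/host-cuda.txt /tmp/container-cuda.txt

If the two listings differ, the CSV-based mount is not working.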

My cuda.csv has the same content:

$ cat /etc/nvidia-container-runtime/host-files-for-container.d/cuda.csv
dir, /usr/local/cuda-10.2

But it seems to fail to get mapped into the container:

$ ls /usr/local/cuda-10.2/
EULA.txt  doc     include  nvml  nvvmx    share    tools         version.txt
bin       extras  lib64    nvvm  samples  targets  version.json
$ docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-base:r32.7.1
root@agx:/# ls /usr/local/cuda-10.2/
bin  include  lib64  nvvm  nvvmx  targets

Which docker and container packages are installed?

apt list --installed | grep docker
apt list --installed | grep container

(JetPack 4.6.1)

Also, what happens if you force-mount it?

docker run -it --rm --runtime nvidia --network host -v /usr/local/cuda-10.2/:/usr/local/cuda-10.2/:ro nvcr.io/nvidia/l4t-base:r32.7.1

python3 -c 'import torch'

Some packages are newer versions than the ones in your JetPack 4.6.1:

$ apt list --installed | grep docker

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

docker/bionic,now 1.5-1build1 arm64 [installed]
docker.io/bionic-updates,bionic-security,now 20.10.7-0ubuntu5~18.04.3 arm64 [installed]
nvidia-docker2/bionic,now 2.10.0-1 all [installed]
$ apt list --installed | grep container

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

containerd/bionic-updates,bionic-security,now 1.5.5-0ubuntu3~18.04.2 arm64 [installed,automatic]
libnvidia-container-tools/bionic,now 1.9.0-1 arm64 [installed]
libnvidia-container0/bionic,now 0.11.0+jetpack arm64 [installed]
libnvidia-container1/bionic,now 1.9.0-1 arm64 [installed]
nvidia-container-csv-cuda/stable,now 10.2.460-1 arm64 [installed]
nvidia-container-csv-cudnn/stable,now 8.2.1.32-1+cuda10.2 arm64 [installed]
nvidia-container-csv-tensorrt/stable,now 8.2.1.8-1+cuda10.2 arm64 [installed]
nvidia-container-csv-visionworks/stable,now 1.6.0.501 arm64 [installed]
nvidia-container-runtime/bionic,now 3.9.0-1 all [installed]
nvidia-container-toolkit/bionic,now 1.9.0-1 arm64 [installed]

libcudnn.so.8 is also needed:

$ docker run -it --rm --runtime nvidia --network host -v /usr/local/cuda-10.2/:/usr/local/cuda-10.2/:ro nvcr.io/nvidia/l4t-pytorch:r32.7.1-pth1.9-py3
root@agx:/# python3 -c 'import torch'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 196, in <module>
    _load_global_deps()
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 149, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /usr/lib/aarch64-linux-gnu/libcudnn.so.8: file too short

After I force-mounted /usr/lib/aarch64-linux-gnu too, torch seems to be available!

$ docker run -it --rm --runtime nvidia --network host -v /usr/local/cuda-10.2/:/usr/local/cuda-10.2/:ro -v /usr/lib/aarch64-linux-gnu/:/usr/lib/aarch64-linux-gnu nvcr.io/nvidia/l4t-pytorch:r32.7.1-pth1.9-py3
root@agx:/# python3 -c 'import torch; print(torch.cuda.is_available())'
True

But when I add the read-only option to /usr/lib/aarch64-linux-gnu, the container fails to start:

$ docker run -it --rm --runtime nvidia --network host -v /usr/local/cuda-10.2/:/usr/local/cuda-10.2/:ro -v /usr/lib/aarch64-linux-gnu/:/usr/lib/aarch64-linux-gnu:ro nvcr.io/nvidia/l4t-pytorch:r32.7.1-pth1.9-py3
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: src: /etc/vulkan/icd.d/nvidia_icd.json, src_lnk: /usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json, dst: /mnt/m2ssd/docker/overlay2/2150184e7577a3a38c5cade12e51d8fab5da5ca67e7bad9f0e0d39527d396eac/merged/etc/vulkan/icd.d/nvidia_icd.json, dst_lnk: /usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json
src: /usr/lib/aarch64-linux-gnu/libcuda.so, src_lnk: tegra/libcuda.so, dst: /mnt/m2ssd/docker/overlay2/2150184e7577a3a38c5cade12e51d8fab5da5ca67e7bad9f0e0d39527d396eac/merged/usr/lib/aarch64-linux-gnu/libcuda.so, dst_lnk: tegra/libcuda.so
, stderr: nvidia-container-cli: mount error: stat failed: /usr/lib/python3.6/dist-packages/onnx_graphsurgeon: no such file or directory: unknown.

The libnvidia-container0 0.11.0+jetpack installed on your Jetson is the version released for JetPack 5.

Checking SDK Manager’s deb package download directory, sdkm_downloads/, JetPack 4.6.2 seems to install the same packages as JetPack 4.6.1: libnvidia-container0_0.10.0+jetpack_arm64.deb and libnvidia-container-tools_1.7.0-1_arm64.deb.

I think you somehow updated to the JetPack 5 packages, but as @dusty_nv wrote, from JetPack 5 onward CUDA and the other libraries are the ones installed inside the container.

On JetPack 5, CUDA/ect are installed inside the container.

I think re-flashing is the quickest way to solve this issue.
If you cannot re-flash, you can try either of the following:

  1. get the docker-related deb packages from sdkm_downloads/ and install them (a rough sketch follows this list), or
  2. install the CUDA packages inside the docker container (TensorRT probably cannot be installed; I don’t think it is distributed as a standalone deb).
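For 1., the downgrade might look roughly like this after copying the two debs above from your PC’s sdkm_downloads/ directory to the Jetson (a sketch only; dpkg may complain about dependencies from the newer packages):

$ sudo dpkg -i libnvidia-container0_0.10.0+jetpack_arm64.deb libnvidia-container-tools_1.7.0-1_arm64.deb
$ sudo systemctl restart docker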

However, I have not heard of success with either of these.
As for 1., I think I saw some reports of failures on JetPack 4.5.x.
As for 2., I am not sure whether it was around the time of JetPack 3.1 or earlier, but it is similar to how things were done when the L4T kernel was just getting Docker support; CUDA + TensorFlow was working back then.

@naisy
Thank you so much for your detailed explanation. I figured out what is happening.
I’m going to re-flash JetPack 4.6.2.

After I re-flashed JetPack 4.6.2 via SDK Manager, the CUDA libraries are correctly mounted into the Docker container!
Thank you all!
