L4T Docker CUDA

Hi!

So I’m setting up a deployment pipeline using Docker for an application running on a Xavier NX. Ideally I want to build an image on a server somewhere and push it to my Xaviers.

I found the guide on using QEMU on your GitHub, and so far so good. I’m starting from the L4T PyTorch image (nvcr.io/nvidia/l4t-pytorch:r32.4.4-pth1.6-py3).

The problem is that I’m using the trt_pose repo, and to install it (which I want to do while building the image) its setup.py needs to import torch, which fails with the following error:

Traceback (most recent call last):
  File "setup.py", line 2, in <module>
    from torch.utils import cpp_extension
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 188, in <module>
    _load_global_deps()
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 141, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcurand.so.10: cannot open shared object file: No such file or directory

As I understand it, this has something to do with the container using CUDA from the host system?

Can this be solved somehow? I’ve installed the CUDA cross-compile package using the SDK Manager.

Edit: It turns out it does not build on a Xavier box either. It fails at the same step (here is the relevant part of the Dockerfile):
# Note: WORKDIR does not expand "~" in a Dockerfile, so use an absolute path
WORKDIR /root
RUN pip3 install tqdm cython pycocotools
RUN apt-get update && apt-get install -y python3-matplotlib
RUN git clone https://github.com/NVIDIA-AI-IOT/trt_pose
RUN cd trt_pose && python3 setup.py install

But that is a bit strange, since I can run `from torch.utils import cpp_extension` with no problem if I just open a shell in the intermediate container. And the file libcurand.so.10 definitely does exist.
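One way to compare the two situations (a small diagnostic sketch of my own, not part of the original post) is to reproduce the exact dlopen call that torch performs at import time:

```python
import ctypes

def can_load(libname):
    """Try to dlopen a shared library the same way torch's
    _load_global_deps() does when you `import torch`."""
    try:
        ctypes.CDLL(libname, mode=ctypes.RTLD_GLOBAL)
        return True
    except OSError:
        return False

# In a `docker run` shell of the L4T image this succeeds, because the
# NVIDIA runtime mounts /usr/local/cuda-10.2 from the host; in a plain
# `docker build` step it fails, since the default runtime does no such mount.
print(can_load("libcurand.so.10"))
```

Running this in both the `docker run` shell and as a `RUN` step in the Dockerfile should show whether the library is visible to the dynamic loader in each context.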

Edit2:

I don’t know if I’m doing something wrong, but if I just run the base L4T PyTorch image, I can install trt_pose just fine from the command line. So why not during the docker build process?

Hi Oscar,

I ran across your post and am having the same issue. I am running a docker container and then attempting to import PyTorch:

root@6481d7e0356e:/# python3
Python 3.6.9 (default, Oct  8 2020, 12:12:24)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 188, in <module>
    _load_global_deps()
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 141, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcurand.so.10: cannot open shared object file: No such file or directory

We have tracked the issue down to /usr/local/cuda-10.2 not being mounted from the host into the container. You can check this by enabling debug logging (in /etc/nvidia-container-runtime/config.toml) and checking whether /usr/local/cuda-10.2 appears in the list of mounted directories (search for “mount jetson dirs”) in the log /var/log/nvidia-container-toolkit.log.
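For reference, enabling debug means uncommenting the `debug` lines in that file; the paths below are what a typical JetPack install ships with and may differ on your system:

```toml
# /etc/nvidia-container-runtime/config.toml (relevant lines only)
[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
```

The next container start should then write the mount decisions to the toolkit log.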

In the good case the log should show something like this:

nvc_mount.c:495] mount jetson dirs
jetson_mount.c:50] mounting /lib/firmware/tegra19x at /var/lib/docker/overlay2/2145d7e1787bbfb739d2f2c6231db39d4aed7be136b541dc88a8b755b2750418/merged/lib/firmware/tegra19x
jetson_mount.c:50] mounting /lib/firmware/tegra18x at /var/lib/docker/overlay2/2145d7e1787bbfb739d2f2c6231db39d4aed7be136b541dc88a8b755b2750418/merged/lib/firmware/tegra18x
jetson_mount.c:50] mounting /usr/local/cuda-10.2 at /var/lib/docker/overlay2/2145d7e1787bbfb739d2f2c6231db39d4aed7be136b541dc88a8b755b2750418/merged/usr/local/cuda-10.2
jetson_mount.c:50] mounting /usr/lib/python2.7/dist-packages/tensorrt at /var/lib/docker/overlay2/2145d7e1787bbfb739d2f2c6231db39d4aed7be136b541dc88a8b755b2750418/merged/usr/lib/python2.7/dist-packages/tensorrt
jetson_mount.c:50] mounting /usr/lib/python2.7/dist-packages/graphsurgeon at /var/lib/docker/overlay2/2145d7e1787bbfb739d2f2c6231db39d4aed7be136b541dc88a8b755b2750418/merged/usr/lib/python2.7/dist-packages/graphsurgeon
jetson_mount.c:50] mounting /usr/lib/python2.7/dist-packages/uff at /var/lib/docker/overlay2/2145d7e1787bbfb739d2f2c6231db39d4aed7be136b541dc88a8b755b2750418/merged/usr/lib/python2.7/dist-packages/uff
jetson_mount.c:50] mounting /usr/lib/python3.6/dist-packages/tensorrt at /var/lib/docker/overlay2/2145d7e1787bbfb739d2f2c6231db39d4aed7be136b541dc88a8b755b2750418/merged/usr/lib/python3.6/dist-packages/tensorrt
jetson_mount.c:50] mounting /usr/lib/python3.6/dist-packages/graphsurgeon at /var/lib/docker/overlay2/2145d7e1787bbfb739d2f2c6231db39d4aed7be136b541dc88a8b755b2750418/merged/usr/lib/python3.6/dist-packages/graphsurgeon
jetson_mount.c:50] mounting /usr/lib/python3.6/dist-packages/uff at /var/lib/docker/overlay2/2145d7e1787bbfb739d2f2c6231db39d4aed7be136b541dc88a8b755b2750418/merged/usr/lib/python3.6/dist-packages/uff
jetson_mount.c:50] mounting /usr/src/tensorrt at /var/lib/docker/overlay2/2145d7e1787bbfb739d2f2c6231db39d4aed7be136b541dc88a8b755b2750418/merged/usr/src/tensorrt
nvc_mount.c:503] mount libraries32

In the bad case:

nvc_mount.c:495] mount jetson dirs
jetson_mount.c:50] mounting /lib/firmware/tegra19x at /media/plugin-data/docker/overlay2/9ddc137b81aeea2f46a3d6fa4a8637f4b227b45882ecdaf8cc5253b7556e4c92/merged/lib/firmware/tegra19x
jetson_mount.c:50] mounting /lib/firmware/tegra18x at /media/plugin-data/docker/overlay2/9ddc137b81aeea2f46a3d6fa4a8637f4b227b45882ecdaf8cc5253b7556e4c92/merged/lib/firmware/tegra18x
nvc_mount.c:503] mount libraries32

What is most interesting is that the list of folders that get mounted depends on whether they existed the first time the NVIDIA runtime was used (the runtime is available by apt-installing nvidia-docker2). That means that if the directory /usr/local/cuda-10.2 does not exist when the runtime is first used, it will never be mounted. No number of reboots, or even re-installation of nvidia-docker2, fixed this issue. I discovered it after re-flashing my NVIDIA Xavier NX and installing all dependencies before installing nvidia-docker2 and using the NVIDIA docker runtime for the first time.

FYI, I am still in the process of trying to resolve this issue and have not fully validated my theory above.

Okay, we figured out how the directories are mounted: they are driven by CSV files in /etc/nvidia-container-runtime/host-files-for-container.d. See: https://github.com/NVIDIA/libnvidia-container/blob/jetson/design/mount_plugins.md

Here are the Debian packages that provide the CSV files:

root@nx:/etc/nvidia-container-runtime/host-files-for-container.d# ls | xargs -n 1 sh -c 'dpkg -S $0'
nvidia-container-csv-cuda: /etc/nvidia-container-runtime/host-files-for-container.d/cuda.csv
nvidia-container-csv-cudnn: /etc/nvidia-container-runtime/host-files-for-container.d/cudnn.csv
nvidia-l4t-init: /etc/nvidia-container-runtime/host-files-for-container.d/l4t.csv
nvidia-container-csv-tensorrt: /etc/nvidia-container-runtime/host-files-for-container.d/tensorrt.csv
nvidia-container-csv-visionworks: /etc/nvidia-container-runtime/host-files-for-container.d/visionworks.csv

The contents of these files define the mounting rules for the NVIDIA runtime. For example:

root@linux:/etc/nvidia-container-runtime/host-files-for-container.d# cat cuda.csv
dir, /usr/local/cuda-10.2
sym, /usr/lib/aarch64-linux-gnu/libcublasLt.so
sym, /usr/lib/aarch64-linux-gnu/libnvblas.so
sym, /usr/lib/aarch64-linux-gnu/libcublas.so
lib, /usr/lib/aarch64-linux-gnu/libcublas.so.10.2.2.89
lib, /usr/lib/aarch64-linux-gnu/libcublasLt.so.10.2.2.89
lib, /usr/lib/aarch64-linux-gnu/libnvblas.so.10.2.2.89
sym, /usr/lib/aarch64-linux-gnu/libcublasLt.so.10
sym, /usr/lib/aarch64-linux-gnu/libcublas.so.10
lib, /usr/include/cublas_api.h
lib, /usr/include/cublasLt.h
lib, /usr/include/cublasXt.h
lib, /usr/include/cublas.h
lib, /usr/include/cublas_v2.h
sym, /usr/local/cuda
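As far as I can tell from the mount_plugins.md design doc linked above, the entry types describe how the runtime maps each host path into the container: `dir` bind-mounts a whole directory, `lib` mounts an individual file, and `sym` recreates a symlink. So if you ever need to expose an extra host library to containers, dropping a custom CSV into the same directory should work (a hedged sketch; the file name and paths below are made up for illustration):

```csv
dir, /opt/my-host-sdk
lib, /usr/lib/aarch64-linux-gnu/libmycustom.so.1
sym, /usr/lib/aarch64-linux-gnu/libmycustom.so
```

I have not verified every entry type myself, so check the design doc before relying on this.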

Interesting, thanks for your reply, Joseph! For me the problem was not having nvidia-docker set as the default runtime in daemon.json, so now I can build on the Xavier. But I still have not figured out how to build on the x86 system, which is what I really want.
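For anyone landing here later: making `nvidia` the default runtime means editing `/etc/docker/daemon.json` on the Jetson to something like this (the `runtimes` entry is usually already present after installing nvidia-docker2; the key addition is `default-runtime`):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

After editing, restart the daemon (`sudo systemctl restart docker`). Because `docker build` always uses the default runtime, this is what makes the CUDA mounts available during the build, not just with `docker run --runtime nvidia`.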