Suggestion to solve Tegra Nvidia-docker issues

So, right now on Tegra, to save space, nvidia-docker bind mounts a bunch of stuff from the host into the container, and this breaks things when the host goes out of sync with what the images expect. My proposal to Nvidia to fix this is as follows:

Just make it work like Nvidia docker on x86. Instead of sharing cuda, tensorrt, and so on, provide a minimal version of L4T set up to run containers – with just the drivers – bundling the cuda stuff inside a base image instead, which can come pre-installed.

Thanks to the way Docker works, that base image is then shared between all the other images, so there is no duplication (until update anyway) and the storage requirements should still be more or less the same – the caveat being that such a version of L4T should just be for running containers and nothing else. It doesn’t even have to be Ubuntu.
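
For example (the base image name below is made up), every application image would then start FROM the same shared base and only add its own thin layers on top:

# hypothetical pre-installed CUDA base image for Tegra, loaded once per device
FROM nvcr.io/nvidia/l4t-cuda:10.0-base

# only the application's own layers are unique to this image
COPY my_app/ /opt/my_app/
CMD ["/opt/my_app/run.sh"]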

Hi,

Thanks for your question.

Beside nvidia-l4t-base, there is another docker image for Jetson called DeepStream-l4t.
In deepstream-l4t, CUDA, cuDNN, TensorRT and the DeepStream library come pre-installed.

Is this what you expect? Apologies in advance if we have missed something in your suggestion.

Thanks.


Thanks, AastaLLL,

The image size is nice, and the way it works by bind mounting a bunch of stuff to save space is kinda neat, but it isn’t consistent with x86, and there are other problems I see with this approach. I apologize in advance for the wall of text.

Problem 1 - host integrity

One of the limitations of the beta is that we are mounting the cuda directory from the host. This was done with size in mind as a development CUDA container weighs 3GB, on Nano it’s not always possible to afford such a huge cost. We are currently working towards creating smaller CUDA containers.

And it’s not just /usr/local/cuda. A whole bunch of things need to be mounted inside for it to work, and doing so is risky to the host if you, say, forget to append :ro to the docker run -v .... That’s real easy to do. For example, the documentation says /usr/local/cuda is mounted read only, but the run examples actually bypass that by omitting “:ro”.

Example:

[user@hostname] -- [/usr/local/cuda] 
 $ sudo docker run -it -v /usr/local/cuda:/usr/local/cuda --rm nvcr.io/nvidia/deepstream-l4t:4.0.2-19.12-samples
root@5d7d31bebf12:~# cd /usr/local/cuda
root@5d7d31bebf12:/usr/local/cuda# ls
LICENSE  NsightCompute-1.0  README  bin  doc  extras  include  lib64  nvml  nvvm  samples  share  targets  tools  version.txt
root@5d7d31bebf12:/usr/local/cuda# touch test
root@5d7d31bebf12:/usr/local/cuda# exit
[user@hostname] -- [/usr/local/cuda] 
 $ ls
bin  doc  extras  include  lib64  LICENSE  NsightCompute-1.0  nvml  nvvm  README  samples  share  targets  test  tools  version.txt

Yes, root inside a container should be treated as root outside, and “containers do not contain”, but one can imagine situations where it’s easy to accidentally break the system this way, either during image build or at runtime, if some process, like maybe apt, or somebody’s script, tries to write to some path like /usr/local/cuda.

This is read only:

 $ sudo docker run -it -v /usr/local/cuda:/usr/local/cuda:ro --rm nvcr.io/nvidia/deepstream-l4t:4.0.2-19.12-samples
root@9742f01e5be7:~# cd /usr/local/cuda
root@9742f01e5be7:/usr/local/cuda# touch test
touch: cannot touch 'test': Read-only file system

But that relies on :ro being appended, and nobody forgetting that, which is really easy to do. If there is a base image with those files, on the other hand, overlayfs ensures that you can modify /usr/local/cuda to your heart’s content and the original layers will still be intact.
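
For example (hypothetical image name), a stray write would just land in the container’s writable layer and disappear with the container, leaving the image layers and the host untouched:

 $ sudo docker run -it --rm hypothetical/l4t-cuda:latest
root@<container>:/# touch /usr/local/cuda/test
root@<container>:/# exit
 $ sudo docker run --rm hypothetical/l4t-cuda:latest ls /usr/local/cuda/test
ls: cannot access '/usr/local/cuda/test': No such file or directory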

I recognize the need to conserve space on Tegra, which is why I suggested shipping the base image pre-installed (e.g. via docker save and docker load) and not installing cuda on the host itself – rather just the drivers – as happens on x86. It’ll require a separate L4T version just for running containers, but those already exist, for Tegra too.
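
Something like this could be used to pre-load such a base image onto devices without pulling from a registry (image name is hypothetical again):

# on a build machine with connectivity and disk to spare:
docker save nvcr.io/nvidia/l4t-cuda:10.0-base | gzip > l4t-cuda-base.tar.gz

# on the Jetson, e.g. as part of flashing or first boot:
gunzip -c l4t-cuda-base.tar.gz | docker load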

Problem 2 - consistency

Because of the way all this works, more has to stay in sync between the host and the image than with an x86 image, where the driver just needs to meet a minimum version to run a particular image (and cuda does not need to be installed on the host). On top of this, the same Dockerfile used for x86 has to be rewritten for Tegra (at a minimum, the FROM line).

Ideally, I would like to be able to:

FROM nvcr.io/nvidia/deepstream:latest
...

… and build that on any NVIDIA platform. If there are unavoidable differences between architectures, I handle that with my build system the same way I do outside a container. That’s harder if things aren’t in consistent locations. For example, on Tegra, the headers for deepstream are installed here:

 $ sudo docker run -it -v /usr/local/cuda:/usr/local/cuda:ro --rm nvcr.io/nvidia/deepstream-l4t:4.0.2-19.12-samples
[sudo] password for username: 
root@a373058f081b:~# cd /root/
root@a373058f081b:~# ls
deepstream_sdk_v4.0.2_jetson

on x86:

... docker run --rm -it  nvcr.io/nvidia/deepstream:4.0.2-19.12-devel
...
root@9808ccaa6dcd:/# cd /root/
root@9808ccaa6dcd:~# ls
deepstream_sdk_v4.0.2_x86_64

And when you install the debian package (at least on Tegra), the headers end up in
/opt/nvidia/deepstream/deepstream-4.0/sources/includes/ (instead of /usr/local/include or whatever). So there are at least three different locations for the headers depending on how you use deepstream.

If these images were built from common parents with common instructions, the headers and samples wouldn’t be in two different locations and the image tags would be the same.

I realize this might require changes to how your repositories/registries work, both apt and Docker, but Canonical manages to make it work. I can apt-get install “foo” and be assured that “foo” will be the same version on all architectures.


Anyway, you can take or leave this critique. Please don’t take it like I don’t like y’all’s work in general. I like Nvidia products and will continue to develop for Nvidia platforms, but I’ve been avoiding Docker on Tegra for these reasons, and I imagine I might not be the only one. Image size is a plus to this approach, yes, but that’s about it.


Thanks for the useful feedback @mdegans - it’s on our future roadmap to move to a completely containerized approach (e.g. with CUDA/cuDNN/TensorRT/etc. installed inside of the container as opposed to mounted).

Note that if you run Docker with --runtime nvidia, it will automatically mount, read-only, the JetPack files that are specified in the CSVs (under /etc/nvidia-container-runtime/host-files-for-container.d). If you need these while building an image from a Dockerfile, you can edit your /etc/docker/daemon.json configuration file to include: "default-runtime": "nvidia"
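
For example, the daemon.json that ships with JetPack already registers the nvidia runtime, so adding the default-runtime key (and restarting the docker service) should be all that’s needed; it would end up looking something like this:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}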

Also, you could essentially create your own fully-containerized images today by disabling the CSVs (or not using the nvidia runtime during docker build) and installing the needed JetPack packages inside of your container (perhaps not even needing to use the l4t-base image). Hopefully the L4T apt server added in JetPack 4.3 makes that easier: copy the /etc/apt/sources.list.d/nvidia-l4t-apt-source.list from the host into the container and install the packages via apt.
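
A rough sketch of that approach (host paths and package names may vary by L4T release, and the sources list and key are assumed to have been copied from the Jetson into the build context beforehand):

# beforehand, on the Jetson host:
#   cp /etc/apt/sources.list.d/nvidia-l4t-apt-source.list .
#   cp /etc/apt/trusted.gpg.d/jetson-ota-public.asc .
FROM ubuntu:bionic

# ca-certificates is needed for the https L4T apt sources
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates

COPY nvidia-l4t-apt-source.list /etc/apt/sources.list.d/
COPY jetson-ota-public.asc /etc/apt/trusted.gpg.d/

# install whichever JetPack packages are needed, e.g. the CUDA dev libraries
RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-libraries-dev-10-0 \
    && rm -rf /var/lib/apt/lists/*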


Thanks, Dustin!

it’s on our future roadmap to move to a completely containerized approach (e.g. with CUDA/cuDNN/TensorRT/etc. installed inside of the container

I find that to be really good news.

Note that if you run Docker with --runtime nvidia, it will automatically mount, read-only, the JetPack files that are specified in the CSVs (under /etc/nvidia-container-runtime/host-files-for-container.d).

That is extremely useful, though it might help to make what it does more explicit in the documentation. On x86 it’s been a while since I used --runtime nvidia and I don’t recall if I ever looked at how it was implemented.

Now I use --gpus, and the nvidia runtime is no longer in the x86 docker list of runtimes (docker info). I can only assume --runtime nvidia will go away when the changes you mentioned are made?
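
For reference, the two flows side by side (image tags are just examples):

# x86, Docker 19.03+ with the NVIDIA container toolkit:
docker run --rm --gpus all nvidia/cuda:10.2-base nvidia-smi

# Tegra today (no nvidia-smi on Tegra, so just drop into a shell):
sudo docker run -it --rm --runtime nvidia nvcr.io/nvidia/l4t-base:r32.3.1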

JetPack 4.3 makes that easier by copying the /etc/apt/sources.list.d/nvidia-l4t-apt-source.list from the host into the container

Indeed, obtaining and installing the apt key and adding the sources does work, but the image size does end up being pretty large. It’s large on x86 as well, but there it’s a one-time cost for the base images.

If I distribute something, ideally I’d like to use Nvidia’s images as a base so that people who pull my image only pull a ~100 meg delta.

My preliminary understanding is that --runtime nvidia will still be used to map the /dev files that enable the CUDA libraries in the container to communicate with the GPU device (in addition to the dev nodes for the codecs, camera, etc.).

So, I did some experimenting because somebody needs an image for Tegra, and I have some notes and issues. --runtime can’t be used at docker build, so that ends up being a bigger problem than I initially thought. I ended up working around that with a new base layer based off ubuntu:bionic, as you suggested and somebody else tried. For anybody who is interested, here it is, MIT license:

FROM ubuntu:bionic

# This determines what <SOC> gets filled in in the nvidia apt sources list:
# valid choices: t210, t186, t194
ARG SOC="t210"
# because Nvidia has no keyserver for Tegra currently, we DL the whole BSP tarball, just for the apt key.
ARG BSP_URI="https://developer.nvidia.com/embedded/dlc/r32-3-1_Release_v1.0/t210ref_release_aarch64/Tegra210_Linux_R32.3.1_aarch64.tbz2"
ARG BSP_SHA512="13c4dd8e6b20c39c4139f43e4c5576be4cdafa18fb71ef29a9acfcea764af8788bb597a7e69a76eccf61cbedea7681e8a7f4262cd44d60cefe90e7ca5650da8a"

WORKDIR /tmp
# install apt key and configure apt sources
RUN apt-get update && apt-get install -y --no-install-recommends \
        ca-certificates \
        wget \
    && BSP_SHA512_ACTUAL="$(wget --https-only -nv --show-progress --progress=bar:force:noscroll -O- ${BSP_URI} | tee bsp.tbz2 | sha512sum -b | cut -d ' ' -f 1)" \
    && [ ${BSP_SHA512_ACTUAL} = ${BSP_SHA512} ] \
    && echo "Extracting bsp.tbz2" \
    && tar --no-same-permissions -xjf bsp.tbz2 \
    && cp Linux_for_Tegra/nv_tegra/jetson-ota-public.key /etc/apt/trusted.gpg.d/jetson-ota-public.asc \
    && chmod 644 /etc/apt/trusted.gpg.d/jetson-ota-public.asc \
    && echo "deb https://repo.download.nvidia.com/jetson/common r32 main" > /etc/apt/sources.list.d/nvidia-l4t-apt-source.list \
    && echo "deb https://repo.download.nvidia.com/jetson/${SOC} r32 main" >> /etc/apt/sources.list.d/nvidia-l4t-apt-source.list \
    && rm -rf * \
    && apt-get purge -y --autoremove \
        wget \
    && rm -rf /var/lib/apt/lists/*

I then installed the bare minimum cuda dependencies for what I needed to build in a new image based off that base image.

... docker apt boilerplate ...
        cuda-compiler-10-0 \
        cuda-minimal-build-10-0 \
        cuda-libraries-dev-10-0 \
... build $THING ...
... apt purge boilerplate ...
        cuda-compiler-10-0 \
        cuda-minimal-build-10-0 \
        cuda-libraries-dev-10-0 \
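
Fleshed out, that pattern looks roughly like this (build_thing.sh standing in for whatever actually builds $THING); keeping the install, build, and purge in a single RUN is what keeps the cuda packages out of the final image:

RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-compiler-10-0 \
        cuda-minimal-build-10-0 \
        cuda-libraries-dev-10-0 \
    && /tmp/build_thing.sh \
    && apt-get purge -y --autoremove \
        cuda-compiler-10-0 \
        cuda-minimal-build-10-0 \
        cuda-libraries-dev-10-0 \
    && rm -rf /var/lib/apt/lists/*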

$THING builds, so there’s that, and I can mount cuda at runtime. So far, so good. The issue now is that some packages have scripts that depend on certain files being on the rootfs which are not present on ubuntu:bionic. Example of one:

root@d63ea9d82b79:/tmp# apt-get install nvidia-l4t-gstreamer
Reading package lists... Done
Building dependency tree       
... <snip> ...
Preparing to unpack .../nvidia-l4t-ccp-t186ref_32.3.1-20191209230245_arm64.deb ...
awk: cannot open /etc/nv_boot_control.conf (No such file or directory)
Unknown Tegra platform is detected. Package can't be installed, quit.
dpkg: error processing archive /var/cache/apt/archives/nvidia-l4t-ccp-t186ref_32.3.1-20191209230245_arm64.deb (--unpack):
 new nvidia-l4t-ccp-t186ref package pre-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 /var/cache/apt/archives/nvidia-l4t-ccp-t186ref_32.3.1-20191209230245_arm64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)

So I built $THING with cuda and gstreamer support, but I will have to add instructions to -v the gstreamer plugin dirs at docker run (as well as everything else) for accelerated decode to work.
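
Something along these lines at docker run, I expect (image name is just a placeholder, and the exact paths may differ between L4T releases):

sudo docker run -it --rm \
    -v /usr/lib/aarch64-linux-gnu/gstreamer-1.0:/usr/lib/aarch64-linux-gnu/gstreamer-1.0:ro \
    -v /usr/lib/aarch64-linux-gnu/tegra:/usr/lib/aarch64-linux-gnu/tegra:ro \
    -v /usr/local/cuda:/usr/local/cuda:ro \
    my-thing-image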

It seems the required file is not in any apt package:

root@d63ea9d82b79:/tmp# dpkg -S /etc/nv_boot_control.conf
dpkg-query: no path found matching pattern /etc/nv_boot_control.conf

I can copy it, but it looks like it’s specific to a board. Perhaps the check for this file could be removed from the scripts for nvidia-l4t-ccp-t186ref and /proc/device-tree/compatible used instead? Most of the packages without the nvidia-l4t prefix install with no issues; unfortunately, this one package is setting up a roadblock.
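
For example, something like this in the maintainer script would identify the SoC without needing any files from the BSP (on a Nano the compatible string contains "nvidia,tegra210"):

# the device tree compatible entry is a set of NUL-separated strings
tr -d '\0' < /proc/device-tree/compatible; echo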

So I’m experiencing some oddness. --runtime nvidia won’t mount anything unless the image is based on l4t-base. Example:

 $ sudo docker run -it --runtime nvidia --rm ubuntu:bionic
root@b7633548a37f:/# ls /usr/local
bin  etc  games  include  lib  man  sbin  share  src

Do I need to create all the mount points inside the image before using --runtime nvidia? Normally I’d just wait until the next version, where the behavior changes, but I’m hoping to help somebody fix their thing sometime this week if at all possible.

OK, given you are stuck here and I’m not sure why it isn’t mounting the CSVs, what if you added "default-runtime": "nvidia" to your /etc/docker/daemon.json file? If you go back to using l4t-base, then when you docker build, the nvidia runtime will be used. I do this, for example, when I need to run TensorRT engine generation during the docker build.

BTW I have also used this project to build a new base image before, when I needed to install CUDA inside the container: https://github.com/jetsistant/docker-cuda-jetpack


if you added "default-runtime": "nvidia" to your /etc/docker/daemon.json file. If you go back to using l4t-base

I modified my build script and am attempting the build. Thanks for the workaround! I will update as soon as ports.ubuntu.com stops failing and the build actually starts (or fails).

install CUDA inside the container

Well, I can do that (and tried it, and it works), but I was hoping to just install cuda, build $THING, and remove cuda (in the same layer) so it won’t take space in the image.

If I need to use --runtime nvidia anyway to access the GPU the way things currently work, I might as well. It just won’t run because of the lack of mounted stuff as mentioned in my last post :/

Even if cuda were in the container, the GPU devices wouldn’t be mounted, and that’s a lot of --device or -v to append at docker run. Useful to build packages maybe, or for a two-stage build, but not as a runnable image for GPU stuff, which is what I’m aiming at.
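
Just to illustrate what I mean (not a complete list; the exact nodes depend on the device and on what the application touches):

sudo docker run -it --rm \
    --device /dev/nvhost-ctrl \
    --device /dev/nvhost-ctrl-gpu \
    --device /dev/nvhost-prof-gpu \
    --device /dev/nvmap \
    --device /dev/nvhost-gpu \
    --device /dev/nvhost-as-gpu \
    my-cuda-image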

So, unfortunately, $THING does not find cudnn, while it does if cudnn is installed inside with apt using the ubuntu:bionic based container (but then --runtime nvidia won’t work).

I took a look at cudnn.csv and it doesn’t look like it mounts much of cudnn other than the runtime library itself, so there are no headers to build with.

I see there are options to mount devices, dirs, libraries, and symlinks. It looks like lib mounts regular files. I have updated it to this and will try again:

lib, /usr/lib/aarch64-linux-gnu/libcudnn.so.7.6.3
sym, /usr/lib/aarch64-linux-gnu/libcudnn.so.7
sym, /etc/alternatives/libcudnn_so
lib, /usr/include/aarch64-linux-gnu/cudnn_v7.h
sym, /etc/alternatives/libcudnn
sym, /usr/include/cudnn.h
lib, /usr/include/aarch64-linux-gnu/cudnn_v7.h
lib, /usr/lib/aarch64-linux-gnu/libcudnn_static_v7.a
sym, /etc/alternatives/libcudnn_stlib

Yeah, it’s not finding cudnn even with the above configuration. I think I’m just going to give up on nvidia-docker on Tegra for the moment until the bind mounting approach changes. I will help the project in question with Python packaging instead, perhaps.

Sorry about that - in the upcoming JetPack update, these CSV files are amended so that they also mount the development files for cuDNN and TensorRT, including headers and top-level .so’s.

What I do to work around this is copy the needed development headers into my docker workspace, and then into the container. First, I run this script on my Jetson (outside of docker), which copies the files I need into the packages directory under my local project’s workspace:

#
# this script copies development files and headers from the target host
# into the packages dir, which get used during building some containers
#

mkdir -p packages/usr/include
mkdir -p packages/usr/include/aarch64-linux-gnu
mkdir -p packages/usr/lib/python3.6/dist-packages

cp /usr/include/cublas*.h packages/usr/include
cp /usr/include/cudnn*.h packages/usr/include

cp /usr/include/aarch64-linux-gnu/Nv*.h packages/usr/include/aarch64-linux-gnu

cp -r /usr/lib/python3.6/dist-packages/tensorrt* packages/usr/lib/python3.6/dist-packages
cp -r /usr/lib/python3.6/dist-packages/graphsurgeon* packages/usr/lib/python3.6/dist-packages
cp -r /usr/lib/python3.6/dist-packages/uff* packages/usr/lib/python3.6/dist-packages

Then, in my Dockerfile, I install them into my container’s /usr directory:

COPY packages/usr /usr

You might also need to make some symlinks in your container - for example:

RUN printenv && \
    ls -ll /usr/lib/aarch64-linux-gnu/tegra && \
    ls -ll /usr/lib/aarch64-linux-gnu/libnv* && \
    ln -s /usr/lib/aarch64-linux-gnu/libnvinfer.so.6 /usr/lib/aarch64-linux-gnu/libnvinfer.so && \
    ln -s /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so.6 /usr/lib/aarch64-linux-gnu/libnvinfer_plugin.so && \
    ln -s /usr/lib/aarch64-linux-gnu/libnvparsers.so.6 /usr/lib/aarch64-linux-gnu/libnvparsers.so && \
    ln -s /usr/lib/aarch64-linux-gnu/libnvonnxparser.so.6 /usr/lib/aarch64-linux-gnu/libnvonnxparser.so && \
    ls -ll /usr/lib/aarch64-linux-gnu/libnv* && \
    ldd /usr/lib/aarch64-linux-gnu/libnvinfer.so 

Anyways, as mentioned above, this should be fixed in the next JetPack, since the CSVs were amended with these entries. However, that is the workaround I was using beforehand.

It’s on the roadmap to replace the CSV mounting approach with fully-containerized images (that have CUDA/etc. installed directly into the containers); however, this is not planned until the JetPack release after next.

Sorry about that - in the upcoming JetPack update, these CSV files are amended so that they also mount the development files for cuDNN and TensorRT, including headers and top-level .so’s.

No worries. I’m kind of surprised the CSVs aren’t automatically generated based on the debian package manifests or something.

Re: scripts: Thanks, Dustin. I will try applying your workaround later today or tomorrow. Either that, or I’ll try adding the apt sources to l4t-base, installing cudnn, building $THING, and purging (--autoremove) cudnn, as I attempted to do with ubuntu:bionic (it worked… just wouldn’t mount stuff at runtime).

For the next immediate JetPack, would it be possible to fix the issue with the nvidia runtime not working with containers that aren’t based on l4t-base?

So, I added the apt sources to l4t-base, and $THING finds cudnn and is building. I pushed it here in case anybody finds it useful. I will test to see if it mounts the needed stuff and update here.

The Dockerfile and build script are here for those wishing to avoid Docker Hub.


Just to update: the above strategy works. I am able to:

FROM mdegans/l4t-base:latest

RUN apt-get update && apt-get install -y --autoremove \
        libfoo \
        libfoo-dev \
        cuda-foo-dev \
    && build_bar.sh \
    && apt-get purge -y --autoremove \
        libfoo-dev \
        cuda-foo-dev \
    && rm -rf /var/lib/apt/lists/*

And it mounts cuda-foo at docker run with --runtime nvidia. If I figure out what’s causing --runtime nvidia to fail on ubuntu:bionic and friends, I will update, but for the moment, this is a working solution for me without having to change the default runtime. Thanks for your help, Dustin!

A combination of the two methods above finally allowed me to build OpenCV 4.5.0 with cuDNN/cuBLAS:
https://github.com/AndreV84/Jetson/blob/master/opencv450_cudnn_cublas.Dockerfile

cool.

I updated my GitHub’s docker branch the other day to 4.5.0

Built images are here:
https://hub.docker.com/r/mdegans/tegra-opencv

updates:
source: https://github.com/AndreV84/Jetson/blob/master/opencv451_nx_docker

xhost +
 docker run -it --rm --net=host --runtime nvidia -e DISPLAY=$DISPLAY --privileged --ipc=host -v /tmp/.X11-unix/:/tmp/.X11-unix/ -v /tmp/argus_socket:/tmp/argus_socket --cap-add SYS_PTRACE iad.ocir.io/idso6d7wodhe/jetson_nx/opencv541
cp -r /usr/local/opencv-4.5.1-dev/lib/python3.6/dist-packages/cv2 /usr/lib/python3.6/dist-packages/cv2
export OPENCV_VERSION=opencv-4.5.1-dev
export LD_LIBRARY_PATH=/usr/local/$OPENCV_VERSION/lib

though reading from the camera is still being worked out, as it works from containerized gstreamer but not from opencv’s gstreamer
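
A quick way to narrow that down inside the container is to test the GStreamer side on its own; if a pipeline like the one below runs, the container’s nvarguscamerasrc/GStreamer path is fine and the problem is in how OpenCV was built or in the capture pipeline string (caps below are just an example):

gst-launch-1.0 -v nvarguscamerasrc ! \
    'video/x-raw(memory:NVMM),width=1280,height=720,framerate=30/1' ! \
    nvvidconv ! fakesink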