GPU not accessible in a custom Docker container

We are building a server backend for the AGX Orin that runs a variety of ML workloads, all packaged as Docker containers. While the dustynv ollama container accesses the GPU correctly, the GPU is not accessible from our main Python app container, leaving a large part of our stack (computer vision, TTS and STT models) unaccelerated.

We have added "default-runtime": "nvidia" to our /etc/docker/daemon.json as described here, but this does not resolve the issue.
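For reference, a minimal sketch of what that daemon.json looks like on our side, assuming the nvidia runtime entry registered by the standard nvidia-container-toolkit / JetPack install:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}

We restart the Docker daemon (sudo systemctl restart docker) after editing it.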

We have managed to install PyTorch with CUDA support on the AGX Orin directly, using the wheel files mentioned here, matching our versions of Python (3.10.12), CUDA (12.2.140), and JetPack (6.0).
However, we are struggling to use PyTorch with the same wheels from within our container, running into ValueError: libcublas.so.*[0-9] not found in the system path ['/code', '/code', '/usr/local/lib/python310.zip', '/usr/local/lib/python3.10', '/usr/local/lib/python3.10/lib-dynload', '/usr/local/lib/python3.10/site-packages'].
The GPUtil library does not recognise any GPU (while it does when run directly on the AGX Orin), and, monitored with jtop, the GPU stays idle while we run local models through the Ultralytics or Hugging Face Transformers libraries from our container.
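For completeness, this is the kind of quick check we run inside the container (purely illustrative, the prints are our own):

# Quick GPU-visibility sanity check run inside the container (illustrative only).
import GPUtil

# Empty inside our container, non-empty when run directly on the AGX Orin.
print("GPUs seen by GPUtil:", GPUtil.getGPUs())

try:
    import torch  # inside our container this raises the libcublas ValueError above
    print("torch.cuda.is_available():", torch.cuda.is_available())
    print("torch.cuda.device_count():", torch.cuda.device_count())
except Exception as exc:
    print("torch import / CUDA check failed:", exc)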

Could someone point out what we are missing here?

Hi

How are you creating the containers?

Did you try manually setting --runtime nvidia in the docker run command?
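For example (the image name is just a placeholder):

docker run --rm -it --runtime nvidia your-image:latest nvidia-smi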


Hi,

Which base image do you use?
For Jetson devices, please start with l4t-base or l4t-cuda.
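For example, in your Dockerfile (pick the tag matching your JetPack release):

FROM nvcr.io/nvidia/l4t-base:r36.2.0
# or, if you want the CUDA toolkit preinstalled:
# FROM nvcr.io/nvidia/l4t-cuda:<tag matching your JetPack/CUDA version>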

Thanks.


Thank you both, I followed your recommendations and used nvcr.io/nvidia/l4t-base:r36.2.0 as the base image in the Dockerfile.

I can now see the following when running nvidia-smi inside the container.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 540.3.0                Driver Version: N/A          CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Orin (nvgpu)                  N/A  | N/A              N/A |                  N/A |
| N/A   N/A  N/A               N/A /  N/A | Not Supported        |     N/A          N/A |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

However, torch.cuda.device_count() returned 0; I realised this was because the correct wheel was missing.

I downloaded torch-2.4.0a0+3bcc3cddb5.nv24.07.16234504-cp310-cp310-linux_aarch64.whl and rebuilt the containers.
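The relevant Dockerfile step is roughly the following sketch (how the wheel gets into the build context, and the exact path, will differ per setup):

COPY torch-2.4.0a0+3bcc3cddb5.nv24.07.16234504-cp310-cp310-linux_aarch64.whl /tmp/
RUN pip3 install --no-cache-dir /tmp/torch-2.4.0a0+3bcc3cddb5.nv24.07.16234504-cp310-cp310-linux_aarch64.whl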

Then I found that some torch dependencies were missing, referred to Libopenblas.so.0 not found - #5 by dusty_nv, and ran ldd:

ldd /usr/local/lib/python3.10/dist-packages/torch/_C.cpython-310-aarch64-linux-gnu.so
	linux-vdso.so.1 (0x0000ffffa687f000)
	libtorch_python.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so (0x0000ffffa5990000)
	libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffffa5960000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffa57b0000)
	libtorch.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so (0x0000ffffa5780000)
	libshm.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libshm.so (0x0000ffffa5750000)
	libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x0000ffffa5720000)
	libtorch_cpu.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so (0x0000ffff9f620000)
	libtorch_cuda.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so (0x0000ffff62130000)
	libc10_cuda.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so (0x0000ffff62080000)
	libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x0000ffff61fb0000)
	libc10.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so (0x0000ffff61ec0000)
	libcudnn.so.8 => /lib/aarch64-linux-gnu/libcudnn.so.8 (0x0000ffff61e80000)
	libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffff61c50000)
	libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffff61c20000)
	/lib/ld-linux-aarch64.so.1 (0x0000ffffa6846000)
	librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000ffff61c00000)
	libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff61b60000)
	libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000ffff61b40000)
	libopenblas.so.0 => not found
	libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000ffff61ae0000)
	libcupti.so.12 => /usr/local/cuda/lib64/libcupti.so.12 (0x0000ffff61460000)
	libcusparse.so.12 => /usr/local/cuda/lib64/libcusparse.so.12 (0x0000ffff50130000)
	libcufft.so.11 => /usr/local/cuda/lib64/libcufft.so.11 (0x0000ffff45610000)
	libcusparseLt.so.0 => not found
	libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x0000ffff3e700000)
	libcublas.so.12 => /usr/local/cuda/lib64/libcublas.so.12 (0x0000ffff37230000)
	libcublasLt.so.12 => /usr/local/cuda/lib64/libcublasLt.so.12 (0x0000ffff18ac0000)
	libnvJitLink.so.12 => /usr/local/cuda/lib64/libnvJitLink.so.12 (0x0000ffff15d90000)

Added libopenblas-dev, libcusparselt0 and libcusparselt-dev (the libcusparselt packages as referenced in Compiling torchvision 0.19.0 for torch 2.4.0a0+07cecf4168.nv24.05.14710581 - #4 by AastaLLL) to the Dockerfile, and now everything is fine, thank you.
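For anyone hitting the same missing libraries, the Dockerfile addition is roughly this (assuming those packages are available from the apt sources configured in the base image):

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        libopenblas-dev \
        libcusparselt0 \
        libcusparselt-dev && \
    rm -rf /var/lib/apt/lists/*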
