GPU not accessible in a custom Docker container

We are building a server backend for the AGX Orin that runs a variety of ML workloads, all packaged as Docker containers. While the dustynv ollama container accesses the GPU correctly, the GPU is not accessible from our main Python app container, leaving a large part of our stack (computer vision, TTS and STT models) unaccelerated.

We have added "default-runtime": "nvidia" to our /etc/docker/daemon.json as described here, but this does not resolve the issue.
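For reference, a minimal sketch of what that daemon.json looks like on our side, assuming the nvidia runtime entry registered by the standard nvidia-container-toolkit / JetPack install:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}

We restart the Docker daemon (sudo systemctl restart docker) after editing it.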

We have managed to install PyTorch with CUDA support on the AGX Orin directly, using the wheel files mentioned here, matching our versions of Python (3.10.12), CUDA (12.2.140), and JetPack (6.0).
However, we are struggling to use PyTorch with the same wheels from within our container, running into ValueError: libcublas.so.*[0-9] not found in the system path ['/code', '/code', '/usr/local/lib/python310.zip', '/usr/local/lib/python3.10', '/usr/local/lib/python3.10/lib-dynload', '/usr/local/lib/python3.10/site-packages'].
The GPUtil library does not recognise any GPU (while it does when run directly on the AGX Orin), and, monitored with jtop, the GPU stays idle while we run local models through the Ultralytics or Hugging Face Transformers libraries from our container.
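For completeness, this is the kind of quick check we run inside the container (purely illustrative, the prints are our own):

# Quick GPU-visibility sanity check run inside the container (illustrative only).
import GPUtil

# Empty inside our container, non-empty when run directly on the AGX Orin.
print("GPUs seen by GPUtil:", GPUtil.getGPUs())

try:
    import torch  # inside our container this raises the libcublas ValueError above
    print("torch.cuda.is_available():", torch.cuda.is_available())
    print("torch.cuda.device_count():", torch.cuda.device_count())
except Exception as exc:
    print("torch import / CUDA check failed:", exc)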

Could someone point out what we are missing here?

Hi

How are you creating the containers?

Did you try manually setting --runtime nvidia in the docker run command?
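For example (the image name is just a placeholder):

docker run --rm -it --runtime nvidia your-image:latest nvidia-smi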


Hi,

Which base image do you use?
For Jetson devices, please start with l4t-base or l4t-cuda.
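For example, in your Dockerfile (pick the tag matching your JetPack release):

FROM nvcr.io/nvidia/l4t-base:r36.2.0
# or, if you want the CUDA toolkit preinstalled:
# FROM nvcr.io/nvidia/l4t-cuda:<tag matching your JetPack/CUDA version>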

Thanks.


Thank you both, I followed your recommendations and used nvcr.io/nvidia/l4t-base:r36.2.0 as the base image in the Dockerfile.

I can now see the following when running nvidia-smi inside the container.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 540.3.0                Driver Version: N/A          CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Orin (nvgpu)                  N/A  | N/A              N/A |                  N/A |
| N/A   N/A  N/A               N/A /  N/A | Not Supported        |     N/A          N/A |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

However, torch.cuda.device_count() returned 0; I realised this was because the correct wheel was missing.

I downloaded torch-2.4.0a0+3bcc3cddb5.nv24.07.16234504-cp310-cp310-linux_aarch64.whl and rebuilt the containers.
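The relevant Dockerfile step is roughly the following sketch (how the wheel gets into the build context, and the exact path, will differ per setup):

COPY torch-2.4.0a0+3bcc3cddb5.nv24.07.16234504-cp310-cp310-linux_aarch64.whl /tmp/
RUN pip3 install --no-cache-dir /tmp/torch-2.4.0a0+3bcc3cddb5.nv24.07.16234504-cp310-cp310-linux_aarch64.whl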

Then I found that some torch dependencies were missing, referred to Libopenblas.so.0 not found - #5 by dusty_nv, and ran ldd:

ldd /usr/local/lib/python3.10/dist-packages/torch/_C.cpython-310-aarch64-linux-gnu.so
	linux-vdso.so.1 (0x0000ffffa687f000)
	libtorch_python.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so (0x0000ffffa5990000)
	libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffffa5960000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffa57b0000)
	libtorch.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so (0x0000ffffa5780000)
	libshm.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libshm.so (0x0000ffffa5750000)
	libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x0000ffffa5720000)
	libtorch_cpu.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so (0x0000ffff9f620000)
	libtorch_cuda.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so (0x0000ffff62130000)
	libc10_cuda.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so (0x0000ffff62080000)
	libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x0000ffff61fb0000)
	libc10.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so (0x0000ffff61ec0000)
	libcudnn.so.8 => /lib/aarch64-linux-gnu/libcudnn.so.8 (0x0000ffff61e80000)
	libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffff61c50000)
	libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffff61c20000)
	/lib/ld-linux-aarch64.so.1 (0x0000ffffa6846000)
	librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000ffff61c00000)
	libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff61b60000)
	libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000ffff61b40000)
	libopenblas.so.0 => not found
	libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000ffff61ae0000)
	libcupti.so.12 => /usr/local/cuda/lib64/libcupti.so.12 (0x0000ffff61460000)
	libcusparse.so.12 => /usr/local/cuda/lib64/libcusparse.so.12 (0x0000ffff50130000)
	libcufft.so.11 => /usr/local/cuda/lib64/libcufft.so.11 (0x0000ffff45610000)
	libcusparseLt.so.0 => not found
	libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x0000ffff3e700000)
	libcublas.so.12 => /usr/local/cuda/lib64/libcublas.so.12 (0x0000ffff37230000)
	libcublasLt.so.12 => /usr/local/cuda/lib64/libcublasLt.so.12 (0x0000ffff18ac0000)
	libnvJitLink.so.12 => /usr/local/cuda/lib64/libnvJitLink.so.12 (0x0000ffff15d90000)

Added libopenblas-dev, libcusparselt0 and libcusparselt-dev (the libcusparselt packages as referenced in Compiling torchvision 0.19.0 for torch 2.4.0a0+07cecf4168.nv24.05.14710581 - #4 by AastaLLL) to the Dockerfile, and now everything is fine, thank you.
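For anyone hitting the same missing libraries, the Dockerfile addition is roughly this (assuming those packages are available from the apt sources configured in the base image):

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        libopenblas-dev \
        libcusparselt0 \
        libcusparselt-dev && \
    rm -rf /var/lib/apt/lists/*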
