szorl
October 8, 2024, 11:28am
We are building a server backend for the AGX Orin which runs a variety of ML workloads, using Docker containers. While the dustynv ollama container accesses the GPU correctly, the GPU is not accessible from our main Python app container, leaving a large part of our stack (computer vision, TTS, and STT models) unaccelerated.
We have added "default-runtime": "nvidia" to our /etc/docker/daemon.json as described here, but this did not resolve the issue.
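For reference, this is roughly what our /etc/docker/daemon.json looks like after the change (a minimal sketch; the "runtimes" entry assumes the standard nvidia-container-runtime install), followed by a restart of the Docker daemon:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}

sudo systemctl restart docker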
We have managed to install PyTorch with CUDA support on the AGX Orin directly, using the wheel files mentioned here for our versions of Python (3.10.12), CUDA (12.2.140), and JetPack (6.0).
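Roughly, the direct install on the device looked like this (a sketch; the wheel filename shown is the JetPack 6 / Python 3.10 build we used, downloaded beforehand):

# Install the NVIDIA-built PyTorch wheel directly on the AGX Orin, then check CUDA
python3 -m pip install torch-2.4.0a0+3bcc3cddb5.nv24.07.16234504-cp310-cp310-linux_aarch64.whl
python3 -c "import torch; print(torch.cuda.is_available())"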
However, we are struggling to use PyTorch from the same wheels inside our container, running into ValueError: libcublas.so.*[0-9] not found in the system path ['/code', '/code', '/usr/local/lib/python310.zip', '/usr/local/lib/python3.10', '/usr/local/lib/python3.10/lib-dynload', '/usr/local/lib/python3.10/site-packages'].
The GPUtil library does not recognise any GPU from inside the container (it does when run directly on the AGX Orin). Monitoring with jtop shows the GPU idle while local models run through the Ultralytics or Hugging Face Transformers libraries in our container.
Could someone point out what we are missing here?
Hi
How are you creating the containers?
Did you try manually setting --runtime nvidia in the docker run command?
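For example, to quickly check GPU access with the runtime selected explicitly (the image name is just a placeholder):

# Run nvidia-smi inside a container with the NVIDIA runtime explicitly selected
docker run --rm --runtime nvidia your-app-image:latest nvidia-smi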
Hi,
Which base image are you using?
For Jetson devices, please start with l4t-base or l4t-cuda.
Thanks.
szorl
October 11, 2024, 6:09pm
Thank you both, I followed your recommendations and used nvcr.io/nvidia/l4t-base:r36.2.0 as the base image in the Dockerfile.
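The top of the Dockerfile now looks roughly like this (a sketch; the /code working directory and the remaining build steps are specific to our app):

# Jetson-aware base image recommended above
FROM nvcr.io/nvidia/l4t-base:r36.2.0

# Application directory (the rest of our build steps are omitted here)
WORKDIR /code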
I can now see the following when running nvidia-smi inside the container:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 540.3.0 Driver Version: N/A CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Orin (nvgpu) N/A | N/A N/A | N/A |
| N/A N/A N/A N/A / N/A | Not Supported | N/A N/A |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
However, torch.cuda.device_count() returned 0; I realised this was because the correct PyTorch wheel was missing.
I downloaded torch-2.4.0a0+3bcc3cddb5.nv24.07.16234504-cp310-cp310-linux_aarch64.whl and rebuilt the containers.
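In the Dockerfile this amounts to something like the following (a sketch; the wheels/ path in the build context is a placeholder):

# Copy the pre-downloaded JetPack 6 PyTorch wheel into the image and install it
COPY wheels/torch-2.4.0a0+3bcc3cddb5.nv24.07.16234504-cp310-cp310-linux_aarch64.whl /tmp/
RUN python3 -m pip install /tmp/torch-2.4.0a0+3bcc3cddb5.nv24.07.16234504-cp310-cp310-linux_aarch64.whl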
I then found that some torch dependencies were missing, referred to Libopenblas.so.0 not found - #5 by dusty_nv, and ran ldd:
ldd /usr/local/lib/python3.10/dist-packages/torch/_C.cpython-310-aarch64-linux-gnu.so
linux-vdso.so.1 (0x0000ffffa687f000)
libtorch_python.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so (0x0000ffffa5990000)
libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffffa5960000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffa57b0000)
libtorch.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so (0x0000ffffa5780000)
libshm.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libshm.so (0x0000ffffa5750000)
libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x0000ffffa5720000)
libtorch_cpu.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so (0x0000ffff9f620000)
libtorch_cuda.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so (0x0000ffff62130000)
libc10_cuda.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so (0x0000ffff62080000)
libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x0000ffff61fb0000)
libc10.so => /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so (0x0000ffff61ec0000)
libcudnn.so.8 => /lib/aarch64-linux-gnu/libcudnn.so.8 (0x0000ffff61e80000)
libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000ffff61c50000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffff61c20000)
/lib/ld-linux-aarch64.so.1 (0x0000ffffa6846000)
librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000ffff61c00000)
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff61b60000)
libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000ffff61b40000)
libopenblas.so.0 => not found
libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000ffff61ae0000)
libcupti.so.12 => /usr/local/cuda/lib64/libcupti.so.12 (0x0000ffff61460000)
libcusparse.so.12 => /usr/local/cuda/lib64/libcusparse.so.12 (0x0000ffff50130000)
libcufft.so.11 => /usr/local/cuda/lib64/libcufft.so.11 (0x0000ffff45610000)
libcusparseLt.so.0 => not found
libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x0000ffff3e700000)
libcublas.so.12 => /usr/local/cuda/lib64/libcublas.so.12 (0x0000ffff37230000)
libcublasLt.so.12 => /usr/local/cuda/lib64/libcublasLt.so.12 (0x0000ffff18ac0000)
libnvJitLink.so.12 => /usr/local/cuda/lib64/libnvJitLink.so.12 (0x0000ffff15d90000)
I added libopenblas-dev, libcusparselt0 and libcusparselt-dev (the libcusparselt packages as referenced in Compiling torchvision 0.19.0 for torch 2.4.0a0+07cecf4168.nv24.05.14710581 - #4 by AastaLLL) to the Dockerfile and it's fine now, thank you.
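For anyone hitting the same errors, the added Dockerfile lines look roughly like this (a sketch; it assumes the apt repository providing the libcusparselt packages is reachable during the build):

# Runtime libraries that the NVIDIA PyTorch wheel links against but the base image lacks
RUN apt-get update && apt-get install -y --no-install-recommends \
        libopenblas-dev \
        libcusparselt0 \
        libcusparselt-dev \
    && rm -rf /var/lib/apt/lists/*

After rebuilding, python3 -c "import torch; print(torch.cuda.device_count())" inside the container is a quick way to confirm the GPU is visible.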