Environment:
- Cluster: NEC SX‑Aurora or similar HPC
- Host OS: Ubuntu 22.04
- Apptainer version: 1.x
- NGC Container: nvcr.io/nvidia/pytorch:25.04-py3 (pulled as pytorch_25.04.sif)
- CUDA toolkit on host: 12.8.1 (modules cuda/12.8.1 and ucx/1.18.0, loaded via openmpi/5.0.7/gcc11.4.0-cuda12.8.1)
- PyTorch inside container: 2.5.0a0 (nightly), Python 3.12
- Other deps: Triton, Transformer‑Engine, Megatron‑LM v2.x
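For reproducibility, this is roughly how the host toolchain and image are set up (module names as listed above; the pull command is the standard Apptainer one):

# Host-side modules, as listed above
module load cuda/12.8.1 ucx/1.18.0 openmpi/5.0.7/gcc11.4.0-cuda12.8.1
# Pull the NGC image into a SIF
apptainer pull pytorch_25.04.sif docker://nvcr.io/nvidia/pytorch:25.04-py3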
What I’m trying to do:
Run the Megatron‑LM miniature training example (https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start) inside an Apptainer container across 2 nodes with 1 GPU per node, launched via OpenMPI with NCCL as the communication backend, using Triton‑backed kernels for tensor parallelism. On this HPC system, however, Triton fails with a driver lookup error.
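For completeness, the rank-mapping wrapper I use between mpirun and Python is sketched below. It assumes the example script initializes torch.distributed from the standard env:// variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE), which I have not dug into; rank_env.sh is just my hypothetical name for it:

#!/bin/bash
# rank_env.sh -- map OpenMPI's per-process variables onto the env://
# variables torch.distributed expects, then exec the real command.
export RANK="${OMPI_COMM_WORLD_RANK}"
export WORLD_SIZE="${OMPI_COMM_WORLD_SIZE}"
export LOCAL_RANK="${OMPI_COMM_WORLD_LOCAL_RANK}"
export MASTER_PORT="${MASTER_PORT:-29500}"
# MASTER_ADDR must point at rank 0's node and be exported before mpirun.
exec "$@"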
Problem:
Inside the container, Triton’s NVIDIA driver backend fails with:
AssertionError: libcuda.so cannot found!
Possible files are located at ['/usr/local/cuda/compat/lib/libcuda.so.1'].
Please create a symlink of libcuda.so to any of the files.
The actual libcuda.so.1 turns out to be located under /usr/local/cuda/compat/lib.real/ instead.
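For anyone reproducing this, the layout as I observed it in the unmodified image:

# Inspect the compat directory inside the pristine SIF
apptainer exec --nv pytorch_25.04.sif bash -lc '
ls -l /usr/local/cuda/compat/           # lib.real/ is there; lib/ is not
ls -l /usr/local/cuda/compat/lib.real/  # libcuda.so.1 lives here
'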
The only workaround that succeeded was converting the SIF into a writable sandbox and creating the symlink:
# 1) Build writable sandbox
apptainer build --sandbox pytorch_sandbox_2504 pytorch_25.04.sif
# 2) Enter and patch
apptainer exec --nv --writable pytorch_sandbox_2504 bash -lc '
cd /usr/local/cuda/compat
ln -s lib.real lib
'
# 3) Run training
mpirun -np 2 --map-by ppr:1:node \
apptainer exec --nv \
pytorch_sandbox_2504 \
python run_simple_mcore_train_loop.py --timestamp $timestamp
After this, /usr/local/cuda/compat/lib/libcuda.so is present and Triton loads the driver successfully.
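To convince myself that the fix satisfies Triton's own driver lookup (and not just PyTorch's), I run a minimal Triton kernel inside the patched sandbox; it is just the standard vector-add from the Triton tutorials:

apptainer exec --nv pytorch_sandbox_2504 python - <<'PY'
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # JIT-compiling this kernel is what triggers the libcuda.so lookup
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.rand(1024, device="cuda")
y = torch.rand(1024, device="cuda")
out = torch.empty_like(x)
add_kernel[(1,)](x, y, out, 1024, BLOCK=1024)
assert torch.allclose(out, x + y)
print("Triton found the driver and the kernel ran")
PY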
Questions:
- Is there a cleaner way to satisfy Triton’s driver lookup inside an Apptainer container, without converting the SIF to a writable sandbox? (Sketches of what I had in mind follow this list.)
- Why does the NGC image ship the compat driver libraries under compat/lib.real instead of compat/lib?
- Are there best practices for running Triton (and Transformer‑Engine) inside NGC containers on HPC systems?
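For reference, these are the kinds of non-sandbox approaches I had in mind but have not verified; the staging directory is hypothetical, and the TRITON_LIBCUDA_PATH behaviour is my reading of Triton's NVIDIA driver backend, not something I have confirmed:

# Idea 1 (untested): copy lib.real out to a host staging directory once,
# then bind it over the path Triton searches -- no sandbox conversion.
# $HOST_STAGING is a hypothetical, container-visible host directory.
apptainer exec pytorch_25.04.sif \
  cp -r /usr/local/cuda/compat/lib.real "$HOST_STAGING"
mpirun -np 2 --map-by ppr:1:node \
  apptainer exec --nv --bind "$HOST_STAGING":/usr/local/cuda/compat/lib \
  pytorch_25.04.sif \
  python run_simple_mcore_train_loop.py --timestamp $timestamp

# Idea 2 (untested): point Triton at the directory directly; Apptainer
# forwards APPTAINERENV_* variables into the container, and mpirun's -x
# exports the variable to the remote nodes.
export APPTAINERENV_TRITON_LIBCUDA_PATH=/usr/local/cuda/compat/lib.real
mpirun -np 2 --map-by ppr:1:node -x APPTAINERENV_TRITON_LIBCUDA_PATH \
  apptainer exec --nv pytorch_25.04.sif \
  python run_simple_mcore_train_loop.py --timestamp $timestamp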