Triton Runtime Error in NGC PyTorch 25.04 Apptainer Container: libcuda.so Not Found

Environment:

  • Cluster: NEC SX‑Aurora or similar HPC
  • Host OS: Ubuntu 22.04
  • Apptainer version: 1.x
  • NGC Container: nvcr.io/nvidia/pytorch:25.04-py3 (pulled as pytorch_25.04.sif)
  • CUDA toolkit on host: 12.8.1 (module cuda/12.8.1 ucx/1.18.0 via openmpi/5.0.7/gcc11.4.0-cuda12.8.1)
  • PyTorch inside container: 2.5.0a0 (nightly), Python 3.12
  • Other deps: Triton, Transformer‑Engine, Megatron‑LM v2.x

What I’m trying to do:
Run the Megatron‑LM miniature training example https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start inside an Apptainer container on 2 nodes with 1 GPU per node via OpenMPI + NCCL, using Triton‑backed kernels for tensor parallelism. On the HPC system, however, Triton fails with a driver lookup error.

Problem:
Inside the container, Triton’s NVIDIA driver backend fails with:

AssertionError: libcuda.so cannot found!
Possible files are located at ['/usr/local/cuda/compat/lib/libcuda.so.1'].
Please create a symlink of libcuda.so to any of the files.

I found that the actual libcuda.so.1 is located under /usr/local/cuda/compat/lib.real/ instead.
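The error itself hints at the root cause: Triton looks for a file literally named libcuda.so, while the container only ships the versioned libcuda.so.1 (hence the "Please create a symlink" message). The naming mismatch can be reproduced in miniature with a scratch directory standing in for the compat directory (nothing here is container-specific):

```shell
# Scratch directory stands in for /usr/local/cuda/compat/lib.real:
# it contains only the versioned name, so a lookup for the bare
# "libcuda.so" fails until a symlink provides that exact name.
scratch=$(mktemp -d)
touch "$scratch/libcuda.so.1"

[ -e "$scratch/libcuda.so" ] && echo "found" || echo "missing"   # missing

ln -s libcuda.so.1 "$scratch/libcuda.so"
[ -e "$scratch/libcuda.so" ] && echo "found" || echo "missing"   # found

rm -rf "$scratch"
```

This is why any fix that provides the bare name (a symlink, or pointing Triton at a directory that contains one) resolves the error.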

The only workaround that succeeded was converting the SIF into a writable sandbox and creating the symlink:

# 1) Build writable sandbox
apptainer build --sandbox pytorch_sandbox_2504 pytorch_25.04.sif

# 2) Enter and patch
apptainer exec --nv --writable pytorch_sandbox_2504 bash -lc '
  cd /usr/local/cuda/compat
  ln -s lib.real lib
'

# 3) Run training
mpirun -np 2 --map-by ppr:1:node \
  apptainer exec --nv \
    pytorch_sandbox_2504 \
    python run_simple_mcore_train_loop.py --timestamp $timestamp

After this, /usr/local/cuda/compat/lib/libcuda.so is present and Triton loads the driver successfully.

Questions:

  • Is there a cleaner way to satisfy Triton’s driver lookup within an Apptainer container without a sandbox conversion?
  • Why does the NGC image place driver libraries under compat/lib.real instead of compat/lib?
  • Any best practices for running Triton (and Transformer‑Engine) inside NGC containers on HPC systems?

I don’t really have an answer to your questions, but instead of a symlink, it’s better to add the path to the system ld path:

echo "/usr/local/cuda/compat/lib.real" >> /etc/ld.so.conf
ldconfig

Put that in the %post section of your .def file.
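A minimal sketch of such a definition file, assuming you rebuild from the SIF you already pulled (the `Bootstrap: localimage` approach is one option; the output filename is up to you):

```
Bootstrap: localimage
From: pytorch_25.04.sif

%post
    # Make the compat driver libraries resolvable by the dynamic linker
    echo "/usr/local/cuda/compat/lib.real" >> /etc/ld.so.conf
    ldconfig
```

Building from this with `apptainer build` gives you a patched, read-only image, so no writable sandbox conversion is needed.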

# IMPORTANT: ignore user-site so we use container's Triton, not ~/.local
export PYTHONNOUSERSITE=1
export SINGULARITYENV_PYTHONNOUSERSITE=1

# Tell Triton exactly where libcuda.so.1 lives INSIDE the container
export SINGULARITYENV_TRITON_LIBCUDA_PATH="/usr/local/cuda/compat/lib"

# Also help the dynamic linker
export SINGULARITYENV_LD_LIBRARY_PATH="/usr/local/cuda/compat/lib:${LD_LIBRARY_PATH:-}"

Run this before launching the container. For example, here I use a Python venv layered on top of the container, so I don’t have to make the image writable:

SIF="/containers/pytorch_25.10-py3.sif"
VENV="/virtualenvs/ngc-pytorch-25.10"

# Run venv python inside the container, preserving CWD and enabling GPUs.

# VENV is expanded inside the container, so pass it through explicitly.
exec singularity exec --nv \
  --env VENV="$VENV" \
  --bind /data:/data --bind "$HOME:/home/$USER" \
  --pwd "$PWD" \
  "$SIF" bash -lc '
set -euo pipefail
exec "$VENV/bin/python" "$@"
' python-wrapper "$@"

I don’t know the answer, but this was 7 months ago, so I hope your problem is resolved by now, bro.