Triton Runtime Error in NGC PyTorch 25.04 Apptainer Container: libcuda.so Not Found

Environment:

  • Cluster: NEC SX‑Aurora or similar HPC
  • Host OS: Ubuntu 22.04
  • Apptainer version: 1.x
  • NGC Container: nvcr.io/nvidia/pytorch:25.04-py3 (pulled as pytorch_25.04.sif)
  • CUDA toolkit on host: 12.8.1 (modules: cuda/12.8.1, ucx/1.18.0, openmpi/5.0.7/gcc11.4.0-cuda12.8.1)
  • PyTorch inside container: 2.5.0a0 (nightly), Python 3.12
  • Other deps: Triton, Transformer‑Engine, Megatron‑LM v2.x
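For reference, the SIF was produced by pulling the NGC image directly; the tag matches the environment list above (a standard `apptainer pull` invocation, shown here for completeness rather than as the exact command used):

```shell
# Pull the NGC PyTorch 25.04 image and convert it to a SIF in one step
apptainer pull pytorch_25.04.sif docker://nvcr.io/nvidia/pytorch:25.04-py3
```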

What I’m trying to do:
Run the Megatron‑LM minimal training example (https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start) under an Apptainer container on 2 nodes, 1 GPU per node, via OpenMPI + NCCL, with Triton‑backed kernels for tensor parallelism. On the HPC system, however, Triton fails with a driver lookup error.

Problem:
Inside the container, Triton’s NVIDIA driver backend fails with:

AssertionError: libcuda.so cannot found!
Possible files are located at ['/usr/local/cuda/compat/lib/libcuda.so.1'].
Please create a symlink of libcuda.so to any of the files.

I found that the actual libcuda.so.1 is located under /usr/local/cuda/compat/lib.real/ instead.
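For anyone tracing the lookup: per the error message above, Triton wants to open the bare name libcuda.so, so a directory on its search path needs an unversioned symlink next to libcuda.so.1. A minimal local sketch of the layout the sandbox patch (below) ends up creating, using a hypothetical temp directory and stand-in files rather than the real container paths:

```shell
# Hypothetical demo directory mirroring /usr/local/cuda/compat in the image
demo=$(mktemp -d)
mkdir -p "$demo/lib.real"
touch "$demo/lib.real/libcuda.so.1"             # stand-in for the compat driver lib
ln -s libcuda.so.1 "$demo/lib.real/libcuda.so"  # unversioned name Triton looks up
ln -s lib.real "$demo/lib"                      # what the sandbox workaround creates
readlink -f "$demo/lib/libcuda.so"              # prints a path ending in lib.real/libcuda.so.1
```

With the `lib -> lib.real` symlink in place, any lookup under `compat/lib/` transparently resolves into `lib.real/`, which is why the sandbox patch satisfies Triton.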

The only workaround that succeeded was converting the SIF into a writable sandbox and creating the symlink:

# 1) Build writable sandbox
apptainer build --sandbox pytorch_sandbox_2504 pytorch_25.04.sif

# 2) Enter and patch
apptainer exec --nv --writable pytorch_sandbox_2504 bash -lc '
  cd /usr/local/cuda/compat
  ln -s lib.real lib
'

# 3) Run training
mpirun -np 2 --map-by ppr:1:node \
  apptainer exec --nv \
    pytorch_sandbox_2504 \
    python run_simple_mcore_train_loop.py --timestamp $timestamp

After this, /usr/local/cuda/compat/lib/libcuda.so is present and Triton loads the driver successfully.
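A possibly cleaner variant I have not yet tested: some Triton releases consult a TRITON_LIBCUDA_PATH environment variable before searching for libcuda.so, and Apptainer can inject environment variables with --env. Assuming this container's Triton build honors that variable (an assumption, not something I have verified), the launch would look like:

```shell
# Untested sketch: TRITON_LIBCUDA_PATH support in this Triton build is an
# assumption; the SIF stays read-only and no sandbox conversion is needed.
mpirun -np 2 --map-by ppr:1:node \
  apptainer exec --nv \
    --env TRITON_LIBCUDA_PATH=/usr/local/cuda/compat/lib.real \
    pytorch_25.04.sif \
    python run_simple_mcore_train_loop.py --timestamp "$timestamp"
```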

Questions:

  • Is there a cleaner way to satisfy Triton’s driver lookup within an Apptainer container without a sandbox conversion?
  • Why does the NGC image place driver libraries under compat/lib.real instead of compat/lib?
  • Any best practices for running Triton (and Transformer‑Engine) inside NGC containers on HPC systems?