CUDA sometimes not available in nvcr.io/nvidia/pytorch:25.09-py3

When running inside a Docker container I sometimes (not consistently) get:

W1105 06:16:44.530000 75 torch/utils/cpp_extension.py:118] No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'

Given that this container is specifically built to ship with CUDA and PyTorch, I find this surprising. It seems to happen intermittently, possibly only on one of the machines in the pool. In any case, nvidia-smi inside the container reports a healthy machine, with GPUs and CUDA visible.
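For reference, when it happens I can see the mismatch with a quick check like this (just a diagnostic sketch, not part of the benchmark):

import os
import torch

# What the wheel was built for vs. what the runtime actually sees.
print("torch:", torch.__version__, "built for CUDA", torch.version.cuda)
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))

# is_available() triggers the same lazy CUDA init the benchmark trips over;
# on the bad runs it returns False while nvidia-smi still sees the GPUs.
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())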

Here is the full error output:

+ python benchmarks/benchmark_attn.py

/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]

/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0

W1105 06:16:57.201000 284 torch/utils/cpp_extension.py:118] No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'

Traceback (most recent call last):
  File "/tmp/workspace/fa4/benchmarks/benchmark_attn.py", line 35, in <module>
    if torch.cuda.get_device_capability()[0] != 9:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 598, in get_device_capability
    prop = get_device_properties(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 614, in get_device_properties
    _lazy_init()  # will define _get_device_properties
    ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 410, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.

Hey mate, I have run into that same CUDA error in Docker before while working on the Hypic Project; it's usually an environment setup issue. Try restarting the container with --gpus all and make sure the NVIDIA container runtime is properly enabled. Also check that CUDA_HOME and LD_LIBRARY_PATH are set correctly. A clean rebuild of the image often clears the intermittent "No CUDA runtime is found" warning.
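A quick way to eyeball those from inside the container (just a sketch covering the variables I mentioned, plus the two that most often matter in containers):

import os

# Env vars that commonly affect CUDA init inside containers.
for var in ("CUDA_HOME", "LD_LIBRARY_PATH", "CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES"):
    print(f"{var}={os.environ.get(var)}")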

This is an NVIDIA-supplied container: nvcr.io/nvidia/pytorch:25.09-py3
I would expect it to be set up properly already.

I am launching it like so:
$ docker run --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --tty --detach --security-opt seccomp=unconfined --shm-size=4g -v /home/jld/pytorch-integration-testing/pytorch-integration-testing:/tmp/workspace -w /tmp/workspace nvcr.io/nvidia/pytorch:25.09-py3
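Since nvidia-smi works, the driver side looks fine, and I can confirm that from Python too, independently of torch's CUDA runtime init, via NVML (using nvidia-ml-py, which the FutureWarning above recommends; just a diagnostic sketch):

import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
# NVML talks to the driver directly (the same path nvidia-smi uses),
# bypassing the CUDA runtime init that torch fails on.
print("driver:", pynvml.nvmlSystemGetDriverVersion())
count = pynvml.nvmlDeviceGetCount()
for i in range(count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    print(i, pynvml.nvmlDeviceGetName(handle))
pynvml.nvmlShutdown()

When this lists all the GPUs while torch.cuda.is_available() is False, the failure is presumably in CUDA runtime initialization rather than in the driver itself.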