When running inside a Docker container, I intermittently get the following warning:
W1105 06:16:44.530000 75 torch/utils/cpp_extension.py:118] No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Given that this container is specifically built to provide CUDA and PyTorch, I find this surprising. It seems to happen intermittently, possibly only on one of the machines in the pool. In any case, nvidia-smi inside the container reports a healthy machine, with GPUs and CUDA.
Here is the full error output:
+ python benchmarks/benchmark_attn.py
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
W1105 06:16:57.201000 284 torch/utils/cpp_extension.py:118] No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Traceback (most recent call last):
  File "/tmp/workspace/fa4/benchmarks/benchmark_attn.py", line 35, in <module>
    if torch.cuda.get_device_capability()[0] != 9:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 598, in get_device_capability
    prop = get_device_properties(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 614, in get_device_properties
    _lazy_init() # will define _get_device_properties
    ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py", line 410, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
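To check whether the failure really follows one machine in the pool, it may help to log a few facts at the start of every run, before the benchmark touches CUDA. The following is only a minimal sketch: the helper name collect_cuda_diagnostics is hypothetical, and it assumes the container exposes GPUs as /dev/nvidia* device nodes and that nvidia-smi is on the PATH. It records the hostname, the GPU-related environment variables, and what PyTorch itself reports, so that failing and passing runs can be compared.

```python
import glob
import os
import shutil
import socket
import subprocess


def collect_cuda_diagnostics():
    """Gather host and GPU visibility facts to log before any CUDA work.

    Hypothetical helper, not part of PyTorch or the benchmark script.
    """
    info = {
        # The hostname lets failures be correlated with a specific machine.
        "hostname": socket.gethostname(),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "NVIDIA_VISIBLE_DEVICES": os.environ.get("NVIDIA_VISIBLE_DEVICES"),
        # Assumption: GPUs appear as /dev/nvidia* nodes inside the container.
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
    }

    # `nvidia-smi -L` lists the GPUs that the driver can see.
    if shutil.which("nvidia-smi"):
        try:
            proc = subprocess.run(
                ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
            )
            info["nvidia_smi"] = proc.stdout.strip() or proc.stderr.strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            info["nvidia_smi"] = f"failed: {exc}"
    else:
        info["nvidia_smi"] = "not found"

    # PyTorch is optional here so the script still reports the rest without it.
    try:
        import torch

        info["torch_version"] = torch.__version__
        info["torch_cuda_build"] = torch.version.cuda
        info["torch_cuda_available"] = torch.cuda.is_available()
        info["torch_device_count"] = torch.cuda.device_count()
    except ImportError:
        info["torch_version"] = None

    return info


if __name__ == "__main__":
    for key, value in collect_cuda_diagnostics().items():
        print(f"{key}: {value}")
```

Running this as a separate step just before python benchmarks/benchmark_attn.py would show whether the failing runs share a hostname, and whether nvidia-smi and PyTorch disagree about GPU visibility within the same container.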