Hello,
We have an 8x GPU (V100) system running RHEL 8 that we've been using with Modulus for research (EDU).
Our IT department recently updated the system, installing the latest GPU drivers and CUDA available from the NVIDIA stream, but now the Modulus container no longer works.
Running any Modulus script inside the container now produces the following error:
...
RuntimeError: CUDA error: system has unsupported display driver / cuda driver combination
...
Our currently installed driver and CUDA versions are:
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
I’ve re-pulled the latest Modulus image from NGC (22.09), which says it was updated Oct 31st, and I see the same error.
Yet this document indicates that the v22.09 deep learning containers should be supported on our driver/CUDA combination.
In the meantime, I’ve managed to get a bare-metal install of Modulus running in a conda environment. It gives me warnings about unsupported TorchScript versions (I’ll start another topic about that), but otherwise it seems to run fine with the same scripts that fail in the container.
I’m using the 22.09 release from the Modulus GitLab, which I assume is the same as what's inside the 22.09 container images - so I wonder if the CUDA runtime error is being thrown by an older PyTorch version within the container that needs to be updated?
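A quick way to narrow that down would be to print the PyTorch build and its bundled CUDA runtime from inside the container, something like the following (the image filename is just what I happen to call my local copy):

❯ singularity exec --nv ./modulus_22.09.simg python -c "import torch; print(torch.__version__, torch.version.cuda)"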
Hi @bsarkar, does the NGC PyTorch container work for you or does that fail too?
nvcr.io/nvidia/pytorch:22.08-py3
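If you're using Singularity, something along these lines should give you a local image to test with (the output filename is just a suggestion):

❯ singularity pull pytorch.22.08-py3.sif docker://nvcr.io/nvidia/pytorch:22.08-py3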
Yes, that one does seem to work:
❯ singularity exec --nv ./pytorch.22.08-py3.sif /bin/bash
Singularity> python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
True
Running the same in the Modulus image, without executing my scripts directly, I see a few additional error messages about the compat lib during container startup:
❯ singularity exec --nv ./modulus_22.09.simg /bin/bash
/usr/bin/rm: cannot remove '/usr/local/cuda/compat/lib': Read-only file system
rm: cannot remove '/usr/local/cuda/compat/lib': Read-only file system
rm: cannot remove '/usr/local/cuda/compat/lib': Read-only file system
Singularity> python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
False
>>>
I don’t see the compat folder in my local filesystem, so is that part of the container image overlay?
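One way to check is to list the path from inside the image and compare with the host (path taken from the startup errors above):

❯ singularity exec ./modulus_22.09.simg ls -l /usr/local/cuda/compat/lib
❯ ls -l /usr/local/cuda/compat/lib

If the directory only exists inside the container, it's part of the image rather than something mounted from the host.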
OK, with that information in hand I was able to create a writable copy of the container:
singularity overlay create --size 1024 modulus_22.09.simg
This now allows the compat library to be removed during startup:
❯ singularity exec --nv --writable ./modulus_22.09.simg /bin/bash
WARNING: nv files may not be bound with --writable
bash: fish: command not found
Singularity> python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
True
(Despite the warning that nv files may not be bound with --writable, this still seemed to work.)
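As an aside, an alternative that might avoid the writable copy entirely would be to bind an empty host directory over the compat path, so the host driver libraries injected by --nv get used instead. I haven't verified this, so treat it as a sketch:

❯ mkdir -p /tmp/empty_compat
❯ singularity exec --nv --bind /tmp/empty_compat:/usr/local/cuda/compat/lib ./modulus_22.09.simg python -c "import torch; print(torch.cuda.is_available())"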
Is this a fundamental difference between Docker and Singularity, i.e. the container filesystem is not mutable unless the image is explicitly built that way?
We can't use Docker in our university HPC environment, so I had followed the NGC instructions for creating Singularity builds.
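For reference, the image above was built along these lines, following the NGC docs (I'm assuming the Modulus registry path and tag here from the NGC catalog, so double-check the exact repository):

❯ singularity build modulus_22.09.simg docker://nvcr.io/nvidia/modulus/modulus:22.09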