Hello,
We have an 8x GPU (V100) system running RHEL 8 that we've been using with Modulus for research (EDU).
Our IT department recently updated the system, installing the latest GPU drivers and CUDA available from the NVIDIA stream, but now the Modulus container no longer works.
Running any Modulus script inside the container now produces the following error:
...
RuntimeError: CUDA error: system has unsupported display driver / cuda driver combination
...
Our currently installed driver and CUDA versions are:
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
I’ve re-pulled the latest Modulus image from NGC (22.09), which says it was updated Oct 31st, and I see the same error.
Yet this document indicates that the v22.09 deep learning containers should be supported on our driver/CUDA combination.
In the meantime, I’ve managed to get a bare-metal install of Modulus running in a conda environment. It gives me warnings about unsupported TorchScript versions (I’ll start another topic about that), but otherwise it seems to run fine with the same scripts that fail in the container.
I’m using the 22.09 release from the Modulus GitLab, which I assume is the same as what's inside the 22.09 container images - so I wonder if the CUDA runtime error is being thrown by an older PyTorch version within the container that needs to be updated?
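A quick way to narrow that down would be to print the PyTorch build and its bundled CUDA runtime from inside the container, something like the following (the image filename is just what I happen to call my local copy):

❯ singularity exec --nv ./modulus_22.09.simg python -c "import torch; print(torch.__version__, torch.version.cuda)"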
Hi @bsarkar, does the NGC PyTorch container work for you or does that fail too?
nvcr.io/nvidia/pytorch:22.08-py3
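If you're using Singularity, something along these lines should give you a local image to test with (the output filename is just a suggestion):

❯ singularity pull pytorch.22.08-py3.sif docker://nvcr.io/nvidia/pytorch:22.08-py3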
Yes, that one does seem to work:
❯ singularity exec --nv ./pytorch.22.08-py3.sif /bin/bash
Singularity> python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
True
Running the same in the Modulus image, without executing my scripts directly, I see a few additional error messages about the compat lib during container startup:
❯ singularity exec --nv ./modulus_22.09.simg /bin/bash
/usr/bin/rm: cannot remove '/usr/local/cuda/compat/lib': Read-only file system
rm: cannot remove '/usr/local/cuda/compat/lib': Read-only file system
rm: cannot remove '/usr/local/cuda/compat/lib': Read-only file system
Singularity> python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
False
>>>
I don’t see the compat folder in my local filesystem, so is that part of the container image overlay?
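One way to check is to list the path from inside the image and compare with the host (path taken from the startup errors above):

❯ singularity exec ./modulus_22.09.simg ls -l /usr/local/cuda/compat/lib
❯ ls -l /usr/local/cuda/compat/lib

If the directory only exists inside the container, it's part of the image rather than something mounted from the host.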
OK, with that information in hand I was able to create a writable copy of the container:
singularity overlay create --size 1024 modulus_22.09.simg
This now allows the compat library to be removed during startup:
❯ singularity exec --nv --writable ./modulus_22.09.simg /bin/bash
WARNING: nv files may not be bound with --writable
bash: fish: command not found
Singularity> python
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
True
(Despite the warning that nv files may not be bound with --writable, this still seemed to work.)
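As an aside, an alternative that might avoid the writable copy entirely would be to bind an empty host directory over the compat path, so the host driver libraries injected by --nv get used instead. I haven't verified this, so treat it as a sketch:

❯ mkdir -p /tmp/empty_compat
❯ singularity exec --nv --bind /tmp/empty_compat:/usr/local/cuda/compat/lib ./modulus_22.09.simg python -c "import torch; print(torch.cuda.is_available())"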
Is this a fundamental difference between Docker and Singularity, i.e. the container filesystem is not mutable unless the image is explicitly built that way?
We can't use Docker in our university HPC environment, so I had followed the NGC instructions for creating Singularity builds.
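For reference, the image above was built along these lines, following the NGC docs (I'm assuming the Modulus registry path and tag here from the NGC catalog, so double-check the exact repository):

❯ singularity build modulus_22.09.simg docker://nvcr.io/nvidia/modulus/modulus:22.09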