Error running 22.07 container with examples - Failed to create shim task

It seems like the 22.07 container will not run. This is my current error. Can anyone help me with what this is or how to fix it?

docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all -v ${PWD}/examples:/examples -it modulus:22.07 bash
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/ca5de1d2dbb200798f83372e1d274ce3e2fe6773eb5805ed35859d4e2c02e76f/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown.

Hi @ckitchell ,

Is there any chance you're running the NVIDIA Container Toolkit on WSL? There is presently a known issue with nvidia-docker on Windows/WSL systems.
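
If you want to confirm what is in play, a couple of quick checks on the host can help when matching against known issues (these are just standard diagnostics, not a fix, and the output will vary with your setup):

# Confirm the NVIDIA container library version and that the nvidia runtime
# is registered with docker.
nvidia-container-cli --version
docker info | grep -i runtime

# Confirm the driver itself can see the GPU outside of any container.
nvidia-smi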

Yes. I have been trying WSL… is there another, easier way… 22.03 and 22.07 seem to be giving the same issues for me. I read online that WSL is not a good choice.

Depending on your needs, you could try a bare-metal installation. Most of the utilities in Modulus should work if PyTorch works on your system, but of course we encourage the docker image for consistency across our users' development environments.
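
For reference, a minimal sketch of what a bare-metal install might look like, assuming you already have a working PyTorch + CUDA environment and have downloaded the Modulus source release (the archive name below is a placeholder):

# Hedged sketch of a bare-metal install; assumes PyTorch + CUDA already work.
# The archive name is a placeholder for whatever the downloads page gives you.
tar -xzf Modulus_source.tar.gz
cd modulus

# Install Modulus and its Python dependencies into the current environment.
python -m pip install .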

Alternatively you could look into a cloud based service.

Hi all, I am on WSL and it’s working well up to 22.03.1.

I tried the solution given by ngeneva. I deleted the files one by one, as each error named a library, e.g.:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/a92497fde29f5e4a16659087de1978a2ff7cf59a53b410f240467c3aead3f609/merged/usr/lib/x86_64-linux-gnu/libnvcuvid.so.1: file exists: unknown.
ERRO[0000] error waiting for container: context canceled

So I deleted /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
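
For anyone else hitting this: rather than deleting files one at a time inside a container that keeps failing to start, a rough way to script the same cleanup is to start the image once without --gpus (so the NVIDIA hook never runs), remove the conflicting libraries, and commit the result as a new image. The file list below is a placeholder; adjust it to whatever your "file exists" errors name:

# Start the container WITHOUT --gpus so the NVIDIA mount hook does not run.
docker run --name modulus-fix -d modulus:22.07 sleep infinity

# Remove the duplicate driver libraries baked into the image; edit this
# list to match the files named in your errors.
docker exec modulus-fix rm -f \
    /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 \
    /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1

# Commit the cleaned container as a new image and remove the scratch container.
docker commit modulus-fix modulus:22.07-wsl
docker rm -f modulus-fix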

In the end, after deleting approximately 6 files, there was no more error message, but Modulus reported:

ERROR: No supported GPU(s) detected to run this container

I ran "nvidia-smi" and it did report my GPU:

root@d7a49cf80974:/examples# nvidia-smi
Tue Sep 27 08:59:08 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.75       Driver Version: 517.40       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce …     On  | 00000000:01:00.0  On |                  N/A |
| 27%   33C    P8    13W / 275W |    628MiB / 11264MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, running the example still gives an error:

Error executing job with overrides:
Traceback (most recent call last):
  File "helmholtz.py", line 92, in run
    slv.solve()
  File "/modulus/modulus/solver/solver.py", line 159, in solve
    self._train_loop(sigterm_handler)
  File "/modulus/modulus/trainer.py", line 521, in _train_loop
    loss, losses = self._cuda_graph_training_step(step)
  File "/modulus/modulus/trainer.py", line 694, in _cuda_graph_training_step
    self.warmup_stream = torch.cuda.Stream()
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/streams.py", line 34, in __new__
    return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
RuntimeError: CUDA error: no CUDA-capable device is detected
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

So the GPU still doesn't work correctly.

Does anyone have a solution?

Thanks!

Hi @tsltaywb

I would start with just getting PyTorch working and making sure the GPU is visible to PyTorch prior to running Modulus.

>>> import torch
>>> torch.cuda.is_available()       # Should be True
>>> torch.cuda.device_count()       # Should be 1
>>> torch.cuda.current_device()     # Should be 0
>>> torch.cuda.device(0)            # Should return a device object, not error
>>> torch.cuda.get_device_name(0)   # Should print your GPU's name

Once PyTorch alone works with the GPU, Modulus should function.
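
If those checks fail inside the container on WSL, a few shell-level things are also worth verifying (the paths below are the usual WSL2 locations and may differ on your setup):

# Sanity checks for when nvidia-smi works but CUDA reports no device.
echo $NVIDIA_VISIBLE_DEVICES        # typically "all" when run with --gpus=all
echo $CUDA_VISIBLE_DEVICES          # "-1" or a wrong index would hide the GPU
ls -l /dev/dxg                      # WSL2 exposes the GPU through this device node
ls -l /usr/lib/wsl/lib/libcuda.so.1 # the WSL CUDA driver library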

Hi ngeneva,

Well, the problem is that without deleting the libnvidia* and libcuda* files, I can't enter the docker Modulus environment. But if I enter after deleting these files, the GPU doesn't work:
torch.cuda.is_available() returns False

Btw, 22.03.1 is working. I noticed that in the docker Modulus directory there are modulus and external directories. Can I overwrite them with the newer 22.09 ones? Will I get the new features of 22.09 if I do this?

Thanks.

Hi @tsltaywb

In theory, yes, you could do that with the modulus folder, which should allow most PyTorch-related features to function. The external folder is for the two external dependencies of Modulus (pysdf and tinycudann). I would be careful copying these over because they are compiled during the build of the docker image. It could be worth a try if you want pysdf functionality.

I've seen you've figured out a workaround with the 22.08 PyTorch container; you may also want to try some hacking with that method. Alternatively, you could comment out the PySDF items in the Dockerfile in the main repo and build your image with the same folder structure as the one we ship, then try bringing over the pre-compiled PySDF files from the 22.09 container.
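
As a rough sketch of the copy itself, assuming the /modulus layout shown in the traceback above, with placeholder image tags and container names throughout:

# Extract the 22.09 modulus source tree without running the image.
docker create --name m2209 modulus:22.09
docker cp m2209:/modulus/modulus ./modulus-22.09
docker rm m2209

# Copy it over the modulus folder inside your working 22.03.1 container
# (replace <container-id> with your running container's ID or name).
docker cp ./modulus-22.09/. <container-id>:/modulus/modulus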

Thanks for updating the forums with your solutions for others!