Problem with bare metal install for 22.09 in WSL: CUDA graphs may only be used in Pytorch built with CUDA >= 11


I have problems with Docker in WSL for 22.09, so I tried a bare-metal install instead. When I ran the Helmholtz example, it ran for a few steps and then failed with this error:

[11:12:04] - JitManager: {'_enabled': False, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[11:12:04] - GraphManager: {'_func_arch': False, '_debug': False, '_func_arch_allow_partial_hessian': True}
[11:12:09] - attempting to restore from: outputs/helmholtz
[11:12:09] - Success loading optimizer: outputs/helmholtz/optim_checkpoint.0.pth
[11:12:09] - Success loading model: outputs/helmholtz/wave_network.0.pth
[11:12:10] - [step:          0] record constraint batch time:  9.617e-02s
[11:12:11] - [step:          0] record validators time:  1.156e+00s
[11:12:11] - [step:          0] saved checkpoint to outputs/helmholtz
[11:12:11] - [step:          0] loss:  9.899e+03
[11:12:13] - Attempting cuda graph building, this may take a bit...
Error executing job with overrides: []
Traceback (most recent call last):
  File "", line 92, in run
  File "/home/user/modulus_22.09/lib/python3.8/site-packages/modulus-22.9-py3.8.egg/modulus/solver/", line 159, in solve
  File "/home/user/modulus_22.09/lib/python3.8/site-packages/modulus-22.9-py3.8.egg/modulus/", line 521, in _train_loop
    loss, losses = self._cuda_graph_training_step(step)
  File "/home/user/modulus_22.09/lib/python3.8/site-packages/modulus-22.9-py3.8.egg/modulus/", line 724, in _cuda_graph_training_step
    self.g = torch.cuda.CUDAGraph()
  File "/home/user/modulus_22.09/lib/python3.8/site-packages/torch/cuda/", line 50, in __init__
    super(CUDAGraph, self).__init__()
RuntimeError: CUDA graphs may only be used in Pytorch built with CUDA >= 11.0 and not yet supported on ROCM

How can I solve this error?


Hi @tsltaywb

Try turning CUDA graphs off in the config.yaml file by adding: cuda_graphs: False.

Have a look at helmholtz/conf/config_hardBC.yaml for an example with this setting; it will disable CUDA graph capture. Keep in mind that CUDA graphs are a beta feature in PyTorch, so support may be limited, as seems to be the case here.
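For reference, a minimal sketch of what the example's config might look like with the setting applied. The surrounding keys are illustrative of a typical Modulus 22.09 Hydra config and are assumptions here; only the cuda_graphs line is the actual fix:

```yaml
# conf/config.yaml -- illustrative fragment; the defaults list below is
# an assumption based on the 22.09 examples, not the exact file contents
defaults:
  - modulus_default
  - scheduler: tf_exponential_lr
  - optimizer: adam
  - _self_

cuda_graphs: False   # disable CUDA graph capture to avoid the RuntimeError above
```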


I have the same problem as @tsltaywb, but in my case the file config_hardBC.yaml was already modified with the solution you describe. The problem persists even if I change cuda_graphs: True, so it looks like this setting is not the source of the problem.

I have CUDA 11.3 and the correct version of PyTorch (installed from this link with conda).

How can I solve this?


Hi @tom_02

It's not 100% clear what your problem is. The original post is specifically about a CUDA graphs error, which should not appear with CUDA graphs off.

To shut it off, you need to edit the config for the example you are actually running (for the issue in the original post, that means adding the setting to examples/helmholtz/conf/config.yaml). The hardBC config is for the hard boundary condition variant of the example, not the base script.

Let me know if it does not run after modifying the correct config file.
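One quick sanity check, since a conda install was mentioned: confirm which CUDA toolkit the installed PyTorch was actually built against. CUDA graphs require a CUDA (not ROCm) build with CUDA >= 11.0, and a mismatched or CPU-only build would reproduce the RuntimeError regardless of the driver version. This check only reads the installed torch package and makes no Modulus assumptions:

```python
import torch

# A CUDA-graphs-capable build must report a CUDA version >= 11.0 here;
# a CPU-only build reports None, and a ROCm build is also unsupported.
print("PyTorch version:", torch.__version__)
print("Built against CUDA:", torch.version.cuda)
print("torch.cuda.is_available():", torch.cuda.is_available())
```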

Thanks, this worked for me.