Hi,
I have a student who is learning Modulus. His GPU is a GTX 1660 Super, and he is using the PyTorch Docker image under WSL. We installed Modulus 22.09 following the instructions here:
https://forums.developer.nvidia.com/t/problem-using-modulus-22-07-in-wsl2/226578/2
The ldc_2d.py and helmholtz.py examples worked, but cylinder_2d.py failed. The error message is:
root@8210ecb7e13a:/mini_examples_22.09/cylinder2# python cylinder_2d.py
[13:03:04] - JIT using the NVFuser TorchScript backend
[13:03:04] - JitManager: {'_enabled': True, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[13:03:04] - GraphManager: {'_func_arch': False, '_debug': False, '_func_arch_allow_partial_hessian': True}
length scale is 20 meter
time scale is 20.0 second
mass scale is 8000.0 kilogram
[13:03:08] - attempting to restore from: outputs/cylinder_2d
[13:03:08] - optimizer checkpoint not found
[13:03:08] - model flow_network.0.pth not found
/opt/conda/lib/python3.8/site-packages/modulus-22.9-py3.8.egg/modulus/eq/derivatives.py:85: UserWarning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable export PYTORCH_NVFUSER_DISABLE=fallback
 (Triggered internally at /opt/pytorch/pytorch/torch/csrc/jit/codegen/cuda/manager.cpp:329.)
  grad = gradient(var, grad_var)
Error executing job with overrides: []
Traceback (most recent call last):
  File "cylinder_2d.py", line 176, in run
    slv.solve()
  File "/opt/conda/lib/python3.8/site-packages/modulus-22.9-py3.8.egg/modulus/solver/solver.py", line 159, in solve
    self._train_loop(sigterm_handler)
  File "/opt/conda/lib/python3.8/site-packages/modulus-22.9-py3.8.egg/modulus/trainer.py", line 521, in _train_loop
    loss, losses = self._cuda_graph_training_step(step)
  File "/opt/conda/lib/python3.8/site-packages/modulus-22.9-py3.8.egg/modulus/trainer.py", line 702, in _cuda_graph_training_step
    self.loss_static, self.losses_static = self.compute_gradients(
  File "/opt/conda/lib/python3.8/site-packages/modulus-22.9-py3.8.egg/modulus/trainer.py", line 54, in adam_compute_gradients
    losses_minibatch = self.compute_losses(step)
  File "/opt/conda/lib/python3.8/site-packages/modulus-22.9-py3.8.egg/modulus/solver/solver.py", line 52, in compute_losses
    return self.domain.compute_losses(step)
  File "/opt/conda/lib/python3.8/site-packages/modulus-22.9-py3.8.egg/modulus/domain/domain.py", line 133, in compute_losses
    constraint.forward()
  File "/opt/conda/lib/python3.8/site-packages/modulus-22.9-py3.8.egg/modulus/domain/constraint/continuous.py", line 116, in forward
    self._output_vars = self.model(self._input_vars)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/modulus-22.9-py3.8.egg/modulus/graph.py", line 220, in forward
    outvar.update(e(outvar))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/modulus-22.9-py3.8.egg/modulus/eq/derivatives.py", line 85, in forward
    grad = gradient(var, grad_var)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/modulus-22.9-py3.8.egg/modulus/eq/derivatives.py", line 24, in gradient
    """
    grad_outputs: List[Optional[torch.Tensor]] = [torch.ones_like(y, device=y.device)]
    grad = torch.autograd.grad(
           ~~~~~~~~~~~~~~~~~~~ <--- HERE
        [
            y,
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "<string>", line 119, in fallback_cuda_fuser
            def backward(grad_output):
                input_sigmoid = torch.sigmoid(self)
                return grad_output * (input_sigmoid * (1 + self * (1 - input_sigmoid)))
                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            return result, backward
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
root@8210ecb7e13a:/mini_examples_22.09/cylinder2#
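The log above names three environment variables that should give a more detailed trace: PYTORCH_NVFUSER_DISABLE=fallback, CUDA_LAUNCH_BLOCKING=1, and HYDRA_FULL_ERROR=1. Below is a minimal sketch of a rerun with all three set. The helper names build_debug_env and rerun are made up for illustration and are not part of Modulus; the sketch also assumes it is started from the directory that contains cylinder_2d.py.

```python
import os
import subprocess
import sys

# Debug switches taken from the hints printed in the log.
DEBUG_ENV = {
    "PYTORCH_NVFUSER_DISABLE": "fallback",  # disable the fallback path so the codegen failure surfaces
    "CUDA_LAUNCH_BLOCKING": "1",            # synchronous launches, so the error is reported at the failing call
    "HYDRA_FULL_ERROR": "1",                # ask Hydra for the complete stack trace
}


def build_debug_env(base=None):
    """Return a copy of the environment with the debug switches added (hypothetical helper)."""
    env = dict(os.environ if base is None else base)
    env.update(DEBUG_ENV)
    return env


def rerun(script="cylinder_2d.py"):
    """Run the example with the debug environment and return its exit code (hypothetical helper)."""
    completed = subprocess.run([sys.executable, script], env=build_debug_env())
    return completed.returncode
```

Calling rerun() should reproduce the failure with the extra diagnostics, which might help tell an NVFuser code generation problem apart from a GPU or driver problem.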
May I know what’s wrong? Is it caused by the GPU or by a driver problem?
Thanks.