Modulus-Sym examples: ldc error

Hi, so I have been able to get the helmholtz and chip_2d cases to work. However, I am getting an error when running the ldc and ldc_zeroEq models. I saw some answers on this forum, but they did not help in my case. The error is as follows:

modulus-sym/examples/ldc# python3 ldc_2d.py
/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See the Hydra docs page "Changes to job's runtime working directory" for more information.
ret = run_job(
[01:56:29] - JitManager: {'_enabled': False, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[01:56:29] - GraphManager: {'_func_arch': False, '_debug': False, '_func_arch_allow_partial_hessian': True}
[01:56:34] - attempting to restore from: outputs/ldc_2d
[01:56:34] - optimizer checkpoint not found
[01:56:34] - model flow_network.0.pth not found
Error executing job with overrides:
Traceback (most recent call last):
File "ldc_2d.py", line 136, in run
slv.solve()
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/solver/solver.py", line 173, in solve
self._train_loop(sigterm_handler)
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/trainer.py", line 535, in _train_loop
loss, losses = self._cuda_graph_training_step(step)
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/trainer.py", line 716, in _cuda_graph_training_step
self.loss_static, self.losses_static = self.compute_gradients(
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/trainer.py", line 68, in adam_compute_gradients
losses_minibatch = self.compute_losses(step)
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/solver/solver.py", line 66, in compute_losses
return self.domain.compute_losses(step)
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/domain/domain.py", line 147, in compute_losses
constraint.forward()
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/domain/constraint/continuous.py", line 130, in forward
self._output_vars = self.model(self._input_vars)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/graph.py", line 234, in forward
outvar.update(e(outvar))
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/eq/derivatives.py", line 99, in forward
grad = gradient(var, grad_var)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/eq/derivatives.py", line 38, in gradient
"""
grad_outputs: List[Optional[torch.Tensor]] = [torch.ones_like(y, device=y.device)]
grad = torch.autograd.grad(
~~~~~~~~~~~~~~~~~~~ <--- HERE
[
y,
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
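For anyone following along: both flags suggested in the output above can also be set inside the script itself, before PyTorch initializes CUDA and before Hydra runs. A minimal sketch, placed at the very top of ldc_2d.py, above all other imports:

import os

# Synchronous kernel launches: the CUDA error is then raised at the
# actual failing call instead of a later, unrelated API call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Full Hydra stack trace instead of the truncated one.
os.environ["HYDRA_FULL_ERROR"] = "1"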

Hi @mitanshtrip

Sorry you're having problems with the LDC example; it's a bit unusual for this example to error out. Can you tell me a little more about your environment? Is this inside the Modulus Docker container, or is this a pip / bare-metal install? What is your PyTorch version? What hardware are you running on?

It's odd that chip_2d works for you but this doesn't. Can I also ask you to try the annular ring example (another Navier-Stokes system)? Thanks!

So I am using WSL on Windows to run the cases, with the pip / bare-metal installation.

The PyTorch version is: 1.13.0+cu116

The annular ring example gave me the exact same error that I pasted in the query above.

Thank you

Hi @mitanshtrip

Unfortunately, we don't formally test on WSL. I would first verify that you're not running out of memory on your GPU. Another thing to try is commenting out the JIT decorator on the gradient calculation in modulus/sym/eq/derivatives.py and seeing whether the CUDA error becomes more informative (see the sketch below). Otherwise, debugging a RuntimeError: CUDA error: unknown error is rather challenging.
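For reference, here is roughly what that function looks like (a sketch reconstructed from the traceback above, not the exact installed source, so details may differ slightly). Removing the decorator makes the gradient run in eager mode instead of the TorchScript interpreter, which usually surfaces a more specific error:

# /usr/local/lib/python3.8/dist-packages/modulus/sym/eq/derivatives.py
# Sketch based on the traceback; the installed source may differ.
from typing import List, Optional
import torch

# @torch.jit.script  # <-- comment this decorator out for eager-mode debugging
def gradient(y: torch.Tensor, x: List[torch.Tensor]) -> List[torch.Tensor]:
    """Compute the gradient of y with respect to each tensor in x."""
    grad_outputs: List[Optional[torch.Tensor]] = [torch.ones_like(y, device=y.device)]
    grad = torch.autograd.grad(
        [y], x, grad_outputs=grad_outputs, create_graph=True, allow_unused=True
    )
    # Replace gradients for unused inputs with zeros so downstream
    # code always receives a tensor.
    return [g if g is not None else torch.zeros_like(xx) for g, xx in zip(grad, x)]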

Hi, so I tested these out on an HPC cluster (just a single node) and had the same issue: helmholtz and chip_2d work, but ldc and annular ring do not.

Instead of running it on GPUs, can this be run on CPUs?

Hi @mitanshtrip

I just tested a bare-metal install of nvidia-modulus.sym, and the LDC problem runs fine on a Unix V100 GPU. If you have a CPU installation and PyTorch runs on the CPU, then most of the examples should function, perhaps with some modification. However, I would strongly advise against this, since training will be very slow and many features will not work.
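If you want to try it anyway, the simplest way to force a CPU run without modifying Modulus itself is to hide the GPUs from PyTorch before anything initializes CUDA. A minimal sketch, placed at the very top of ldc_2d.py (note that CUDA graphs are GPU-only and must also be disabled; the cuda_graphs key below is an assumption based on the default Modulus-Sym config layout):

# Very top of ldc_2d.py, before torch / modulus are imported.
import os

# Hide all GPUs so torch.cuda.is_available() returns False and
# the framework falls back to the CPU device.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# Also disable CUDA graphs in the example's conf/config.yaml
# (assumed key from the default Modulus-Sym config):
#   cuda_graphs: false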

Did you try commenting out the JIT decorator? Also, what hardware are you using? Confirm you are not running out of GPU memory.
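A quick way to check memory headroom from the same Python environment (standard PyTorch API, nothing Modulus-specific):

import torch

# Print free vs. total device memory in GB; run this while (or just
# before) training to see how close you are to the limit.
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # both in bytes
    print(f"free: {free / 1e9:.2f} GB of {total / 1e9:.2f} GB total")
else:
    print("CUDA not available")

Watching nvidia-smi in another terminal while the example runs gives the same information.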