Modulus-Sym examples: ldc error

Hi, so I have been able to get the helmholtz and chip_2d cases to work. However, I am getting an error when running the ldc and ldc_zeroEq models. I saw some answers on this forum, but they did not help in my case. The error is as follows:

modulus-sym/examples/ldc# python3 ldc_2d.py
/usr/local/lib/python3.8/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See the Hydra docs page "Changes to job's runtime working directory" for more information.
ret = run_job(
[01:56:29] - JitManager: {'_enabled': False, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[01:56:29] - GraphManager: {'_func_arch': False, '_debug': False, '_func_arch_allow_partial_hessian': True}
[01:56:34] - attempting to restore from: outputs/ldc_2d
[01:56:34] - optimizer checkpoint not found
[01:56:34] - model flow_network.0.pth not found
Error executing job with overrides:
Traceback (most recent call last):
File "ldc_2d.py", line 136, in run
slv.solve()
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/solver/solver.py", line 173, in solve
self._train_loop(sigterm_handler)
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/trainer.py", line 535, in _train_loop
loss, losses = self._cuda_graph_training_step(step)
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/trainer.py", line 716, in _cuda_graph_training_step
self.loss_static, self.losses_static = self.compute_gradients(
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/trainer.py", line 68, in adam_compute_gradients
losses_minibatch = self.compute_losses(step)
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/solver/solver.py", line 66, in compute_losses
return self.domain.compute_losses(step)
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/domain/domain.py", line 147, in compute_losses
constraint.forward()
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/domain/constraint/continuous.py", line 130, in forward
self._output_vars = self.model(self._input_vars)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/graph.py", line 234, in forward
outvar.update(e(outvar))
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/eq/derivatives.py", line 99, in forward
grad = gradient(var, grad_var)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/modulus/sym/eq/derivatives.py", line 38, in gradient
"""
grad_outputs: List[Optional[torch.Tensor]] = [torch.ones_like(y, device=y.device)]
grad = torch.autograd.grad(
~~~~~~~~~~~~~~~~~~~ <--- HERE
[
y,
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
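For anyone following along: both flags suggested in the output above can also be set inside the script itself, before PyTorch initializes CUDA and before Hydra runs. A minimal sketch, placed at the very top of ldc_2d.py, above all other imports:

import os

# Synchronous kernel launches: the CUDA error is then raised at the
# actual failing call instead of a later, unrelated API call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Full Hydra stack trace instead of the truncated one.
os.environ["HYDRA_FULL_ERROR"] = "1"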

Hi @mitanshtrip

Sorry you're having problems with the LDC example; it's a bit unusual for this example to error out. Can you tell me a little more about your environment? Is this inside the Modulus Docker container, or is this a pip / bare-metal install? What is your PyTorch version? What hardware are you running on?

It's odd that chip_2d works for you but this doesn't. Can I also ask you to try the annular ring example (another Navier-Stokes system)? Thanks!

So I am using WSL on Windows to run the cases, with the pip / bare-metal installation.

The PyTorch version is: 1.13.0+cu116

The annular ring example gave me the exact same error that I pasted in the query above.

Thank you

Hi @mitanshtrip

Unfortunately, we don't formally test on WSL. I would first verify that you're not running out of memory on your GPU. Another thing to try is commenting out the JIT decorator on the gradient calculation in modulus/sym/eq/derivatives.py and seeing whether the CUDA error becomes more informative (see the sketch below). Otherwise, debugging a RuntimeError: CUDA error: unknown error is rather challenging.
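For reference, here is roughly what that function looks like (a sketch reconstructed from the traceback above, not the exact installed source, so details may differ slightly). Removing the decorator makes the gradient run in eager mode instead of the TorchScript interpreter, which usually surfaces a more specific error:

# /usr/local/lib/python3.8/dist-packages/modulus/sym/eq/derivatives.py
# Sketch based on the traceback; the installed source may differ.
from typing import List, Optional
import torch

# @torch.jit.script  # <-- comment this decorator out for eager-mode debugging
def gradient(y: torch.Tensor, x: List[torch.Tensor]) -> List[torch.Tensor]:
    """Compute the gradient of y with respect to each tensor in x."""
    grad_outputs: List[Optional[torch.Tensor]] = [torch.ones_like(y, device=y.device)]
    grad = torch.autograd.grad(
        [y], x, grad_outputs=grad_outputs, create_graph=True, allow_unused=True
    )
    # Replace gradients for unused inputs with zeros so downstream
    # code always receives a tensor.
    return [g if g is not None else torch.zeros_like(xx) for g, xx in zip(grad, x)]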

Hi, so I tested these out on an HPC cluster (just a single node) and had the same issue: helmholtz and chip_2d work, but ldc and annular ring do not.

Instead of running it on GPUs, can this be run on CPUs?

Hi @mitanshtrip

I just tested a bare-metal install of nvidia-modulus.sym, and the LDC problem runs fine on a Unix V100 GPU. If you have a CPU installation and PyTorch runs on the CPU, then most of the examples should function, perhaps with some modification. However, I would strongly advise against this, since training will be very slow and many features will not work.
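If you want to try it anyway, the simplest way to force a CPU run without modifying Modulus itself is to hide the GPUs from PyTorch before anything initializes CUDA. A minimal sketch, placed at the very top of ldc_2d.py (note that CUDA graphs are GPU-only and must also be disabled; the cuda_graphs key below is an assumption based on the default Modulus-Sym config layout):

# Very top of ldc_2d.py, before torch / modulus are imported.
import os

# Hide all GPUs so torch.cuda.is_available() returns False and
# the framework falls back to the CPU device.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# Also disable CUDA graphs in the example's conf/config.yaml
# (assumed key from the default Modulus-Sym config):
#   cuda_graphs: false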

Did you try commenting out the JIT decorator? Also, what hardware are you using? Confirm you are not running out of GPU memory.
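A quick way to check memory headroom from the same Python environment (standard PyTorch API, nothing Modulus-specific):

import torch

# Print free vs. total device memory in GB; run this while (or just
# before) training to see how close you are to the limit.
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # both in bytes
    print(f"free: {free / 1e9:.2f} GB of {total / 1e9:.2f} GB total")
else:
    print("CUDA not available")

Watching nvidia-smi in another terminal while the example runs gives the same information.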