Out of memory when running Modulus with enhanced gradient


I have a Modulus code that predicts flow past airfoils. It runs fine, but we are trying to improve its accuracy, so I tried adding the enhanced gradient to the code. A simple debug version with a very low batch size and a simple geometry started training without error.

However, when I run the actual code, I keep getting out-of-memory errors, even after repeatedly reducing the batch size as well as the number of layers and neurons.

Here is the error:

Error executing job with overrides:
Traceback (most recent call last):
File "/scratch/users/nus/tsltaywb/ai/modulus/myprojects_2401/e186_wing_AoA_12/e186_wing_AoA_12_EG.py", line 614, in run
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/solver/solver.py", line 173, in solve
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 543, in _train_loop
loss, losses = self._cuda_graph_training_step(step)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 724, in _cuda_graph_training_step
self.loss_static, self.losses_static = self.compute_gradients(
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 76, in adam_compute_gradients
losses_minibatch = self.compute_losses(step)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/solver/solver.py", line 66, in compute_losses
return self.domain.compute_losses(step)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/domain/domain.py", line 147, in compute_losses
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/domain/constraint/continuous.py", line 130, in forward
self._output_vars = self.model(self._input_vars)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/graph.py", line 234, in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/eq/derivatives.py", line 99, in forward
grad = gradient(var, grad_var)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/eq/derivatives.py", line 38, in gradient
grad_outputs: List[Optional[torch.Tensor]] = [torch.ones_like(y, device=y.device)]
grad = torch.autograd.grad(
~~~~~~~~~~~~~~~~~~~ <--- HERE
RuntimeError: CUDA out of memory. Tried to allocate 38.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 9.06 MiB is free. Including non-PyTorch memory, this process has 39.38 GiB memory in use. Of the allocated memory 38.01 GiB is allocated by PyTorch, and 867.90 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
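
To judge whether the fragmentation hint at the end of that message applies here, the numbers can be pulled out of the text and compared. The following is only a rough sketch using the Python standard library: the function names `to_mib` and `summarize_oom` are made up for illustration, and the regular expressions assume the exact wording of the message above (including PyTorch's own "capacty" spelling), which may differ in other PyTorch versions.

```python
import re

# Excerpt of the OOM message quoted above, copied as-is.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 38.00 MiB. "
    "GPU 0 has a total capacty of 39.39 GiB of which 9.06 MiB is free. "
    "Including non-PyTorch memory, this process has 39.38 GiB memory in use. "
    "Of the allocated memory 38.01 GiB is allocated by PyTorch, "
    "and 867.90 MiB is reserved by PyTorch but unallocated."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(value: str, unit: str) -> float:
    """Convert a '38.01', 'GiB' pair into a number of MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def summarize_oom(message: str) -> dict:
    """Extract the memory figures (in MiB) from a PyTorch CUDA OOM message."""
    size = r"([\d.]+) (MiB|GiB)"
    patterns = {
        "requested": rf"Tried to allocate {size}",
        "total": rf"total capaci?ty of {size}",  # tolerate both spellings
        "free": rf"of which {size} is free",
        "allocated": rf"{size} is allocated by PyTorch",
        "reserved_unallocated": rf"{size} is reserved by PyTorch but unallocated",
    }
    stats = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in the message")
        stats[name] = to_mib(match.group(1), match.group(2))
    return stats


stats = summarize_oom(OOM_MESSAGE)
live_share = stats["allocated"] / stats["total"]
fragment_share = stats["reserved_unallocated"] / stats["total"]
print(f"live tensors: {live_share:.1%} of the GPU")
print(f"reserved but unallocated: {fragment_share:.1%} of the GPU")
```

With these figures, the reserved-but-unallocated part is only around 2% of the card, while live tensors take up almost all of it. So setting `max_split_size_mb` through `PYTORCH_CUDA_ALLOC_CONF` would probably recover very little here; the memory seems to be genuinely in use.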

So it doesn't really make sense that it still doesn't work after all these reductions. Does the enhanced gradient really require that much memory?
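
For reference, this is my understanding of the usual gradient-enhanced loss, written out as a sketch. It is an assumption on my part that Modulus implements this form; the symbols $R$ (PDE residual), $\lambda_g$ (weight of the gradient term), $d$ (number of input coordinates) and $N$ (number of collocation points) are my own notation.

```latex
\mathcal{L}
  = \underbrace{\frac{1}{N}\sum_{j=1}^{N} \bigl| R(\mathbf{x}_j) \bigr|^2}_{\text{standard PDE loss}}
  + \lambda_g \sum_{i=1}^{d} \frac{1}{N}\sum_{j=1}^{N}
    \left| \frac{\partial R}{\partial x_i}(\mathbf{x}_j) \right|^2
```

If the residual $R$ already contains second derivatives of the network outputs, as the viscous terms of the Navier-Stokes equations do, then each $\partial R / \partial x_i$ needs third derivatives. Each extra derivative is obtained by another `torch.autograd.grad` call on top of the previous graph, which would be consistent with the traceback failing inside `gradient` in `derivatives.py`, but I have not verified that this is the actual cause.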

I am facing the same problem