Out of memory when running Modulus with enhanced gradient

Hi,

I have a Modulus code that predicts flow past airfoils. It runs fine, but we are trying to improve its accuracy, so I tried adding the enhanced gradient to it. A simple debug version with a very low batch size and simple geometry started training without error.

However, when I ran the actual case, I kept getting out-of-memory errors, even after repeatedly reducing the batch size as well as the number of layers and neurons.
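For reference, this is roughly the kind of reduction I made; the keys and values below are simplified placeholders, not my actual setup:

```python
# Illustrative sketch of the reductions (placeholder keys/values, not my real case).
from modulus.sym.key import Key
from modulus.sym.models.fully_connected import FullyConnectedArch

# Network shrunk well below the default 6 layers x 512 neurons:
flow_net = FullyConnectedArch(
    input_keys=[Key("x"), Key("y"), Key("z")],
    output_keys=[Key("u"), Key("v"), Key("w"), Key("p")],
    nr_layers=3,
    layer_size=64,
)

# ...and the batch_size of every constraint cut down, e.g.
# PointwiseInteriorConstraint(..., batch_size=512)  # previously several thousand
```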

Here's the error:

Error executing job with overrides:
Traceback (most recent call last):
  File "/scratch/users/nus/tsltaywb/ai/modulus/myprojects_2401/e186_wing_AoA_12/e186_wing_AoA_12_EG.py", line 614, in run
    slv.solve()
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/solver/solver.py", line 173, in solve
    self._train_loop(sigterm_handler)
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 543, in _train_loop
    loss, losses = self._cuda_graph_training_step(step)
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 724, in _cuda_graph_training_step
    self.loss_static, self.losses_static = self.compute_gradients(
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 76, in adam_compute_gradients
    losses_minibatch = self.compute_losses(step)
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/solver/solver.py", line 66, in compute_losses
    return self.domain.compute_losses(step)
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/domain/domain.py", line 147, in compute_losses
    constraint.forward()
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/domain/constraint/continuous.py", line 130, in forward
    self._output_vars = self.model(self._input_vars)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/graph.py", line 234, in forward
    outvar.update(e(outvar))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/eq/derivatives.py", line 99, in forward
    grad = gradient(var, grad_var)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/modulus/sym/eq/derivatives.py", line 38, in gradient
    """
    grad_outputs: List[Optional[torch.Tensor]] = [torch.ones_like(y, device=y.device)]
    grad = torch.autograd.grad(
           ~~~~~~~~~~~~~~~~~~~ <--- HERE
        [
            y,
RuntimeError: CUDA out of memory. Tried to allocate 38.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 9.06 MiB is free. Including non-PyTorch memory, this process has 39.38 GiB memory in use. Of the allocated memory 38.01 GiB is allocated by PyTorch, and 867.90 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
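For what it's worth, the allocator hint at the end of the message can be applied before the first CUDA allocation like below; the 128 MiB value is just an example, and it only helps with fragmentation, which looks minor here (867.90 MiB reserved-but-unallocated against 38.01 GiB allocated):

```python
# Apply the allocator hint from the error message.
# Must be set before torch makes its first CUDA allocation.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import after setting the env var to be safe
```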

So it doesn't really make sense that it still fails after all these reductions. Does the enhanced gradient really require that much memory?
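My understanding is that the enhanced gradient adds higher-order derivative terms to the loss, and each extra torch.autograd.grad call with create_graph=True has to keep the previous graph's activations alive so the next derivative can be taken, so peak memory grows with every derivative order, multiplied over all the points in the batch. A minimal sketch outside Modulus (toy network and sizes are made up) that shows the effect:

```python
import torch

# Toy illustration: each derivative order retains another autograd graph.
x = torch.randn(100_000, 2, device="cuda", requires_grad=True)
net = torch.nn.Sequential(
    torch.nn.Linear(2, 512), torch.nn.Tanh(), torch.nn.Linear(512, 1)
).cuda()

u = net(x)
ones = torch.ones_like(u)

# First order: create_graph=True keeps the forward graph alive for reuse.
du = torch.autograd.grad(u, x, grad_outputs=ones, create_graph=True)[0]
print(torch.cuda.memory_allocated() / 2**20, "MiB after 1st order")

# Second order (what gradient-enhanced residual terms need): the graph of the
# first backward pass is now retained as well, so memory grows again.
d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
print(torch.cuda.memory_allocated() / 2**20, "MiB after 2nd order")
```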

I am facing the same problem.