CUDA error: operation not permitted when stream is capturing

Running my Modulus code produces the following error:

[code]
[06:15:56] - attempting to restore from: outputs/Battery
[06:15:56] - Success loading optimizer: outputs/Battery/optim_checkpoint.0.pth
[06:15:56] - Success loading model: outputs/Battery/battery_network.0.pth
[06:15:57] - [step:          0] record constraint batch time:  4.146e-01s
[06:15:57] - [step:          0] saved checkpoint to outputs/Battery
[06:15:57] - [step:          0] loss:  2.148e+01
[06:16:09] - Attempting cuda graph building, this may take a bit...
Error executing job with overrides: []
Traceback (most recent call last):
  File "/modulus/modulus/trainer.py", line 728, in _cuda_graph_training_step
    self.loss_static, self.losses_static = self.compute_gradients(
  File "/modulus/modulus/trainer.py", line 54, in adam_compute_gradients
    losses_minibatch = self.compute_losses(step)
  File "/modulus/modulus/solver/solver.py", line 52, in compute_losses
    return self.domain.compute_losses(step)
  File "/modulus/modulus/domain/domain.py", line 133, in compute_losses
    constraint.forward()
  File "/modulus/modulus/domain/constraint/continuous.py", line 116, in forward
    self._output_vars = self.model(self._input_vars)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/modulus/modulus/graph.py", line 220, in forward
    outvar.update(e(outvar))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/modulus/modulus/utils/sympy/torch_printer.py", line 274, in forward
    output = self.torch_expr(args)
  File "<lambdifygenerated-7>", line 3, in _lambdifygenerated
    return (-3.85e-11*sqrt(c)*sqrt(c_s)*sqrt(28606 - c_s)*(-2.71828**(-Phi_1 + Phi_2) + 2.71828**(Phi_1 - Phi_2)) + j_n)
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 32, in wrapped
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 671, in __rpow__
    return torch.tensor(other, dtype=dtype, device=self.device) ** self
RuntimeError: CUDA error: operation not permitted when stream is capturing

[/code]

The culprit seems to be the constraint corresponding to the equation

(-3.85e-11*sqrt(c)*sqrt(c_s)*sqrt(28606 - c_s)*(-2.71828**(-Phi_1 + Phi_2) + 2.71828**(Phi_1 - Phi_2)) + j_n) 

as the traceback shows. What could be causing this? Could the exponential terms be too large for the gradients to be computed?

Hi @shubhamsp2195

This error occurs when a tensor or other CUDA object is created or transferred while a CUDA graph is being recorded. All CUDA objects need to be initialized and on the GPU before recording starts. I’m not sure exactly why this is occurring for you, but you can turn off CUDA graphs in your config.yaml by setting cuda_graphs: false.
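
One possible lead, going only by the traceback above: the last frame is `Tensor.__rpow__`, which calls `torch.tensor(other, dtype=dtype, device=self.device)` to turn the Python float `2.71828` into a tensor, and that is the kind of in-capture tensor creation described here. Below is a minimal sketch in plain PyTorch, not Modulus code, that rewrites `a**x` as `exp(x * ln(a))` so that `__rpow__` is never called. The function names are hypothetical, and it has not been tested whether this change alone lets graph capture succeed in your setup.

```python
import math

import torch

# ln(2.71828): the expression in the traceback uses this rounded constant, not exactly e.
LOG_BASE = math.log(2.71828)


def residual_original(c, c_s, Phi_1, Phi_2, j_n):
    # Same form as the lambdified expression in the traceback.
    # `2.71828 ** tensor` dispatches to Tensor.__rpow__.
    return (-3.85e-11 * torch.sqrt(c) * torch.sqrt(c_s) * torch.sqrt(28606 - c_s)
            * (-2.71828 ** (-Phi_1 + Phi_2) + 2.71828 ** (Phi_1 - Phi_2)) + j_n)


def residual_rewritten(c, c_s, Phi_1, Phi_2, j_n):
    # Uses a**x == exp(x * ln(a)); the float base never goes through __rpow__,
    # so the torch.tensor(...) call seen in the traceback is avoided.
    eta = (Phi_1 - Phi_2) * LOG_BASE
    return (-3.85e-11 * torch.sqrt(c) * torch.sqrt(c_s) * torch.sqrt(28606 - c_s)
            * (torch.exp(eta) - torch.exp(-eta)) + j_n)


if __name__ == "__main__":
    # Illustrative values only (CPU, float64); not taken from the original problem.
    kwargs = dict(
        c=torch.tensor([1000.0, 1200.0], dtype=torch.float64),
        c_s=torch.tensor([10000.0, 15000.0], dtype=torch.float64),
        Phi_1=torch.tensor([0.10, 0.30], dtype=torch.float64),
        Phi_2=torch.tensor([0.05, 0.40], dtype=torch.float64),
        j_n=torch.tensor([1e-5, -2e-5], dtype=torch.float64),
    )
    # Both forms should agree to floating-point tolerance.
    print(torch.allclose(residual_original(**kwargs), residual_rewritten(**kwargs)))
```

In a Modulus equation definition the analogous change would presumably be to write the term with sympy's `exp(Phi_1 - Phi_2)` rather than a float raised to a symbolic power, on the assumption that the torch printer maps `exp` to `torch.exp`. Note that `exp` uses the exact value of e, so results would differ very slightly from the rounded `2.71828` base.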