Torch Autograd fails on single GPU when using continuous parameterization

For a while I’ve been running (and developing) my Modulus-based simulations on a cluster via a converted Singularity container. Long story short: I didn’t realize the differences between Docker and Singularity, and now I need to fix my Singularity setup. Rather than trying to fix the Singularity submission system while also improving my simulation program, I returned to development on my local machine.

I’ve created two parameterizations: one that modifies the kinematic viscosity nu to run a single STL file over a range of Re values and, separately, one that uses a discrete geometry and multiple STL files to compare geometric changes.

Prior to my forced return to development on my local machine, both of these were working. Now, when I try to run the ‘nu’ parameterization, the same code that was working on the cluster, I get the following when the Solver starts up:

  File "/windtunnel/template/WindTunnel.py", line 321, in run_solver
    self.solver.solve()
  File "/modulus/modulus/solver/solver.py", line 159, in solve
    self._train_loop(sigterm_handler)
  File "/modulus/modulus/trainer.py", line 521, in _train_loop
    loss, losses = self._cuda_graph_training_step(step)
  File "/modulus/modulus/trainer.py", line 702, in _cuda_graph_training_step
    self.loss_static, self.losses_static = self.compute_gradients(
  File "/modulus/modulus/trainer.py", line 54, in adam_compute_gradients
    losses_minibatch = self.compute_losses(step)
  File "/modulus/modulus/solver/solver.py", line 52, in compute_losses
    return self.domain.compute_losses(step)
  File "/modulus/modulus/domain/domain.py", line 133, in compute_losses
    constraint.forward()
  File "/modulus/modulus/domain/constraint/continuous.py", line 116, in forward
    self._output_vars = self.model(self._input_vars)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/modulus/modulus/graph.py", line 220, in forward
    outvar.update(e(outvar))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
    return forward_call(*input, **kwargs)
  File "/modulus/modulus/eq/derivatives.py", line 85, in forward
    grad = gradient(var, grad_var)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/modulus/modulus/eq/derivatives.py", line 24, in gradient
    """
    grad_outputs: List[Optional[torch.Tensor]] = [torch.ones_like(y, device=y.device)]
    grad = torch.autograd.grad(
           ~~~~~~~~~~~~~~~~~~~ <--- HERE
        [
            y,
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

My input keys into the NS for this parameterization are defined as:

input_keys=[Key("x"), Key("y"), Key("z"), Key("nu")]

I pass a single float value to my ZeroEquation step for nu and then pass ZeroEq.equations["nu"] as nu to my NS equation.
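
For context, the relevant wiring looks roughly like this (a trimmed sketch, not my exact WindTunnel.py; the nu, rho, and max_distance values and the arch settings here are placeholders):

from modulus.eq.pdes.navier_stokes import NavierStokes
from modulus.eq.pdes.turbulence_zero_eq import ZeroEquation
from modulus.key import Key
from modulus.models.fully_connected import FullyConnectedArch

# ZeroEquation gets a constant float for nu; its "nu" output (the effective
# viscosity) is then handed to NavierStokes
ze = ZeroEquation(nu=1e-5, max_distance=0.5, rho=1.0, dim=3, time=False)
ns = NavierStokes(nu=ze.equations["nu"], rho=1.0, dim=3, time=False)

# the network also takes the parameterized nu as an extra input
flow_net = FullyConnectedArch(
    input_keys=[Key("x"), Key("y"), Key("z"), Key("nu")],
    output_keys=[Key("u"), Key("v"), Key("w"), Key("p")],
)
nodes = ze.make_nodes() + ns.make_nodes() + [flow_net.make_node(name="flow_network")]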

My working (on the cluster) version passed the following parameterization:

pr = Parameterization({Parameter("nu"): np.linspace(0.000003, 0.00005, 10)[:, None]})

Following the ThreeFin example, each constraint is passed the following parameterization:

pr = {Parameter("nu"): (0.000003, 0.00005)}
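
That dict then goes to each constraint via the parameterization argument, roughly like this (sketch only; the geometry, outvar, and batch size here are placeholders for my real setup):

from modulus.domain.constraint import PointwiseInteriorConstraint

interior = PointwiseInteriorConstraint(
    nodes=nodes,
    geometry=volume_geo,  # placeholder for my STL-derived interior geometry
    outvar={"continuity": 0, "momentum_x": 0, "momentum_y": 0, "momentum_z": 0},
    batch_size=4000,
    parameterization=pr,  # the nu range shown above
)
domain.add_constraint(interior, "interior")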

I get the same error whether I specify the vertical array or keep it as a range. Perhaps I’m still doing something incorrectly, but it seems odd that it would work on the cluster and not locally. I thought the parallel (SLURM) part might be keeping more in memory for communication purposes, so I modified the torch.autograd.grad call in derivatives.py to include retain_graph=True (sketched below), but the graph must be getting freed by a separate call, since this resulted in the same error.
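
For clarity, the change I tried inside gradient() in modulus/eq/derivatives.py looked roughly like this (reconstructed from the traceback above, so the surrounding arguments may not be exact; the only intentional change is the retain_graph line):

grad_outputs: List[Optional[torch.Tensor]] = [torch.ones_like(y, device=y.device)]
grad = torch.autograd.grad(
    [y],
    x,
    grad_outputs=grad_outputs,
    create_graph=True,
    retain_graph=True,  # added this; the same RuntimeError still appears
    allow_unused=True,
)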

Let me know if you need any other info, or if I’m just missing something.

*Edit, because I forgot another piece:
The second parameterization is set up in a very similar way, but uses a discrete geometry with multiple STL files. The parameter I use is just an integer index referring to the modified STL file. Aside from the discrete geometry part, this is the main difference between the two.
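
For reference, that one is built roughly like this (the parameter name geo_id and the count are placeholders; each index maps to one of the pre-generated STL variants):

import numpy as np
from modulus.geometry.parameterization import Parameterization, Parameter

n_variants = 5  # placeholder: number of modified STL files
pr_geo = Parameterization({Parameter("geo_id"): np.arange(n_variants, dtype=float)[:, None]})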

Hi @patterson

Perhaps a little late, but it is very unusual for things to work differently on different systems. My only guess right now is that Modulus is somehow getting confused between the parameterized nu and ZeroEq.equations["nu"].

I.e., when unrolling the symbolic graph, Modulus may be injecting constants from the parameterization into the graph, which then causes things to get messed up as PyTorch attempts to compute autograd on what it thinks are outputs of the turbulence equation but are really just constants. (This is just speculation.)

I would perhaps try using another name for the input key Key("nu") that you give to the ZeroEquation (e.g. "nu0") and then use that name in your parameterization, along the lines of the sketch below. I’m just guessing; I don’t think this explains why some items work on your cluster vs. locally.
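
Something along these lines (an untested sketch; the max_distance/rho values are placeholders, and I’m assuming ZeroEquation will take a sympy Symbol for nu here the way NavierStokes does):

from sympy import Symbol
from modulus.key import Key
from modulus.geometry.parameterization import Parameter
from modulus.eq.pdes.navier_stokes import NavierStokes
from modulus.eq.pdes.turbulence_zero_eq import ZeroEquation

# give the parameterized viscosity its own name so it cannot collide with
# ZeroEquation's "nu" output (the effective viscosity node in the graph)
input_keys = [Key("x"), Key("y"), Key("z"), Key("nu0")]
pr = {Parameter("nu0"): (0.000003, 0.00005)}

ze = ZeroEquation(nu=Symbol("nu0"), max_distance=0.5, rho=1.0, dim=3, time=False)
ns = NavierStokes(nu=ze.equations["nu"], rho=1.0, dim=3, time=False)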

Not too late… I’ll give this a try. It’s secondary to what I’m trying to achieve with the STL parameterization, but it’d still be nice to have.