For a while I've been running (and developing) my Modulus-based simulations on a cluster via a converted Singularity container. Long story short, I didn't realize the differences between Docker and Singularity and now need to fix my Singularity setup. Rather than trying to fix the Singularity submission system while also improving my simulation program, I returned to development on my local machine.
I've created two parameterizations: one that modifies the kinematic viscosity nu to run a single STL file over a range of Re values and, separately, one that uses a discrete geometry with multiple STL files to compare geometric changes.
Prior to my forced return to local development, both of these were working. Now, when I try to run the 'nu' parameterization, with the same code that was working on the cluster, I get the following when the Solver starts up:
File "/windtunnel/template/WindTunnel.py", line 321, in run_solver
self.solver.solve()
File "/modulus/modulus/solver/solver.py", line 159, in solve
self._train_loop(sigterm_handler)
File "/modulus/modulus/trainer.py", line 521, in _train_loop
loss, losses = self._cuda_graph_training_step(step)
File "/modulus/modulus/trainer.py", line 702, in _cuda_graph_training_step
self.loss_static, self.losses_static = self.compute_gradients(
File "/modulus/modulus/trainer.py", line 54, in adam_compute_gradients
losses_minibatch = self.compute_losses(step)
File "/modulus/modulus/solver/solver.py", line 52, in compute_losses
return self.domain.compute_losses(step)
File "/modulus/modulus/domain/domain.py", line 133, in compute_losses
constraint.forward()
File "/modulus/modulus/domain/constraint/continuous.py", line 116, in forward
self._output_vars = self.model(self._input_vars)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
return forward_call(*input, **kwargs)
File "/modulus/modulus/graph.py", line 220, in forward
outvar.update(e(outvar))
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1186, in _call_impl
return forward_call(*input, **kwargs)
File "/modulus/modulus/eq/derivatives.py", line 85, in forward
grad = gradient(var, grad_var)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "/modulus/modulus/eq/derivatives.py", line 24, in gradient
"""
grad_outputs: List[Optional[torch.Tensor]] = [torch.ones_like(y, device=y.device)]
grad = torch.autograd.grad(
~~~~~~~~~~~~~~~~~~~ <--- HERE
[
y,
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
My input keys into the NS for this parameterization are defined as:
input_keys=[Key("x"), Key("y"), Key("z"), Key("nu")]
I pass a single float value to my ZeroEquation step for nu and then pass ZeroEq.equations["nu"] to my NS equation.
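For reference, the nu wiring in my script looks roughly like this (a minimal sketch from memory; ZeroEquation's max_distance, the network output keys, and the exact import paths are placeholders from my setup and may differ in other Modulus versions):

from modulus.key import Key
from modulus.eq.pdes.navier_stokes import NavierStokes
from modulus.eq.pdes.turbulence_zero_eq import ZeroEquation
from modulus.models.fully_connected import FullyConnectedArch

# turbulence model takes the float viscosity; its "nu" output feeds Navier-Stokes
ze = ZeroEquation(nu=1.0e-5, max_distance=0.5, rho=1.0, dim=3, time=False)
ns = NavierStokes(nu=ze.equations["nu"], rho=1.0, dim=3, time=False)

# the network sees "nu" as an extra input alongside the spatial coordinates
flow_net = FullyConnectedArch(
    input_keys=[Key("x"), Key("y"), Key("z"), Key("nu")],
    output_keys=[Key("u"), Key("v"), Key("w"), Key("p")],
)
nodes = ns.make_nodes() + ze.make_nodes() + [flow_net.make_node(name="flow_network")]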
My working (on the cluster) version passed the following parameterization:
pr = Parameterization({Parameter("nu"): np.linspace(0.000003, 0.00005, 10)[:, None]})
Following the ThreeFin example, each constraint is passed the following parameterization:
pr = {Parameter("nu"): (0.000003, 0.00005)}
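In both forms, the parameterization is then handed to each constraint the same way (a sketch with placeholder geometry, outvar, and batch size; "geo" and "nodes" stand in for the real objects from my script):

from modulus.domain.constraint import PointwiseInteriorConstraint
from modulus.geometry.parameterization import Parameterization, Parameter

pr = Parameterization({Parameter("nu"): (0.000003, 0.00005)})  # range form

interior = PointwiseInteriorConstraint(
    nodes=nodes,      # placeholder: node list built from NS / ZeroEq / network
    geometry=geo,     # placeholder: the STL-derived geometry
    outvar={"continuity": 0, "momentum_x": 0, "momentum_y": 0, "momentum_z": 0},
    batch_size=4000,
    parameterization=pr,
)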
I get the same error whether I specify the vertical array or keep it as a range. Perhaps I'm still doing something incorrectly, but it seems odd that it would work on the cluster and not locally. I thought the parallel (Slurm) part might be keeping more of the graph in memory for communication purposes, so I modified the torch.autograd.grad call in derivatives.py to include retain_graph=True, but the graph must be getting freed by a separate call, since this resulted in the same error.
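For completeness, the change I tried in derivatives.py was essentially this (reconstructed from the traceback above; the rest of the torch.autograd.grad call was left as shipped):

# inside modulus/eq/derivatives.py, gradient()
grad_outputs: List[Optional[torch.Tensor]] = [torch.ones_like(y, device=y.device)]
grad = torch.autograd.grad(
    [y],
    x,
    grad_outputs=grad_outputs,
    create_graph=True,
    retain_graph=True,  # added; produced the same RuntimeError
)

(If create_graph=True is already set there, retain_graph defaults to True anyway, which would explain why adding it changed nothing.)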
Let me know if you need any other info or if I'm just missing something.
*Edit, because I forgot another piece:
The second parameterization is set up in a very similar way, but uses a discrete geometry built from multiple STL files. The parameter there is just an integer index referring to the modified STL file. Aside from the discrete-geometry part, that is the main difference between the two.
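For comparison, the discrete case builds its parameterization along these lines ("geo_id" and the file list are placeholder names, not the actual identifiers in my script):

import numpy as np
from modulus.geometry.parameterization import Parameterization, Parameter

stl_files = ["body_0.stl", "body_1.stl", "body_2.stl"]  # placeholder file list
geo_ids = np.arange(len(stl_files))[:, None]            # one integer index per modified STL
pr_discrete = Parameterization({Parameter("geo_id"): geo_ids})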