Error running Modulus with ntk, stan and lr_annealing

Hi,

I have a Modulus script that predicts flow past airfoils. It runs fine, but we are trying to improve its accuracy, so I tried adding NTK, Stan and LR annealing to the code.
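For reference, the gist of the change was along these lines (a minimal sketch rather than my full script; the keys and sizes are placeholders, and NTK and LR annealing are switched on through the Hydra config rather than in the Python code):

```python
# Minimal sketch (placeholder keys/sizes): swap the GELU network for one using the
# Stan activation. NTK and LR annealing are toggled in the Hydra config instead,
# e.g. training.ntk.use_ntk: true and the loss group set to lr_annealing.
from modulus.sym.key import Key
from modulus.sym.models.activation import Activation
from modulus.sym.models.fully_connected import FullyConnectedArch

flow_net = FullyConnectedArch(
    input_keys=[Key("x"), Key("y"), Key("alpha"), Key("alpha2")],  # placeholder inputs
    output_keys=[Key("u"), Key("v"), Key("p")],                    # placeholder outputs
    layer_size=512,
    nr_layers=6,
    activation_fn=Activation.STAN,
)
nodes = [flow_net.make_node(name="flow_network")]
```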

However, I got the error:

root@3bd493862967:/modulus_projects/FX63-180_2_element_airfoil_varyAoA2_t2_Re300# python FX63-180_2_element_airfoil_4var_gelu_flow_AoA_0to10_AoA2_0to10_Re300_ntk_stan_lra.py > log_FX63-180_2_element_airfoil_4var_gelu_flow_AoA_0to10_AoA2_0to10_Re300_ntk_stan_lra.txt
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See Changes to job's runtime working directory | Hydra for more information.
ret = run_job(
Error executing job with overrides:
Traceback (most recent call last):
File "/modulus_projects/FX63-180_2_element_airfoil_varyAoA2_t2_Re300/FX63-180_2_element_airfoil_4var_gelu_flow_AoA_0to10_AoA2_0to10_Re300_ntk_stan_lra.py", line 442, in run
slv.solve()
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/solver/solver.py", line 173, in solve
self._train_loop(sigterm_handler)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 551, in _train_loop
loss, losses = self.compute_gradients(
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 76, in adam_compute_gradients
losses_minibatch = self.compute_losses(step)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/solver/solver.py", line 66, in compute_losses
return self.domain.compute_losses(step)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/domain/domain.py", line 157, in compute_losses
losses, self.ntk_weights = self.ntk(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/loss/aggregator.py", line 607, in forward
ntk_dict = self.group_ntk(constraint.model, constraint_losses)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/loss/aggregator.py", line 566, in group_ntk
grad = torch.autograd.grad(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 394, in grad
result = Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

May I know what’s wrong? The traceback does mention the NTK part of the code.
I also realised that if I run in debug mode with a low batch size and simple geometry, training can start. Does that mean it’s an out-of-memory problem? However, unlike other cases, there’s no explicit mention of running out of memory.
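If it helps narrow this down, I can add a quick GPU memory printout around the first training steps to see whether the card is filling up when the NTK pass runs (a small sketch using standard PyTorch calls; log_gpu_memory is just a throwaway helper name):

```python
# Throwaway helper (hypothetical name) to print GPU memory usage at a given point,
# e.g. just before/after the NTK weight computation, to test the out-of-memory theory.
import torch

def log_gpu_memory(tag: str) -> None:
    alloc = torch.cuda.memory_allocated() / 2**30      # memory held by live tensors
    reserved = torch.cuda.memory_reserved() / 2**30    # memory reserved by the caching allocator
    total = torch.cuda.get_device_properties(0).total_memory / 2**30
    print(f"[{tag}] allocated {alloc:.2f} GiB | reserved {reserved:.2f} GiB | total {total:.2f} GiB")
```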

Meanwhile, I’ll remove some of these new features and try again.

Hi @tsltaywb

I have not seen this before; a quick Google search suggests it could be from some out-of-range issue. Admittedly, I am not super familiar with the NTK part of the code.

Have you tried some of the alternative architectures in the package?
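For example, swapping in the Fourier feature network is usually a small change (just a sketch; the input/output keys are placeholders and would need to match your existing nodes):

```python
# Sketch of trying one of the alternative architectures (Fourier feature network);
# the keys are placeholders and should match the existing flow network.
from modulus.sym.key import Key
from modulus.sym.models.fourier_net import FourierNetArch

flow_net = FourierNetArch(
    input_keys=[Key("x"), Key("y"), Key("alpha"), Key("alpha2")],
    output_keys=[Key("u"), Key("v"), Key("p")],
)
nodes = [flow_net.make_node(name="flow_network")]
```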

Sure, I’ll try other advanced schemes. Thanks!