Error running Modulus with ntk, stan and lr_annealing

Hi,

I have a Modulus script that predicts flow past airfoils. It runs fine, but we are trying to improve its accuracy, so I tried adding NTK, Stan and LR annealing to the code.
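For reference, the gist of the change was along these lines (a minimal sketch rather than my full script; the keys and sizes are placeholders, and NTK and LR annealing are switched on through the Hydra config rather than in the Python code):

```python
# Minimal sketch (placeholder keys/sizes): swap the GELU network for one using the
# Stan activation. NTK and LR annealing are toggled in the Hydra config instead,
# e.g. training.ntk.use_ntk: true and the loss group set to lr_annealing.
from modulus.sym.key import Key
from modulus.sym.models.activation import Activation
from modulus.sym.models.fully_connected import FullyConnectedArch

flow_net = FullyConnectedArch(
    input_keys=[Key("x"), Key("y"), Key("alpha"), Key("alpha2")],  # placeholder inputs
    output_keys=[Key("u"), Key("v"), Key("p")],                    # placeholder outputs
    layer_size=512,
    nr_layers=6,
    activation_fn=Activation.STAN,
)
nodes = [flow_net.make_node(name="flow_network")]
```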

However, I got the error:

root@3bd493862967:/modulus_projects/FX63-180_2_element_airfoil_varyAoA2_t2_Re300# python FX63-180_2_element_airfoil_4var_gelu_flow_AoA_0to10_AoA2_0to10_Re300_ntk_stan_lra.py > log_FX63-180_2_element_airfoil_4var_gelu_flow_AoA_0to10_AoA2_0to10_Re300_ntk_stan_lra.txt
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See Changes to job's runtime working directory | Hydra for more information.
ret = run_job(
Error executing job with overrides:
Traceback (most recent call last):
File "/modulus_projects/FX63-180_2_element_airfoil_varyAoA2_t2_Re300/FX63-180_2_element_airfoil_4var_gelu_flow_AoA_0to10_AoA2_0to10_Re300_ntk_stan_lra.py", line 442, in run
slv.solve()
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/solver/solver.py", line 173, in solve
self._train_loop(sigterm_handler)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 551, in _train_loop
loss, losses = self.compute_gradients(
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/trainer.py", line 76, in adam_compute_gradients
losses_minibatch = self.compute_losses(step)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/solver/solver.py", line 66, in compute_losses
return self.domain.compute_losses(step)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/domain/domain.py", line 157, in compute_losses
losses, self.ntk_weights = self.ntk(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/loss/aggregator.py", line 607, in forward
ntk_dict = self.group_ntk(constraint.model, constraint_losses)
File "/usr/local/lib/python3.10/dist-packages/modulus/sym/loss/aggregator.py", line 566, in group_ntk
grad = torch.autograd.grad(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 394, in grad
result = Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

May I know what’s wrong? The traceback does mention the NTK part of the code.
I also realised that if I run in debug mode with a low batch size and simple geometry, training can start. Does that mean it’s an out-of-memory problem? However, unlike other cases, there’s no explicit mention of running out of memory.
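If it helps narrow this down, I can add a quick GPU memory printout around the first training steps to see whether the card is filling up when the NTK pass runs (a small sketch using standard PyTorch calls; log_gpu_memory is just a throwaway helper name):

```python
# Throwaway helper (hypothetical name) to print GPU memory usage at a given point,
# e.g. just before/after the NTK weight computation, to test the out-of-memory theory.
import torch

def log_gpu_memory(tag: str) -> None:
    alloc = torch.cuda.memory_allocated() / 2**30      # memory held by live tensors
    reserved = torch.cuda.memory_reserved() / 2**30    # memory reserved by the caching allocator
    total = torch.cuda.get_device_properties(0).total_memory / 2**30
    print(f"[{tag}] allocated {alloc:.2f} GiB | reserved {reserved:.2f} GiB | total {total:.2f} GiB")
```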

Meanwhile, I’ll remove some of these new features and try again.

Hi @tsltaywb

I have not seen this before; a quick Google search suggests it could be from some out-of-range issue. Admittedly, I am not super familiar with the NTK part of the code.

Have you tried some of the alternative architectures in the package?
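For example, swapping in the Fourier feature network is usually a small change (just a sketch; the input/output keys are placeholders and would need to match your existing nodes):

```python
# Sketch of trying one of the alternative architectures (Fourier feature network);
# the keys are placeholders and should match the existing flow network.
from modulus.sym.key import Key
from modulus.sym.models.fourier_net import FourierNetArch

flow_net = FourierNetArch(
    input_keys=[Key("x"), Key("y"), Key("alpha"), Key("alpha2")],
    output_keys=[Key("u"), Key("v"), Key("p")],
)
nodes = [flow_net.make_node(name="flow_network")]
```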

Sure, I’ll try other advanced schemes. Thanks!