I am new to Modulus and am running the LDC case. The outputs folder is created, but when I run the case I hit the following issue related to the validation data:
[00:48:58] - JIT using the NVFuser TorchScript backend
[00:48:58] - JitManager: {'_enabled': True, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[00:48:58] - GraphManager: {'_func_arch': False, '_debug': False, '_func_arch_allow_partial_hessian': True}
Error executing job with overrides:
ValueError: could not convert string to float: 'oid sha256:4c68adf2b0a04c53f0abd4d3920f3fec618669399638dd5ece84785f474d1fa6'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "ldc_2d.py", line 82, in run
openfoam_var = csv_to_dict(
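For reference, that ValueError usually means the validation CSV under the example's openfoam/ directory is still a Git LFS pointer file rather than the actual data, which is why csv_to_dict finds the text "oid sha256:..." where it expects numbers. A minimal check, assuming the standard LDC example layout (the path below may differ in your clone):

from pathlib import Path

# A Git LFS pointer is a short text file beginning with a "version" line and
# containing the "oid sha256:..." string from the ValueError instead of CSV data.
csv_path = Path("openfoam/cavity_uniformVel0.csv")  # assumed path from the LDC example
head = csv_path.read_text(errors="ignore")[:200]

if head.startswith("version https://git-lfs.github.com"):
    print("Still an LFS pointer; run `git lfs install` and `git lfs pull` in the repo, then retry.")
else:
    print("File looks like real CSV data, first line:", head.splitlines()[0])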
Hello, so I had previously installed git-lfs (git-lfs/3.2.0 (GitHub; linux amd64; go 1.18.2)) and re-cloned the examples directory. This time I am getting a different error, shown below. Is this mainly because of the installed PyTorch version?
/examples/ldc# python3 ldc_2d.py
[12:24:27] - JIT using the NVFuser TorchScript backend
[12:24:27] - Disabling JIT because functorch does not work with it.
[12:24:27] - JitManager: {'_enabled': False, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[12:24:27] - GraphManager: {'_func_arch': True, '_debug': False, '_func_arch_allow_partial_hessian': True}
[12:24:31] - Arch Node: flow_network has been converted to a FuncArch node.
[12:24:31] - Installed PyTorch version 1.13.0+cu116 is not TorchScript supported in Modulus. Version 1.13.0a0+d321be6 is officially supported.
[12:24:31] - attempting to restore from: outputs/ldc_2d
[12:24:31] - optimizer checkpoint not found
[12:24:31] - model flow_network.0.pth not found
Error executing job with overrides:
It is actually a very long message; I have copy-pasted part of it below:
Error executing job with overrides:
Traceback (most recent call last):
File "ldc_2d.py", line 116, in run
slv.solve()
File "/usr/local/lib/python3.8/dist-packages/modulus-22.9-py3.8.egg/modulus/solver/solver.py", line 159, in solve
self._train_loop(sigterm_handler)
File "/usr/local/lib/python3.8/dist-packages/modulus-22.9-py3.8.egg/modulus/trainer.py", line 593, in _train_loop
self._record_constraints()
File "/usr/local/lib/python3.8/dist-packages/modulus-22.9-py3.8.egg/modulus/trainer.py", line 275, in _record_constraints
self.record_constraints()
File "/usr/local/lib/python3.8/dist-packages/modulus-22.9-py3.8.egg/modulus/solver/solver.py", line 116, in record_constraints
self.domain.rec_constraints(self.network_dir)
File "/usr/local/lib/python3.8/dist-packages/modulus-22.9-py3.8.egg/modulus/domain/domain.py", line 45, in rec_constraints
constraint.save_batch(constraint_data_dir + key)
File "/usr/local/lib/python3.8/dist-packages/modulus-22.9-py3.8.egg/modulus/domain/constraint/continuous.py", line 60, in save_batch
pred_outvar = modl(invar)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/functorch/_src/eager_transforms.py", line 113, in _autograd_grad
grad_inputs = torch.autograd.grad(diff_outputs, inputs, grad_outputs,
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 300, in grad
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Thanks. This seems to be an issue with the gradient calculations. Please try turning off functorch for the gradient calculations by adding the following to your config.yaml file for this problem:
graph:
  func_arch: false
If that does not work, I would then also try shutting off CUDA graphs by setting cuda_graphs: false in the same file.
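If you want to confirm the flags were actually picked up, one option is to print the resolved config at the top of run(). This is just a sketch, assuming the usual Modulus 22.09 example layout with conf/config.yaml next to ldc_2d.py:

import modulus
from modulus.hydra import ModulusConfig
from omegaconf import OmegaConf

@modulus.main(config_path="conf", config_name="config")
def run(cfg: ModulusConfig) -> None:
    # Dump the resolved Hydra config; check that graph.func_arch and
    # cuda_graphs show the values you set before the solver is built.
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    run()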
Is this an error or just a warning? Typically that message is only a warning that TorchScript JIT is not being used, and JIT gives little to no performance gain most of the time anyway. You can also silence it by turning JIT off in your config with jit: false.
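Also, since the underlying failure is "CUDA error: unknown error", it may be worth a quick sanity check of the CUDA setup itself, independent of Modulus; a minimal sketch:

import torch

print(torch.__version__)          # e.g. 1.13.0+cu116 from your log
print(torch.version.cuda)         # CUDA toolkit version PyTorch was built against
print(torch.cuda.is_available())  # should be True
if torch.cuda.is_available():
    x = torch.ones(8, device="cuda")
    print((x * 2).sum().item())   # trivial kernel launch; should print 16.0

If that small script also fails, the problem is more likely the driver or container setup than the example or the config.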