Can I use RTX8000?

I want to use NVIDIA Modulus 22.09 with Docker on Ubuntu 22.04, but I get an error.

root@50a7161248f2:/examples/examples/three_fin_2d# python heat_sink.py
/opt/conda/lib/python3.8/site-packages/hydra/_internal/callbacks.py:26: UserWarning: Callback ModulusCallback.on_job_start raised RuntimeError: Running CUDA fuser is only supported on CUDA builds.
warnings.warn(
[02:57:43] - Arch Node: heat_network has been converted to a FuncArch node.
[02:57:49] - Arch Node: flow_network has been converted to a FuncArch node.
[02:57:50] - Arch Node: heat_network has been converted to a FuncArch node.
[02:57:51] - Arch Node: flow_network has been converted to a FuncArch node.
[02:57:51] - attempting to restore from: outputs/heat_sink
[02:57:51] - optimizer checkpoint not found
[02:57:51] - model flow_network.0.pth not found
[02:57:51] - model heat_network.0.pth not found
Error executing job with overrides:
Traceback (most recent call last):
  File "heat_sink.py", line 275, in run
    slv.solve()
  File "/modulus/modulus/solver/solver.py", line 159, in solve
    self._train_loop(sigterm_handler)
  File "/modulus/modulus/trainer.py", line 521, in _train_loop
    loss, losses = self._cuda_graph_training_step(step)
  File "/modulus/modulus/trainer.py", line 694, in _cuda_graph_training_step
    self.warmup_stream = torch.cuda.Stream()
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/streams.py", line 34, in __new__
    return super(Stream, cls).__new__(cls, priority=priority, **kwargs)
RuntimeError: CUDA error: no CUDA-capable device is detected
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

The driver version is above 515, but it still fails.

$ nvidia-smi
Sat Dec 17 12:00:12 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     Off  | 00000000:D8:00.0 Off |                  Off |
| 33%   31C    P8    11W / 260W |      5MiB / 49152MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1857      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
WARNING: infoROM is corrupted at gpu 0000:D8:00.0

Can Modulus run on an RTX 8000?

Hi @con2

It seems this error is occurring with CUDA graphs. We don't currently test Modulus on the RTX 8000, so unfortunately I don't have a complete solution (however, we have tested it fine on other Quadro cards).
I would try turning off CUDA graphs in your config.yaml file:

cuda_graphs: False

Does the baseline Helmholtz example work for you?
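
Also, as a quick sanity check that's independent of Modulus (plain PyTorch calls only, nothing Modulus-specific), you could confirm inside the container that PyTorch can see the GPU at all, since the traceback reports "no CUDA-capable device is detected":

import torch

# If this prints False / 0, the container itself cannot see the GPU
# (e.g. it was not started with GPU access), and the
# "no CUDA-capable device is detected" error would happen regardless
# of the CUDA-graphs setting.
print("torch version  :", torch.__version__)
print("CUDA available :", torch.cuda.is_available())
print("device count   :", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0 name  :", torch.cuda.get_device_name(0))
    # Creating a stream is exactly the call that fails in your traceback.
    print("stream created :", torch.cuda.Stream())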

Thanks for the advice.
Results are as follows.

helmholtz.py: Works

root@9d811d51116e:/examples/examples/helmholtz# python helmholtz.py
[23:50:39] - JIT using the NVFuser TorchScript backend
[23:50:39] - JitManager: {'_enabled': True, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[23:50:39] - GraphManager: {'_func_arch': False, '_debug': False, '_func_arch_allow_partial_hessian': True}
[23:50:43] - attempting to restore from: outputs/helmholtz
[23:50:43] - optimizer checkpoint not found
[23:50:43] - model wave_network.0.pth not found
~~~
[00:06:43] - [step:      19900] loss:  1.077e-02, time/iteration:  4.483e+01 ms
[00:06:48] - [step:      20000] record constraint batch time:  3.775e-02s
[00:06:49] - [step:      20000] record validators time:  5.587e-01s
[00:06:49] - [step:      20000] saved checkpoint to outputs/helmholtz
[00:06:49] - [step:      20000] loss:  1.047e-02, time/iteration:  5.821e+01 ms
[00:06:49] - [step:      20000] reached maximum training steps, finished training!

heat_sink.py: Does not work
heat_sink.py with cuda_graphs: False: Works, but too slow

root@9d811d51116e:/examples/examples/three_fin_2d# python heat_sink.py
/opt/conda/lib/python3.8/site-packages/hydra/_internal/callbacks.py:26: UserWarning: Callback ModulusCallback.on_job_start raised RuntimeError: Running CUDA fuser is only supported on CUDA builds.
  warnings.warn(
[00:45:21] - Arch Node: heat_network has been converted to a FuncArch node.
[00:45:27] - Arch Node: flow_network has been converted to a FuncArch node.
[00:45:28] - Arch Node: heat_network has been converted to a FuncArch node.
[00:45:29] - Arch Node: flow_network has been converted to a FuncArch node.
[00:45:29] - attempting to restore from: outputs/heat_sink
[00:45:29] - optimizer checkpoint not found
[00:45:29] - model flow_network.0.pth not found
[00:45:29] - model heat_network.0.pth not found
[00:46:16] - [step:          0] record constraint batch time:  1.185e+01s
[00:46:17] - [step:          0] record validators time:  1.201e+00s
[00:46:17] - [step:          0] record monitor time:  1.111e-01s
[00:46:17] - [step:          0] saved checkpoint to outputs/heat_sink
[00:46:17] - [step:          0] loss:  1.812e+00
[01:43:23] - [step:        100] loss:  4.028e-01, time/iteration:  3.425e+04 ms
[02:37:05] - [step:        200] loss:  3.519e-01, time/iteration:  3.222e+04 ms

Second run of heat_sink.py with cuda_graphs: False:
Compared to the first run, the second run is faster.

root@27c5850e9313:/examples/examples/three_fin_2d# python heat_sink.py
[02:51:41] - JitManager: {'_enabled': False, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[02:51:41] - GraphManager: {'_func_arch': False, '_debug': False, '_func_arch_allow_partial_hessian': True}
[02:51:51] - attempting to restore from: outputs/heat_sink
[02:51:51] - Success loading optimizer: outputs/heat_sink/optim_checkpoint.0.pth
[02:51:51] - Success loading model: outputs/heat_sink/flow_network.0.pth
[02:51:51] - Success loading model: outputs/heat_sink/heat_network.0.pth
[02:51:53] - [step:          0] record constraint batch time:  1.854e-01s
[02:51:53] - [step:          0] record validators time:  1.181e-01s
[02:51:53] - [step:          0] record monitor time:  3.515e-02s
[02:51:53] - [step:          0] saved checkpoint to outputs/heat_sink
[02:51:53] - [step:          0] loss:  1.266e+01
[02:52:20] - [step:        100] loss:  4.566e-01, time/iteration:  2.673e+02 ms
[02:52:47] - [step:        200] loss:  3.499e-01, time/iteration:  2.709e+02 ms
[02:53:14] - [step:        300] loss:  2.853e-01, time/iteration:  2.698e+02 ms
[02:53:41] - [step:        400] loss:  3.362e-01, time/iteration:  2.708e+02 ms
[02:54:08] - [step:        500] loss:  1.882e-01, time/iteration:  2.690e+02 ms
[02:54:35] - [step:        600] loss:  3.896e-01, time/iteration:  2.674e+02 ms
[02:55:02] - [step:        700] loss:  2.518e-01, time/iteration:  2.710e+02 ms
[02:55:29] - [step:        800] loss:  2.079e-01, time/iteration:  2.709e+02 ms

But it doesn’t converge.

[12:13:50] - [step:     125200] loss:  6.925e-02, time/iteration:  2.709e+02 ms
[12:14:17] - [step:     125300] loss:  6.975e-02, time/iteration:  2.697e+02 ms
[12:14:44] - [step:     125400] loss:  1.216e-01, time/iteration:  2.689e+02 ms
[12:15:11] - [step:     125500] loss:  7.264e-02, time/iteration:  2.687e+02 ms
[12:15:38] - [step:     125600] loss:  6.104e-02, time/iteration:  2.686e+02 ms
[12:16:04] - [step:     125700] loss:  7.468e-02, time/iteration:  2.675e+02 ms
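
To look at the trend more systematically than scanning the console, here is a rough sketch I use to pull the loss values out of a saved console log for plotting. It only assumes the "[step: N] loss: X" line format shown above; the log file name is just an example:

import re
import matplotlib.pyplot as plt

# Parse lines like "[step:     125700] loss:  7.468e-02, ..." from a
# saved console log and plot loss vs. step on a log scale.
pattern = re.compile(r"\[step:\s*(\d+)\]\s+loss:\s*([0-9.eE+-]+)")

steps, losses = [], []
with open("heat_sink_console.log") as f:  # hypothetical log file name
    for line in f:
        m = pattern.search(line)
        if m:
            steps.append(int(m.group(1)))
            losses.append(float(m.group(2)))

plt.semilogy(steps, losses)
plt.xlabel("step")
plt.ylabel("loss")
plt.title("heat_sink.py training loss (cuda_graphs: False)")
plt.savefig("heat_sink_loss.png")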