SSH crash during GPU computing when using a large amount of memory below the hardware limit

Hello, NVIDIA experts! I am using 2 RTX 4090 GPUs concurrently for scientific computing. There is no problem for small problem sizes, but an SSH crash happens randomly for large problem sizes. However, there is no error output when running with “cuda-gdb --silent --ex run --args ./myprogram”, and the program completes the simulation without issues inside cuda-gdb. I checked the GPU memory usage, which stays below 20 GB, smaller than the 24 GB hardware limit.
In my program, some variables are large and need to be allocated and freed during the iterations. I am wondering whether this is caused by memory fragmentation. Could you give me some advice? Thank you.
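To be concrete, the per-iteration allocation pattern looks roughly like the sketch below (placeholder buffer name and sizes, not my actual code):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Simplified sketch of my per-iteration allocation pattern
// (placeholder buffer name and sizes; the real variables and kernels differ).
int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int iter = 0; iter < 1000; ++iter) {
        size_t n = 1000000 + static_cast<size_t>(iter) * 1000;  // size varies per iteration
        double *work = nullptr;

        // Large temporary buffer from the stream-ordered pool.
        cudaError_t err = cudaMallocAsync((void **)&work, n * sizeof(double), stream);
        if (err != cudaSuccess) {
            printf("iteration %d: cudaMallocAsync failed: %s\n", iter, cudaGetErrorString(err));
            break;
        }

        // ... kernels that use `work` are launched on `stream` here ...

        // Freed again before the next iteration (cudaFree as in my program;
        // cudaFreeAsync would be the stream-ordered counterpart).
        cudaFree(work);
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```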

What are the symptoms of such an “ssh crash”?

The GPU machine runs Linux, and I access it from Windows using VS Code or a terminal via an SSH remote connection. The simulation of the large problem caused the SSH connection to drop. When I logged in again to check, nvidia-smi showed that the program had also stopped after the SSH disconnection.

Perhaps a timeout? Try whether it is possible to run tmux, a terminal multiplexer, on your Linux system. With it you can reconnect to sessions.

Thanks for your reply. I tried tmux before; the GPU program had already stopped when I reconnected to the session after the disconnection. The simulation takes around 30 minutes, so I don't think it is a timeout.
In this program, I manage GPU memory with cudaMallocAsync / cudaFree for some large and small variables. During the iterations, these varying-size variables are allocated and freed. Does this cause memory fragmentation and instability? Would Thrust vectors with automatic memory management improve the stability? Very strangely, the program works in the cuda-gdb environment.
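For reference, the Thrust-based management I am asking about would look something like the sketch below (placeholder names and sizes, not my real code): one device_vector reserved up front and resized per iteration instead of calling cudaMallocAsync / cudaFree every time.

```cpp
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <cstdio>

// Simplified sketch of the Thrust alternative (placeholder sizes and
// operations standing in for my real kernels).
int main() {
    const size_t max_n = 2000000;          // placeholder upper bound on the buffer size
    thrust::device_vector<double> work;    // Thrust owns the device memory
    work.reserve(max_n);                   // one allocation up front

    for (int iter = 0; iter < 1000; ++iter) {
        size_t n = 1000000 + static_cast<size_t>(iter) * 1000;  // varies, stays <= max_n

        // No reallocation as long as n stays within the reserved capacity.
        work.resize(n);

        // ... placeholder work standing in for my kernels ...
        thrust::sequence(work.begin(), work.end());
        double sum = thrust::reduce(work.begin(), work.end());
        if (iter % 100 == 0) {
            printf("iteration %d: sum = %f\n", iter, sum);
        }
    }
    return 0;
}
```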