SSH crash during GPU computing when using a large amount of memory below the hardware limit

Hello, NVIDIA experts! I am using 2 RTX 4090 GPUs concurrently for scientific computing. There is no problem for small problem sizes, but an SSH crash happens randomly for large problem sizes. However, there is no error output when running with “cuda-gdb --silent --ex run --args ./myprogram”, and the program completes the simulation without issues inside cuda-gdb. I checked the GPU memory usage, which stays below 20 GB, smaller than the 24 GB hardware limit.
In my program, some variables are large and need to be allocated and freed during the iterations. I am wondering whether this is caused by memory fragmentation. Could you give me some advice? Thank you.
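To be concrete, the per-iteration allocation pattern looks roughly like the sketch below (placeholder buffer name and sizes, not my actual code):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Simplified sketch of my per-iteration allocation pattern
// (placeholder buffer name and sizes; the real variables and kernels differ).
int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int iter = 0; iter < 1000; ++iter) {
        size_t n = 1000000 + static_cast<size_t>(iter) * 1000;  // size varies per iteration
        double *work = nullptr;

        // Large temporary buffer from the stream-ordered pool.
        cudaError_t err = cudaMallocAsync((void **)&work, n * sizeof(double), stream);
        if (err != cudaSuccess) {
            printf("iteration %d: cudaMallocAsync failed: %s\n", iter, cudaGetErrorString(err));
            break;
        }

        // ... kernels that use `work` are launched on `stream` here ...

        // Freed again before the next iteration (cudaFree as in my program;
        // cudaFreeAsync would be the stream-ordered counterpart).
        cudaFree(work);
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```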

What are the symptoms of such an “ssh crash”?

The GPU machine runs Linux, and I access it from Windows using VS Code or a terminal via an SSH remote connection. The simulation of the large problem caused the SSH connection to drop. When I logged in again to check, nvidia-smi showed that the program had also stopped after the SSH disconnection.

Perhaps a timeout? Try whether it is possible to run tmux, a terminal multiplexer, on your Linux system. With it you can reconnect to sessions.

Thanks for your reply. I tried tmux before; the GPU program had already stopped when I reconnected to the session after the disconnection. The simulation takes around 30 minutes, so I don't think it is a timeout.
In this program, I manage GPU memory with cudaMallocAsync / cudaFree for some large and small variables. During the iterations, these varying-size variables are allocated and freed. Does this cause memory fragmentation and instability? Would Thrust vectors with automatic memory management improve the stability? Very strangely, the program works in the cuda-gdb environment.
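For reference, the Thrust-based management I am asking about would look something like the sketch below (placeholder names and sizes, not my real code): one device_vector reserved up front and resized per iteration instead of calling cudaMallocAsync / cudaFree every time.

```cpp
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <cstdio>

// Simplified sketch of the Thrust alternative (placeholder sizes and
// operations standing in for my real kernels).
int main() {
    const size_t max_n = 2000000;          // placeholder upper bound on the buffer size
    thrust::device_vector<double> work;    // Thrust owns the device memory
    work.reserve(max_n);                   // one allocation up front

    for (int iter = 0; iter < 1000; ++iter) {
        size_t n = 1000000 + static_cast<size_t>(iter) * 1000;  // varies, stays <= max_n

        // No reallocation as long as n stays within the reserved capacity.
        work.resize(n);

        // ... placeholder work standing in for my kernels ...
        thrust::sequence(work.begin(), work.end());
        double sum = thrust::reduce(work.begin(), work.end());
        if (iter % 100 == 0) {
            printf("iteration %d: sum = %f\n", iter, sum);
        }
    }
    return 0;
}
```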