Manual release of GPU memory: how to fix memory leak problems when relaunching the job

Hi all,

I keep having to reboot my machine over and over. Apparently, when the program is interrupted abnormally, cudaFree never gets a chance to release the GPU memory, so the system runs out of GPU memory when I relaunch the job.

Clearly something went wrong with the device I use:
[cuda]$ gpumeminfo
Detected 2 GPU
!!! cuMemGetInfo failed! (status = c9)^^^^ Device: 0
^^^^ Free : 0 bytes (0 KB) (0 MB)
^^^^ Total: 0 bytes (0 KB) (0 MB)
^^^^ nan% free, nan% used
^^^^ Device: 1
^^^^ Free : 109314048 bytes (106752 KB) (104 MB)
^^^^ Total: 131399680 bytes (128320 KB) (125 MB)
^^^^ 83.192020% free, 16.807980% used

My question: is there a better way to release the “trapped” memory on the GPU without rebooting the machine each time?

Thanks in advance,

I have the same question: how to manually release the whole device or GPU memory? Thanks!
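
One thing that may be worth checking before rebooting, assuming a Linux host where the driver exposes its devices as /dev/nvidia* nodes: whether a leftover process from the interrupted run is still alive and holding the GPU. The sketch below is not an official tool. It does roughly what `fuser -v /dev/nvidia*` does, by scanning /proc for open file descriptors that point at those nodes. The function name `pids_holding` and its parameters are made up for illustration.

```python
import os


def pids_holding(prefix="/dev/nvidia", proc_root="/proc"):
    """Return {pid: sorted list of open paths that start with prefix}.

    Hypothetical helper: assumes a Linux-style /proc layout and that the
    GPU driver exposes its devices as /dev/nvidia* nodes.
    """
    holders = {}
    if not os.path.isdir(proc_root):
        return holders  # no /proc here (non-Linux host)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed while we were scanning
            if target.startswith(prefix):
                holders.setdefault(int(entry), set()).add(target)
    return {pid: sorted(paths) for pid, paths in holders.items()}


if __name__ == "__main__":
    # Run as root to see processes owned by other users.
    for pid, paths in sorted(pids_holding().items()):
        print(pid, " ".join(paths))
```

If this lists a PID left over from the crashed job, killing that process (for example with `kill -9 <pid>`) normally lets the driver reclaim its allocations, since GPU memory belongs to the owning process's context. If nothing is listed and the memory is still reported as used, this check does not help and the problem lies elsewhere.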