CUDA out of memory: need to reboot the server

Hi,
lately I keep getting out-of-memory errors when allocating memory on the device.
This usually happens after interrupting a previous run of my executable. I’m on Linux,
using CUDA 2.3 with driver 190.42.
When this happens, “lsmod | grep nvidia” shows that something is still using the
driver, but X is down and nothing else on the system is running CUDA code.
Is this a driver issue?

How do you interrupt it? Ctrl-C or kill -9?

Well, usually I do a Ctrl-C, but sometimes I get a segmentation fault and the program exits with a kill.

When exactly are you facing this problem: on Ctrl-C, on kill, or when the segmentation fault occurs?

Your answer is not clear.

I’m not sure when (a segmentation fault ends in a kill anyway). I don’t allocate 4 GB on each run, so my guess is that each time the executable doesn’t exit cleanly it “leaks” some memory, and the effects are seen later on.

It may be a driver glitch. Failing that, it may be that your kernel has written out of bounds and broken something, which is leading to the error message.

You might be able to get away with simply restarting the driver, rather than the whole server.
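
You could also catch Ctrl-C in the application and tear the CUDA context down yourself before exiting. This is just a rough sketch of the idea, assuming the CUDA 2.x runtime API (where cudaThreadExit() destroys the calling thread’s context), not anyone’s actual code:

    /* Sketch: catch Ctrl-C, then free device memory and destroy the
     * CUDA context explicitly before the process exits. */
    #include <signal.h>
    #include <cuda_runtime.h>

    static volatile sig_atomic_t interrupted = 0;

    static void on_sigint(int sig)
    {
        (void)sig;
        interrupted = 1;            /* only set a flag; clean up in main() */
    }

    int main(void)
    {
        signal(SIGINT, on_sigint);

        float *d_buf = NULL;
        cudaMalloc((void **)&d_buf, 256 * 1024 * 1024);   /* example allocation */

        while (!interrupted) {
            /* ... cufft / cublas work here ... */
            break;                  /* placeholder so the sketch terminates */
        }

        cudaFree(d_buf);            /* release device memory */
        cudaThreadExit();           /* destroy the context explicitly */
        return 0;
    }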

rmmod doesn’t work; it’s as if the driver were still in use (“lsmod | grep nvidia” doesn’t show a usage count of 0).

In my application I only use cufft and cublas, so it’s not one of my own kernels at fault.

I dug into it, and it seems it’s my application’s fault. I still have to check, but sometimes after my application terminates a “thread” somehow remains alive; killing it releases the memory correctly.
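
If the stray thread turns out to be one of my own worker threads, the fix should just be to make sure it releases the GPU and is joined before main() returns. Roughly like this (only a sketch with pthreads, not my actual code):

    /* Sketch: a worker thread that owns device memory is joined before
     * the process exits, so nothing stays alive holding the GPU. */
    #include <pthread.h>
    #include <cuda_runtime.h>

    static void *worker(void *arg)
    {
        (void)arg;
        float *d_tmp = NULL;
        cudaMalloc((void **)&d_tmp, 64 * 1024 * 1024);
        /* ... cufft / cublas work ... */
        cudaFree(d_tmp);
        cudaThreadExit();           /* tear down this thread's CUDA context */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);
        /* ... rest of the application ... */
        pthread_join(tid, NULL);    /* don't exit while the worker still holds GPU memory */
        return 0;
    }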