I'm running Ubuntu 16.04 with CUDA 7.5 and NVIDIA driver 361 for my two Tesla K40c GPUs.
I was able to run the CUDA samples (specifically vectorAdd).
I ran a few CUDA programs using cutorch, and now when I try to run vectorAdd, it fails with:
$ sudo ./vectorAdd
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code out of memory)!
If I restart the machine, things work again, and then stop after a few runs of my code. This was also happening earlier with Ubuntu 14.
The debug log is here: http://sprunge.us/hhaM
Thanks in advance!
A process of yours (presumably in your cutorch workflow) is terminating abnormally and not freeing its GPU memory; normal process termination should release any allocations.
You could try the reset facility in nvidia-smi to reset the GPUs in question. If that is possible, it should fix the issue without a reboot. You could also try to identify any processes associated with the GPUs using nvidia-smi and kill those processes manually; a sketch of both follows.
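Roughly these commands (GPU indices 0 and 1 are assumptions based on your two K40c cards, and <PID> is a placeholder; check nvidia-smi --help, since the accepted flags vary somewhat between driver releases):

# show per-GPU memory usage and any compute processes still holding memory
$ nvidia-smi
$ nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv

# kill a stale process manually, using a PID from the nvidia-smi process list
$ sudo kill -9 <PID>

# attempt a reset of each GPU (requires root, and the GPU must be idle)
$ sudo nvidia-smi --gpu-reset -i 0
$ sudo nvidia-smi --gpu-reset -i 1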
Otherwise you’ll need to identify your process-termination issues and rectify them, or else reboot the system.
There was a bug in certain drivers where the memory was not released if the process was terminated.
Try to use the latest 361 driver; I don’t remember in which version it was fixed.
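To check exactly which build you have installed, either of these should work:

$ cat /proc/driver/nvidia/version
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader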
I assume you meant “There was a bug in certain drivers where the memory was not released if the process was terminated abnormally”?
Thanks for the replies, guys. Still no luck.
I successfully reset both GPUs in my machine using nvidia-smi.
According to nvidia-smi, there are no processes running that are using the GPUs.
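For reference, this is roughly what I checked (the fuser line is an extra sanity check against the /dev/nvidia* device nodes, in case something holds them open without appearing in the nvidia-smi process list):

# confirm no compute processes are listed
$ nvidia-smi
# check whether anything still holds the device nodes open
$ sudo fuser -v /dev/nvidia*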
I am using NVIDIA-SMI 361.42 … which I installed just a few days ago.
I cannot reboot the machine since many others are logged in.
I tried ‘rmmod’ followed by ‘modprobe’ of the nvidia driver. Even that didn’t fix it.
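The reload sequence I used was roughly this (module names are the usual ones; nvidia_uvm is the unified-memory module that cutorch/CUDA load on top of the base nvidia module, and it has to come out first because it depends on nvidia; on 36x drivers an nvidia_modeset module may also be loaded and need removing):

# unload the UVM module first, then the base driver
$ sudo rmmod nvidia_uvm
$ sudo rmmod nvidia
# reload them in the opposite order
$ sudo modprobe nvidia
$ sudo modprobe nvidia_uvm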
Is there something else I can do to refresh everything and emulate the effect of rebooting? Thanks!