We are using an HP SL390 board with 8 GPUs:
Device 0: "Tesla M2050"
CUDA Driver Version / Runtime Version 3.20 / 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2687 MBytes (2817982464 bytes)
(Driver version 260.19.12)
After a long stretch of testing without a reboot, we could no longer allocate memory on device 0. deviceQuery still reported 2.6 GB of global memory, but every cudaMalloc on device 0 failed. The other devices were fine.
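For anyone who wants to check for the same symptom, here is a minimal sketch (not our actual test code). It asks each device how much memory it considers free via cudaMemGetInfo, then attempts a cudaMalloc and prints the resulting error string. The 256 MB probe size is an arbitrary assumption; adjust it to match your workload.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Arbitrary probe size (assumption): 256 MB.
    const size_t probeBytes = (size_t)256 << 20;

    for (int dev = 0; dev < count; ++dev) {
        err = cudaSetDevice(dev);
        if (err != cudaSuccess) {
            std::printf("device %d: cudaSetDevice failed: %s\n",
                        dev, cudaGetErrorString(err));
            continue;
        }

        // Free vs. total memory as the runtime sees it for this device.
        size_t freeBytes = 0, totalBytes = 0;
        err = cudaMemGetInfo(&freeBytes, &totalBytes);
        if (err != cudaSuccess) {
            std::printf("device %d: cudaMemGetInfo failed: %s\n",
                        dev, cudaGetErrorString(err));
            cudaGetLastError();  // reset the error state before the next device
            continue;
        }

        // Try a real allocation and report the outcome.
        void* p = NULL;
        err = cudaMalloc(&p, probeBytes);
        std::printf("device %d: free %lu MB / total %lu MB, cudaMalloc(%lu MB): %s\n",
                    dev,
                    (unsigned long)(freeBytes >> 20),
                    (unsigned long)(totalBytes >> 20),
                    (unsigned long)(probeBytes >> 20),
                    cudaGetErrorString(err));

        if (err == cudaSuccess) {
            cudaFree(p);
        } else {
            cudaGetLastError();  // reset the error state before the next device
        }
    }
    return 0;
}
```

If the free figure on the failing device is far below the total while nothing else is running, that would point to memory that was never released, rather than to fragmentation or an oversized request.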
I’ve seen similar problems posted here over time, mostly from Windows users, but never an adequate resolution. A failure that can only be cleared by a reboot is a real problem in our environment. It looks like a memory leak, perhaps in the driver. Has anyone else run into this? Is there a known fix?