Consider a CUDA program that has a memory leak: some device memory allocated with cudaMalloc() is never freed with cudaFree().
In this case, is the device memory properly cleaned up after the leaky program has exited? Or does the leak persist even after the program dies and cause problems or slowdowns for later CUDA programs?
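For concreteness, the pattern I mean is something like this (a hypothetical minimal example, not my actual code):

```
#include <cuda_runtime.h>

int main() {
    float *d_data;
    cudaMalloc(&d_data, 100 * 1024 * 1024);  // 100 MB of device memory
    // ... kernels run, then the program exits (or is killed mid-debug)
    // without ever calling cudaFree(d_data).
    return 0;
}
```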
The reason I ask is that I sometimes execute my code many times in a row, single-stepping through the host code and such. Suddenly, on some invocation, the CUDA program starts to freeze up for a few seconds before coming back to normal. A few invocations later, it freezes the computer entirely, leaving me no choice but to reboot.
The only way I can explain this gradual slide from normality to freezing is that either my code or the graphics driver is leaking device memory across invocations.
I’ve killed many processes doing CUDA work in a row, and not seen a leak. One thing to check is that the processes are actually dead. If you are building up an accumulation of zombie processes, they might continue to hold onto memory on the device even though they are not doing anything.
This doesn’t seem to be quite that clear-cut. I’ve seen many instances where my allocated memory didn’t get freed after a crash (or after I forgot to free it myself). A good general practice is to call cuMemGetInfo at the end of the program to see whether everything shut down nicely. (You might think, “Hey, why not do it first thing at startup?” Well, it turns out this only returns the proper amount of free memory once the context is up, which may not be when you think it is if you’re using the runtime API.)
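A minimal sketch of that end-of-run check using the driver API (signatures as in recent toolkits, where cuMemGetInfo takes size_t pointers; note that a context must exist before the call returns meaningful numbers, which is exactly the startup caveat above):

```
#include <stdio.h>
#include <cuda.h>

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    size_t free_mem, total_mem;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);  /* cuMemGetInfo only reports correctly
                                   once a context is up */

    /* ... allocate, run kernels, free ... */

    cuMemGetInfo(&free_mem, &total_mem);
    printf("at exit: %zu MB free of %zu MB total\n",
           free_mem >> 20, total_mem >> 20);

    cuCtxDestroy(ctx);
    return 0;
}
```

If the "free" number at exit is noticeably lower than on a fresh boot, something upstream is still holding device memory.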
Coming back to the original problem: when the device is almost out of memory, it can simply die. I’m seeing this in some situations where I still have something like 8 MB of VRAM unallocated, and as soon as I launch my kernel, the computer freezes. This is with valgrind-verified code, BTW.
I used to have other weird “depends on the lunar phase” kinds of bugs with my 8600 GT, which I was never able to attribute to anything. Sometimes I would simply get repeated unspecified launch errors. Rebooting would always help; sometimes running another kernel was enough… It was serious voodoo stuff. I considered getting some rubber chickens to sacrifice to the damn thing.
We’ve documented this behavior in other threads, and also (through the help of tmurray) found that all hardware released after (and including) Tesla C1060 does not exhibit the problem. It is an extremely rare, but extraordinarily annoying problem.
Could you run bandwidthTest for me? I am considering replacing all the PCs at my work with 285-equipped boxes. Good experiences are certainly reassuring before placing the order (as are very nice bandwidth numbers ;))
Sure. But be aware that I’m not on a PCIe 2.0 board, so host<->device bandwidth is severely gimped. The one thing that is annoying about the card (and all of the 2xx cards) is that the profiler no longer tells you the number of uncoalesced memory accesses.
Other than that, it’s a really nice and fast piece of hardware.
Thanks, that is a nice device-to-device number! I know about the profiler thing (I have a C1060 and an FX4800), but that is because there are no uncoalesced memory accesses anymore ;)
Pages 55–56 of the CUDA Programming Guide 2.1 describe this behavior. In other words, cards with compute capability 1.2 and higher optimize global memory accesses, making a great effort to reduce uncoalesced memory access.
Yes, but where before it was a boolean (coalesced or not), on 1.2-or-higher devices you would need an indication of how many memory transactions were issued, because an access that was previously "uncoalesced" can generate anywhere between 1 and 16 transactions on >=1.2 hardware.
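To make the transaction counting concrete, here is a hypothetical sketch (kernel names are mine, not from this thread) of the two access patterns being discussed. On compute 1.2+, the hardware coalesces per half-warp into as few memory segments as the addresses allow:

```
// Contiguous: a half-warp (16 threads) reading consecutive floats
// touches a single 64-byte segment, so compute 1.2+ hardware services
// it with one memory transaction.
__global__ void contiguous_read(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: with a large enough stride, the same 16 threads touch 16
// distinct segments. On compute 1.0/1.1 this was simply flagged
// "uncoalesced" (16 serialized accesses); on 1.2+ it is broken into as
// many transactions as there are distinct segments -- from 1 up to 16.
__global__ void strided_read(const float *in, float *out, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];
}
```

That range of 1 to 16 transactions per half-warp is exactly why a single "uncoalesced" counter no longer tells the whole story on these cards.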