cudaMalloc segfaulting: possible cause?

About 6 hours into my run, cudaMalloc either segfaults (under Linux) or causes very odd behavior (under Windows). It happens on the largest memory allocation in my program.

I’m fairly sure I must be doing something wrong somewhere, but I can’t work out what. For now I’m just trying to rule out possibilities.

Am I right in thinking this can’t be caused by the GPU? I would expect it to return an error code rather than segfaulting if the malloc fails. No previous calls return error codes either.

If it’s not the GPU, it’s on the host side. Valgrind reports a few memory leaks in libcudart.so (one in cudaMalloc, though not the call that segfaults, and one in cudaGetDeviceCount). However, upgrading my host memory from 3GB to 4GB didn’t change the behavior of the program at all. I can only think that something on the host is overflowing and overwriting something that doesn’t like being overwritten.

Since the problem takes 6 hours to show up, it’s painfully hard to work with. Any thoughts?

It is a bug in the driver, a fix will be available shortly.

Using the latest Windows driver (178.13) I am still seeing exactly the same behaviour. I’ve just set off a test to see if it’s giving any error codes this time (I accidentally ran it without my error checking).
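For anyone following along, this is roughly the kind of per-call check I mean. It's a minimal sketch in plain C++ so it compiles without the CUDA toolkit: `Status`, `status_name`, `check`, `CHECK` and `fake_device_malloc` are all hypothetical stand-ins. In real code the status type would be `cudaError_t`, success would be `cudaSuccess`, and the message would come from `cudaGetErrorString`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Stand-in for cudaError_t (assumption: keeps the sketch buildable without CUDA).
enum class Status { Ok, OutOfMemory, InvalidValue };

// Stand-in for cudaGetErrorString.
const char* status_name(Status s) {
    switch (s) {
        case Status::Ok:           return "Ok";
        case Status::OutOfMemory:  return "OutOfMemory";
        case Status::InvalidValue: return "InvalidValue";
    }
    return "Unknown";
}

// Throws with the failing expression and its source location.
void check(Status s, const char* expr, const char* file, int line) {
    if (s == Status::Ok) return;
    throw std::runtime_error(std::string(file) + ":" + std::to_string(line) +
                             ": " + expr + " failed with " + status_name(s));
}

// Wrap every call so that no return code is silently dropped.
#define CHECK(call) check((call), #call, __FILE__, __LINE__)

// Hypothetical allocation call: pretends only 150 megs are available.
Status fake_device_malloc(std::size_t megabytes) {
    return megabytes > 150 ? Status::OutOfMemory : Status::Ok;
}
```

With this, `CHECK(fake_device_malloc(100))` passes silently and `CHECK(fake_device_malloc(230))` throws with the file, line and status name. A check like this only helps if the call actually returns; it can't report anything if the process segfaults inside the call.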

A bit of background on my program's memory allocation: at the start of each iteration it makes several allocations, most of them ~1-8 megs, some ~60 megs, then one of ~140 megs and one of ~230 megs. At the end of each iteration it frees the memory, and on the next iteration it allocates slightly more. It dies on the 286th iteration.

cuMemGetInfo reports that I’m using ~83% of the overall memory (used: ~745 megs, free: ~150 megs; it’s an 892 meg card).

I’m still not completely ruling out an error on my part. However, cudaMalloc segfaulting on Linux certainly seemed bad, and with the new Windows driver I’m getting the same behaviour that corresponded with the segfault previously, at about the same point. Both runs have one particular memory allocation just over ~150 megs, which is worryingly close to the amount of free memory, but I think that's unrelated: it still dies on runs with slightly more free memory. This all leads me to think that it’s somehow related to memory fragmentation on the card (GTX 260) not leaving a contiguous space for the big chunk of memory that needs to be allocated.

Anyway, can somebody confirm that the bug mentioned is fixed in the latest driver, please?
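To make the fragmentation suspicion concrete, here is a toy first-fit model in plain C++. It's only a sketch under my own assumptions: I don't know how the driver actually manages device memory, and `ToyHeap`, `run_scenario` and the block sizes (in megs) are made up for illustration. It shows how total free memory can be larger than a request while no single contiguous gap is.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>

// Toy first-fit allocator over a linear range. NOT the real driver allocator.
class ToyHeap {
public:
    explicit ToyHeap(std::size_t capacity) : capacity_(capacity) {}

    // Returns the start offset of the new block, or -1 if no contiguous gap fits.
    long alloc(std::size_t size) {
        if (size == 0) return -1;
        std::size_t cursor = 0;
        for (const auto& block : used_) {             // ordered by start offset
            if (block.first - cursor >= size) break;  // gap before this block fits
            cursor = block.first + block.second;      // skip past this block
        }
        if (cursor + size > capacity_) return -1;
        used_[cursor] = size;
        return static_cast<long>(cursor);
    }

    void release(long offset) {
        auto it = used_.find(static_cast<std::size_t>(offset));
        assert(it != used_.end());                    // must be a live block
        used_.erase(it);
    }

    std::size_t total_free() const {
        std::size_t used = 0;
        for (const auto& block : used_) used += block.second;
        return capacity_ - used;
    }

    std::size_t largest_free_block() const {
        std::size_t cursor = 0, best = 0;
        for (const auto& block : used_) {
            best = std::max(best, block.first - cursor);
            cursor = block.first + block.second;
        }
        return std::max(best, capacity_ - cursor);
    }

private:
    std::size_t capacity_;
    std::map<std::size_t, std::size_t> used_;         // start offset -> length
};

struct ScenarioResult {
    std::size_t total_free;
    std::size_t largest_free;
    long failed_offset;
    long retry_offset;
};

// Hypothetical scenario: big blocks separated by small ones that stay alive.
ScenarioResult run_scenario() {
    ToyHeap heap(892);
    long big1 = heap.alloc(230);
    long small1 = heap.alloc(8);
    long big2 = heap.alloc(140);
    heap.alloc(8);                                    // small block, stays allocated
    long big3 = heap.alloc(60);
    heap.alloc(8);                                    // small block, stays allocated
    heap.alloc(438);                                  // fills the rest of the heap
    heap.release(big1);
    heap.release(big2);
    heap.release(big3);

    ScenarioResult r;
    r.total_free = heap.total_free();                 // sum of all the gaps
    r.largest_free = heap.largest_free_block();       // biggest single gap
    r.failed_offset = heap.alloc(240);                // no single gap is big enough
    heap.release(small1);                             // merges the first two gaps
    r.retry_offset = heap.alloc(240);                 // now it fits
    return r;
}
```

In this model the failure only appears while some small blocks stay allocated between the big ones. If everything really is freed at the end of each iteration, a simple allocator like this would not fragment, so this illustrates the failure mode rather than explaining the bug.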

EDIT: Running a test overnight with full error checking and some fairly mean input checks. I’ll be back in about 16 hours!

I don’t think anyone said it was fixed already.

Well, a fix ‘soon’ followed about a week later by a new driver kinda implied it.

Sorry, but the fixes (cudaMalloc and watchdog) are not in 178.13.

Ok. Thanks for the info!

Ouch! Godspeed getting the new driver out.