Hi. I’m getting an unspecified launch failure in my code. After reading the various posts on this issue, it seems that most people with this problem had memory access mistakes in their kernels or allocation mistakes in their host code. For my code, however, I have good reason to think there are no such access or allocation mistakes. Furthermore, the behavior of the code suggests that I’m dealing with some memory limit unknown to me.
My code operates on a contiguous block of memory (allocated by cudaMalloc) of size 12 * X * Y * sizeof(type), where X, Y and type are specified by the user. For type = double, my code works perfectly for sufficiently small X and Y, but fails somewhere between X * Y = 700 * 350 and 750 * 375. That’s not very big (around 25 MB), so it’s clearly not a device memory limit (4 GB). If I change my type to float, however, the code once again functions perfectly. If I keep type = float and double the system size, the failure occurs again.
In one of the many posts I’ve read, someone suggested a limit to the amount of contiguous memory that could be allocated with cudaMalloc, but didn’t elaborate. Is there any merit to this idea? Can someone suggest some techniques for investigating this problem? Thanks.
You might not be able to allocate the whole device memory in one cudaMalloc() call because of address space fragmentation (there might not be a contiguous free region of address space of that size left). I’d be surprised, however, if that already prevented an allocation of just 25 MB.
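There is no API call that reports the largest contiguous allocation directly; one way to probe it, sketched here as an assumption-laden illustration rather than anything from the thread, is to binary-search over trial cudaMalloc calls:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);   // reports free/total, not largest contiguous

    // Binary-search the largest single cudaMalloc that succeeds, to 1 MiB resolution.
    size_t lo = 0, hi = free_b;
    while (lo + (1u << 20) < hi) {
        size_t mid = lo + (hi - lo) / 2;
        void *p = NULL;
        if (cudaMalloc(&p, mid) == cudaSuccess) {
            cudaFree(p);
            lo = mid;                    // mid bytes fit in one piece
        } else {
            cudaGetLastError();          // clear the error left by the failed malloc
            hi = mid;
        }
    }
    printf("free: %zu MiB, largest single allocation: ~%zu MiB\n",
           free_b >> 20, lo >> 20);
    return 0;
}
```

If the probe reports far less than the free total, fragmentation is in play; if it reports hundreds of MiB or more, a 25 MB cudaMalloc is not the problem.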
For a definitive answer as to whether your cudaMalloc() call is failing, check its return code.
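A minimal sketch of the kind of return-code checking meant here (the CUDA_CHECK macro name is an assumption, not something from the thread):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every runtime API call; print the error string and bail on failure.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err__ = (call);                                 \
        if (err__ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err__));                     \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_buf, bytes));
//   myKernel<<<grid, block>>>(d_buf);
//   CUDA_CHECK(cudaGetLastError());       // catches launch/configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during kernel execution
```

Note that kernel launches themselves return nothing, so the cudaGetLastError / cudaDeviceSynchronize pair after the launch is what actually surfaces an "unspecified launch failure".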
To see whether any stray memory accesses cause the problem, run your program under cuda-memcheck.
Thanks for the responses. Yeah - I’m pretty sure it’s not fragmentation, since the 25 MB array is by far the largest thing I allocate when I run the code.
You’re right. My initial error trapping was lazy. I’ve since improved it, and I now know that I’m getting “unspecified launch failure” in my kernel. I then wrote a very simple program to try to reproduce the problem. It too fails for an array size that should be easily manageable, but with a different error message: “invalid configuration argument”. Coincidence? I hope not, or else I have two problems, since this test code REALLY should work. Here is the test code:
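For reference, “invalid configuration argument” almost always means the <<<grid, block>>> launch parameters exceed a device limit. A generic sketch (independent of the test program above, with illustrative values) of validating a proposed configuration against the device’s reported limits:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Example configuration to validate; the numbers are illustrative only.
    dim3 block(512, 2, 1);
    dim3 grid(70000, 1, 1);

    unsigned int threads = block.x * block.y * block.z;
    if (threads > (unsigned int)prop.maxThreadsPerBlock)
        printf("block too large: %u threads > limit of %d\n",
               threads, prop.maxThreadsPerBlock);
    if (grid.x > (unsigned int)prop.maxGridSize[0])
        printf("grid.x too large: %u > limit of %d\n",
               grid.x, prop.maxGridSize[0]);

    // On devices of compute capability < 3.0, maxGridSize[0] is 65535, so a
    // grid.x of 70000 alone makes any launch fail with exactly this error.
    return 0;
}
```

A grid dimension that grows with the array size would also explain a program that works for small X * Y but fails past a threshold.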
I just realized how easy it is to use cuda-memcheck, so I did. My test code reports no errors, but my regular code reports one out-of-bounds error. I guess that means I’ll go through my regular kernel again looking for bad addresses. That still doesn’t explain why the test code fails.
To follow up: Problem solved. The problem in my original code was a simple coding mistake after all. I’m not really sure why the code ran for small enough arrays but not for large ones, but now that it’s solved it doesn’t really matter. Thanks to everyone for their help.