Is there any comprehensive list of which conditions/circumstances will cause which error codes to be returned by each CUDA function? Specifically I’m interested in cuMemcpy2D(Unaligned) and cuMemcpyHtoD, but it would be useful information for all calls really. (Something similar to how the OpenGL man pages document which errors are produced by which incorrect inputs/circumstances.)
Specifically, I’m trying to figure out why cuMemcpy2D / cuMemcpy2DUnaligned / cuMemcpyHtoD would be returning CUDA_ERROR_INVALID_VALUE (1)… 99% of the time my calls with identical inputs will work; 1% of the time (again, with identical inputs) the calls will fail…
I can reproduce the 1% case (although I’m not 100% sure what’s happening to cause this error, due to the complexity of the application). I’m sure I’m properly managing my contexts between the various threads in the app, and the memory pointers / pitches / etc. being passed in are 110% correct… so I'm really puzzled as to what could cause CUDA_ERROR_INVALID_VALUE.
I’ve also verified the error is not coming from a previously called async function…
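(For reference, the check amounts to something like this - a minimal sketch, assuming the context is already current on this thread:)
// Flush all outstanding async work first, so a deferred error from an earlier
// call can't be mistaken for a failure of the memcpy itself.
CUresult sync_err = cuCtxSynchronize();
assert(sync_err == CUDA_SUCCESS);  // any error reported here belongs to earlier async work
// The cuMemcpyHtoD / cuMemcpy2D call that then returns CUDA_ERROR_INVALID_VALUE comes after this.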
Anyway rant over, more curious what circumstances could cause a result of CUDA_ERROR_INVALID_VALUE…
If you had bothered to look at the documentation yourself, you’d realise it’s less verbose than the enum symbol itself: “Invalid value.” Amazing documentation!
Realistically, the only people who could answer this question are those at NVIDIA who can look at these function calls and determine when that result may be returned, and why (and maybe update the Doxygen docs in the process ;)).
I personally believe that those error codes should be sufficient for even an average programmer to pinpoint the error… Think about a case where you want to do a cudaMemcpy from host to device, but have accidentally passed the wrong pointers (swapped the host and device pointers).
Now, after this memcpy call, you could check the error status and print something like “At LINE <line>: cudaMemcpy failed due to <error string>” (see the sketch below). I know the example I gave is a 12th-grade problem, but in general as well, that should be a pretty good starting point for debugging, shouldn’t it?
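Something like this is what I had in mind - a rough sketch using the runtime API (the CUDA_CHECK macro and the pointer names in the usage line are just illustrative):
#include <stdio.h>
#include <cuda_runtime.h>

// Illustrative helper: report where a runtime-API call failed and why.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            fprintf(stderr, "At LINE %d: %s failed due to %s\n",            \
                    __LINE__, #call, cudaGetErrorString(err_));             \
        }                                                                   \
    } while (0)

// Usage - the swapped host/device pointer mistake above would be reported
// as "invalid argument" right at the offending line:
// CUDA_CHECK(cudaMemcpy(dst_device, src_host, num_bytes, cudaMemcpyHostToDevice));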
The inputs are valid; this has been verified. The host pointer, device pointer, and size are all valid - which means some prior call has done something to the internal CUDA state to cause the memcpy to fail and, for some reason, return ‘invalid value’.
Having gone over the call traces for the two cases - where it works, and where it doesn’t - I can’t see any discernible difference prior to the memcpy in question that would cause an error like this…
Sure, I can write it up - in all of 5 lines (or at least the problem area):
CUdeviceptr ptr;
size_t pitch;
CUresult error;
error = cuMemAllocPitch(&ptr, &pitch, 640, 480, 4); // 640-byte-wide rows, 480 rows
assert(error == CUDA_SUCCESS); // succeeds
// This actually comes from somewhere else, but I've validated the pointer and the
// data it contains (including size) - it's a valid 8-bit grayscale image.
unsigned char client_buffer[640*480];
error = cuMemcpyHtoD(ptr, client_buffer, 640*480); // copy the whole 640x480 image
assert(error == CUDA_SUCCESS); // fails
As I said, 99% of the time this will work and 1% of the time it won’t. The only difference in the 1% case is that I’ve literally ‘just’ pushed the context onto this thread, after waiting for any other threads to stop using it (the hand-off pattern is sketched below) - I’ve verified that no other threads have this context current, and this is the only context used in the entire process, even on the entire system.
I’ve also tried running ‘exactly’ the code I posted above ‘after’ I encounter the initial CUDA_ERROR_INVALID_VALUE, to see if the above sample would work - and it doesn’t (still invalid value, despite using newly allocated memory that almost certainly ‘can’t’ be invalid) - indicating it’s not an issue with the values I’m passing, but with the runtime API and/or driver’s internal state.
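For what it’s worth, the context hand-off boils down to roughly this pattern (a simplified sketch - ‘g_ctx’ is just an illustrative name for the single shared context, and the app-level locking is omitted):
#include <assert.h>
#include <cuda.h>

extern CUcontext g_ctx; // the one shared context, created at startup (illustrative name)

void worker_copy(void)
{
    // App-level locking that guarantees no other thread has g_ctx current is omitted here.
    CUresult err = cuCtxPushCurrent(g_ctx);   // make the shared context current on this thread
    assert(err == CUDA_SUCCESS);

    // ... the cuMemAllocPitch / cuMemcpyHtoD calls from the snippet above ...

    CUcontext popped = NULL;
    err = cuCtxPopCurrent(&popped);           // detach it again before handing the context back
    assert(err == CUDA_SUCCESS && popped == g_ctx);
}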
I just want to clear a doubt before concluding further… With a pitched allocation, the memory size will actually be ‘rows * pitch’ rather than ‘rows * cols’, right? So, is it OK to perform a memory transfer from host to device where the host buffer is of size ‘rows * cols’ while the device allocation is ‘rows * pitch’?
I simplified my example code above, since in the case of 640*480 I always get a pitch of 640.
My actual code has a conditional along the lines of ‘if (width_bytes == pitch) cuMemcpyHtoD(…); else { … cuMemcpy2D(…); }’; I just simplified the example code to my specific case because the 2D memcpy code is 12+ lines (see the sketch below).
I should note that even if I do a 2D memcpy, despite pitch == width, I still get invalid value.
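For completeness, the 2D path is roughly the following (a sketch of the 12+ lines I mentioned, assuming the same 8-bit 640x480 image and the ‘width_bytes’ / ‘pitch’ / ‘ptr’ names from above; this is also how a ‘rows * cols’ host buffer ends up in a ‘rows * pitch’ device allocation when the pitches differ):
CUDA_MEMCPY2D copy = {0};                 // zero-init so the unused offset/array fields are 0
copy.srcMemoryType = CU_MEMORYTYPE_HOST;
copy.srcHost       = client_buffer;
copy.srcPitch      = width_bytes;         // host rows are tightly packed (640 bytes)
copy.dstMemoryType = CU_MEMORYTYPE_DEVICE;
copy.dstDevice     = ptr;
copy.dstPitch      = pitch;               // pitch returned by cuMemAllocPitch
copy.WidthInBytes  = width_bytes;         // 640 bytes actually copied per row
copy.Height        = 480;
error = cuMemcpy2D(&copy);                // still comes back as CUDA_ERROR_INVALID_VALUE here
assert(error == CUDA_SUCCESS);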
If anyone from NVIDIA could comment on this, it would be greatly appreciated (this is stopping us from releasing… sigh).
I’m guessing the cause of the issue is pretty straightforward, but without being able to narrow it down to a smaller subset of possible causes (which the documentation doesn’t give me), this isn’t a quick process…
Still no luck solving this issue over the past couple of days; the only progress I’ve made is that I’ve been able to identify that the probability of the problem occurring increases as application performance drops (indicating some kind of possible race condition, I guess).
I really don’t see how there can be a race condition relating to a simple memcpy when you literally allocated the device and host memory ‘just’ before calling the memcpy, and no other thread is touching the CUDA context that you’re running the copy on… sigh
Edit: Also, I’ve confirmed this happens to ALL memory copies once the CUDA driver state is like this - Device<->Device, Host<->Device, Device<->Host, for both page-locked and pageable host memory. It doesn’t really matter what type of memory I give it (valid memory or invalid memory), or what type of copy I try to run - I simply get CUDA_ERROR_INVALID_VALUE in all cases.
Hmmm, after giving up on this issue and starting to debug another one, I discovered that this is either an XP-specific or driver-specific problem…
I don’t see this issue ‘at all’ on Windows Vista SP1 32-bit with the 195.21 drivers (Nexus preview 2 drivers)… I’m not sure if this is a bug that’s been fixed in the newer alpha/beta drivers, or if it’s just an XP driver issue (I don’t have the time to test various driver versions on Vista right now).
Either way, I’m pretty happy now, despite incurring the typical 40-60% performance loss by running on Vista instead of Linux/XP.