cudaMemset in 11.4: what causes it to give cudaErrorInvalidValue?

I’m passing a valid CUDA buffer to cudaMemset, and in CUDA 11.4 (but not in 11.0), when I then call cudaGetLastError it reports cudaErrorInvalidValue. I’m just calling it as cudaMemset(buf, 0, size), and size is non-negative. What conditions cause cudaMemset to return cudaErrorInvalidValue? I’d expect it to do that for a negative size, but that’s not my case. Could the CUDA context be corrupted somehow? The buffer address is exactly what I got back when I allocated it.

The cudaErrorInvalidValue may be coming from somewhere else in your program: cudaGetLastError also reports errors left over from earlier asynchronous work (a prior kernel launch, for example), not just from the call immediately preceding it. You might also have stack corruption in your host code. A small, self-contained reproducer is likely to lead to a better answer. If the problem goes away when you simplify your code but leave the cudaMemset operation intact, then the source of the error is presumably not that call itself.

I should’ve mentioned: when debugging, I always call cudaDeviceSynchronize() and cudaGetLastError() around my CUDA calls, to avoid the async problem you’re mentioning. And I wish I could produce a small reproducer! It’s a huge production code with hundreds of kernels, etc. I agree with you that the cudaMemset call itself is probably not the source of the error, which is why I asked: which argument(s) will cause it to return cudaErrorInvalidValue? If I know that, I can eliminate some of my debugging possibilities.

The conditions I’m aware of that will cause cudaMemset itself to return cudaErrorInvalidValue are:

  • an invalid pointer, i.e. one not returned by a CUDA allocator, or one previously freed
  • a size that is larger than the allocation size

For example:
$ cat t1883.cu
#include <iostream>

int main(){

  char *data, *d1 = NULL;
  size_t size = 256;
  cudaError_t err = cudaMalloc(&data, size);
  if (err != cudaSuccess) {std::cout << "e1 " << cudaGetErrorString(err) << std::endl;}
  err = cudaMemset(d1, 0, 256);    // e2: pointer never allocated (NULL) -> expect invalid argument
  if (err != cudaSuccess) {std::cout << "e2 " << cudaGetErrorString(err) << std::endl;}
  err = cudaMemset(data, 0, 256);  // e3: valid pointer, size equal to the allocation -> OK
  if (err != cudaSuccess) {std::cout << "e3 " << cudaGetErrorString(err) << std::endl;}
  err = cudaMemset(data, 0, 257);  // e4: size larger than the allocation -> expect invalid argument
  if (err != cudaSuccess) {std::cout << "e4 " << cudaGetErrorString(err) << std::endl;}
  err = cudaMemset(data, 0, 0);    // e5: zero size -> OK
  if (err != cudaSuccess) {std::cout << "e5 " << cudaGetErrorString(err) << std::endl;}
}
$ nvcc -o t1883 t1883.cu
$ ./t1883
e2 invalid argument
e4 invalid argument
$

If you have ruled out those possibilities, and also ruled out the asynchronous path, then my guess would be that host code stack corruption is a likely culprit. Note that since the size parameter is a size_t, there is no possibility of a negative size interpretation. A program with host code stack corruption is exercising undefined behavior (UB) in C or C++, and under UB the behavior of the code is unpredictable; any outcome is possible, including any observed outcome from any CUDA API call.
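
For illustration, a minimal standalone sketch (separate from the t1883.cu example above): a “negative” size simply converts to a huge size_t and then fails the same larger-than-the-allocation check as case e4.

#include <iostream>

int main(){

  char *data = NULL;
  size_t size = 256;
  cudaError_t err = cudaMalloc(&data, size);
  if (err != cudaSuccess) {std::cout << "alloc " << cudaGetErrorString(err) << std::endl;}
  size_t sz = (size_t)-1;          // a "negative" size wraps to the largest possible size_t
  std::cout << "-1 as a size_t: " << sz << std::endl;
  err = cudaMemset(data, 0, sz);   // far larger than the 256-byte allocation
  if (err != cudaSuccess) {std::cout << "memset " << cudaGetErrorString(err) << std::endl;}  // expect invalid argument, as in e4
  cudaFree(data);
}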

That is very helpful! (Especially knowing that it tracks allocated size to enable the size check.)
Thanks!

I don’t suppose there happens to be a CUDA function that returns the allocated size, given a cudaMalloc’ed pointer? (Or even one that checks a pointer the way cudaMemset does?) I expect that’s considered an implementation detail, but it would be helpful to me…

I’m not aware of one.

If you’re feeling silly, and are going to memset the whole thing anyway, you could do a binary search:

$ cat t1883.cu
#include <iostream>
#include <cstdlib>

size_t find_allocated_pointer_size(void *p){
  size_t test1 = 1048576ULL*1024*100; // start at 100GB, would not work for oversubscription
  size_t test2 = 0;
  if (cudaSuccess == cudaMemset(p, 0, test1)) {std::cout << "allocated size is " << test1 << " bytes or larger." << std::endl; return test1;}
  // binary search: test2 is the largest size known to succeed, test1 the smallest size known to fail
  while (test1 > test2+1){
    size_t test3 = ((test1-test2)>>1) + test2;
    if (cudaSuccess == cudaMemset(p, 0, test3)) test2 = test3;
    else test1 = test3;
  }
  return test2;
}

int main(int argc, char *argv[]){

  char *data, *d1 = NULL;
  size_t size = 256;
  if (argc > 1) size = atol(argv[1]);
  cudaError_t err = cudaMalloc(&data, size);
  if (err != cudaSuccess) {std::cout << "e1 " << cudaGetErrorString(err) << std::endl;}
  err = cudaMemset(d1, 0, 256);
  if (err != cudaSuccess) {std::cout << "e2 " << cudaGetErrorString(err) << std::endl;}
  err = cudaMemset(data, 0, 256);
  if (err != cudaSuccess) {std::cout << "e3 " << cudaGetErrorString(err) << std::endl;}
  err = cudaMemset(data, 0, 257);
  if (err != cudaSuccess) {std::cout << "e4 " << cudaGetErrorString(err) << std::endl;}
  err = cudaMemset(data, 0, 0);
  if (err != cudaSuccess) {std::cout << "e5 " << cudaGetErrorString(err) << std::endl;}
  size_t result = find_allocated_pointer_size(data);
  std::cout << "Size is " << result << " bytes" << std::endl;
}
$ nvcc -o t1883 t1883.cu
$ ./t1883
e2 invalid argument
e4 invalid argument
Size is 256 bytes
$

The first thing I would do here is temporarily add CUDA error checking just before the cudaMemset to make sure there is no pending error that simply gets exposed by the call.
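
Something along these lines (a minimal sketch; buf here is just a stand-in for whatever buffer the real call uses):

#include <iostream>

int main(){

  char *buf = NULL;
  cudaError_t err = cudaMalloc(&buf, 256);
  if (err != cudaSuccess) {std::cout << "alloc " << cudaGetErrorString(err) << std::endl;}
  // ... the rest of the application would run here ...
  cudaDeviceSynchronize();
  err = cudaGetLastError();        // anything reported here was already pending before the suspect call
  if (err != cudaSuccess) {std::cout << "pending before memset: " << cudaGetErrorString(err) << std::endl;}
  err = cudaMemset(buf, 0, 256);   // an error reported here now belongs to this call itself
  if (err != cudaSuccess) {std::cout << "memset " << cudaGetErrorString(err) << std::endl;}
  cudaFree(buf);
}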

If that passes (no error detected prior to the cudaMemset), the next experiment I would do is to temporarily set the size argument of the suspect cudaMemset() call to 1. If you still get cudaErrorInvalidValue on that call, you know the pointer being passed to cudaMemset is invalid.

Follow the bad pointer upstream in the code. I usually just add ad-hoc printf calls that print relevant information, e.g. a line of the form module::functionname::instance: ptr=<address>. The printouts create a text log for perusal and comparison across runs. Even in a large application it shouldn’t take more than a couple of hours to locate the root cause.
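
For instance, a hypothetical helper along these lines (the macro name and format are illustrative, not from the application):

#include <cstdio>

// print one traceable line per instrumented site, e.g. file::function::line: ptr=<address>
#define TRACE_PTR(p) std::printf("%s::%s::%d: ptr=%p\n", __FILE__, __func__, __LINE__, (void *)(p))

int main(){
  int dummy = 0;
  int *ptr = &dummy;
  TRACE_PTR(ptr);   // each run produces a text log that can be compared across runs
}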

Haha, I love this. Extra points for not just the idea, but working code!

For others coming here later, I discovered you can use cudaPointerGetAttributes for this. It returns success with attrs.type == cudaMemoryTypeDevice and a nonzero attrs.devicePointer for valid device pointers (e.g. returned from cudaMalloc).
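
A minimal sketch of that kind of check (the helper name is mine; exact reporting for freed pointers can vary by CUDA version):

#include <iostream>

// true if cudaPointerGetAttributes reports p as a device allocation
bool is_device_ptr(const void *p){
  cudaPointerAttributes attrs;
  cudaError_t err = cudaPointerGetAttributes(&attrs, p);
  if (err != cudaSuccess) { cudaGetLastError(); return false; }   // clear the error, treat as invalid
  return (attrs.type == cudaMemoryTypeDevice) && (attrs.devicePointer != NULL);
}

int main(){

  char *data = NULL;
  cudaMalloc(&data, 256);
  std::cout << "after cudaMalloc: " << is_device_ptr(data) << std::endl;   // expect 1
  cudaFree(data);
  std::cout << "after cudaFree:   " << is_device_ptr(data) << std::endl;   // expect 0, though reporting may vary by version
}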


I’m still working on this bug 😭. I’m now using cudaPointerGetAttributes to test all my GPU buffers after every CUDA call (as well as synchronizing before and after each one), and I’ve narrowed it down to cufftDestroy(). Before that call, all buffers are valid; after it, cudaPointerGetAttributes still returns success for all my buffers, but reports a devicePointer of 0! From that point on, any cudaMemset on those buffers fails.

This only happens on CUDA 11.2 and later, apparently, and I think only on Windows.
I’m definitely passing a valid plan handle to cufftDestroy (it’s 1, as returned from cufftPlan2d), and cufftDestroy returns success, as does cufftPlan2d.
I’m pretty sure I have no host stack/heap corruption; I’m using a very careful host allocator with bounds checking, and the (very large) app otherwise behaves properly. Also, cuda-memcheck doesn’t report any issues.

I’ve written a small reproducer that mimics the order and size of CUDA mallocs/frees and cufft calls, but of course everything works fine there… Are there any known issues with cufftDestroy? Is there any way for it to trash the CUDA heap in some unusual circumstance? Wish I could peek into its source.

cufftDestroy shouldn’t cause other unrelated pointers to be messed up.

This sounds like stack corruption. And a call to a library can certainly cause latent stack corruption to become problematic.

Stack corruption is of course always possible. I’m running with all protections turned on, and I’ve examined the code extremely carefully (including running on Linux under valgrind), so I don’t think that’s the case, but I can’t rule it out entirely. (Heap corruption seems more likely, since I expect cufftDestroy calls free, but it could be either.)
If I omit all the cufftDestroy calls (defer them until the end), the full app runs fine for many hours. And as I mentioned, something changed in CUDA 11.2 that triggers my bug; I’m not saying it’s a cufft bug, but it is a change. Looking at the release notes, I note, for instance, that cufft now keeps a lock of some kind on the FFT plan.
Well, if you think of anything interesting I can try, please let me know!

Bugs (in cufft or elsewhere) are always possible. This is really just guesswork from my end, anyway. If you can develop a self-contained, complete repro case/example, then my suggestion would be to file a bug. You may also wish to ask specific questions about cufft on the library forum.