cudaMemcpy failed with invalid argument

I have a Poisson solver code that works correctly on two A100 GPUs (80GB each, single server) for problem size 256^3 and 512^3. When I try to run it on problem size 1024^3, the following statement gives an error.

ERROR: CUDA RT call "cudaMemcpy(u_device[i]+offset, u+offset+(i*nelem_init_device), size_init_device, cudaMemcpyHostToDevice)" in line 2383 of file mg.cu failed with invalid argument (1).

This fails on the first GPU (i=0). When I break at this line in cuda-gdb, and print the parameters, they all look ok. When I run the code under compute-sanitizer –tool memcheck --report-api-errors all, it says:

========= Program hit cudaErrorInvalidValue (error 1) due to "invalid argument" on CUDA API call to cudaMemcpy.

which doesn’t tell me much.

Why am I hitting this error? Thanks.

(1) Make sure the allocations for the memory involved in the copy operation actually succeeded.

(2) Make sure offsets and / or indexes into the allocated memory are correctly computed and are not affected by integer overflow.

I guess if I were betting, I would bet on njuffa’s second point. You’re indicating the trouble occurs at 1024x1024x1024 which happens to be 1G, and when multiplied by a typical bytes-per-element number, that will hit 4G bytes. Such a calculation would overflow “ordinary” integer arithmetic. So calculation types as well as datatypes start to matter in this range for correctness. Coupled with that it is a mistake that folks make from time to time. Thus, my bet.

I can’t use 64 bit offsets with long/long long if I don’t overflow 64 bits for the pointer result?

You can use 64 bits, 64-bit offsets, and pointer arithmetic. This is C++. Probably an example of the code in question with complete definitions of all the arguments would help.

Here’s an example. Suppose you were allocating 8GB (1024x1024x1024x8 bytes per element). Suppose you mistakenly used an unsigned type rather than unsigned long long (or, equivalently, on windows, mistakenly using unsigned long instead of unsigned long long, or size_t). It’s possible that the unsigned wrap around at 4GB would result in a cudaMalloc call that didn’t report any error, but you’d end up with a size less than 8GB.

Then, when you go to do your cudaMemcpy operation, and attempt using different variables to copy e.g. 8GB, then the size (or equivalently, extent) of transfer exceeds the size of allocation. And you then get an invalid argument error on cudaMemcpy. It can be confusing.

I’m not suggesting I know what the issue is. But I think the suggestion to check your types carefully is a good one.

        /* copy cpu arrays to gpus */
        long nelem_init_device = (*n1_dev)*(*n2_dev)*((*n3_dev) - 2);
        long offset = (*n1_dev)*(*n2_dev);
        size_t size_init_device = nelem_init_device * sizeof(double);
        for (int i=0;i<ngpus;i++) {
                CUDA_RT_CALL(cudaSetDevice(i));

                /* only copy interior points on partition dimension */
                CUDA_RT_CALL(cudaMemcpy(u_device[i]+offset, u+offset+(i*nelem_init_device), size_init_device,
                        cudaMemcpyHostToDevice));
                
        }

Code is above. Note *n1_dev and *n2_dev are both 1026 while *n3_dev is 514 for this problem size. The array elements are of type double.

  • u_device[i] could be allocated on a different device than i
  • if on windows, I wouldn’t use long. Even on linux, for portability, I would advise against it.
  • need to know the type that dereferencing any variable produces. For example what is the type of *n1_dev, *n2_dev, *n3_dev. If it is unsigned, or int, then the calculation is going to blow up.
  • need to know the actual sizes of the allocation for u_device[i]

probably need other info as well. Rather than trying to play 20 questions like this, if you provide a minimal complete example that demonstrates the issue, I’m sure the underlying problem can be spotted.

n1_dev, n2_dev and n3_dev are all type int. If I make them unsigned long will that work? Note I inspected nelem_init_device and offset in cuda_gdb, and they both looked ok.

no, I’m not sure that will fix it. I already indicated more information is needed. What is all the information necessary to see how you allocate u_device[i]. note that just providing the line that includes the cudaMalloc call is probably not enough info for this investigation.

I allocate u_device[i] just prior to doing the problematic memory copy. Sorry for the complicated setup math. This is what is done in the CPU code that I’m porting to GPU. BTW, this is the NAS parallel benchmark MG.

        static int lt = LT_DEFAULT;   // LT_DEFAULT is 10 for this problem size 
        int lm,ndim1,ndim2,ndim3,log_p,log2_size, nm,nv,nr,dx,dy;
        log_p = log((float) ngpus)/log(2.0);     // ngpus is 2
        log2_size = lt;
        lm = log2_size - log_p/3;
        ndim1 = lm;
        ndim2 = lm;
        ndim3 = lm - log_p;             /* only partition on the final dimension */
        *n1_dev = 2 + (1<<ndim1);
        *n2_dev = 2 + (1<<ndim2);
        *n3_dev = 2 + (1<<ndim3);
        nm = 2 + (1<<lm);
        nv = (2 + (1<<ndim1)) * (2 + (1<<ndim2)) * (2 + (1<<ndim3));
        nr = (8*(nv + nm*nm + 5*nm + 14*lt -7*lm))/7;
...
        size_u_device=sizeof(double)*(nr);
...
        /* allocate and zero gpu arrays */
        for (int i=0;i<ngpus;i++) {
                CUDA_RT_CALL(cudaSetDevice(i));
...
                CUDA_RT_CALL(cudaMalloc(&u_device[i], size_u_device));
                CUDA_RT_CALL(cudaMemset(u_device[i], 0, size_u_device));
...
        }
        static int lt = LT_DEFAULT;   // LT_DEFAULT is 10 for this problem size 
        int lm,ndim1,ndim2,ndim3,log_p,log2_size, nm,nv,nr,dx,dy;
        log_p = log((float) ngpus)/log(2.0);     // ngpus is 2 //log_p = 1
        log2_size = lt; // 10
        lm = log2_size - log_p/3; // = 10
        ndim1 = lm; // 10
        ndim2 = lm; // 10
        ndim3 = lm - log_p; //9             /* only partition on the final dimension */
        *n1_dev = 2 + (1<<ndim1); //1026
        *n2_dev = 2 + (1<<ndim2); //1026
        *n3_dev = 2 + (1<<ndim3); // 514
        nm = 2 + (1<<lm); // 1026
        nv = (2 + (1<<ndim1)) * (2 + (1<<ndim2)) * (2 + (1<<ndim3)); // 541M
        nr = (8*(nv + nm*nm + 5*nm + 14*lt -7*lm))/7; // calculation overflows - 8x541M does not fit in int
...
        size_u_device=sizeof(double)*(nr);
...
        /* allocate and zero gpu arrays */
        for (int i=0;i<ngpus;i++) {
                CUDA_RT_CALL(cudaSetDevice(i));
...
                CUDA_RT_CALL(cudaMalloc(&u_device[i], size_u_device));
                CUDA_RT_CALL(cudaMemset(u_device[i], 0, size_u_device));
...
        }

When you get around to fixing this, the datatype of size_u_device will matter also. I haven’t tried to figure out what size_u_device will end up as - it might be positive or it might be negative, and I can’t do it anyway without knowing its datatype. But the blown up calculation for nr will make your code incorrect, in any case. I presume size_u_device is ending up as some positive value, probably quite a bit smaller than anticipated, which will cause the subsequent cudaMalloc and cudaMemset operations to not throw any error, but this size mismatch will be a problem for the subsequent cudaMemcpy that is actually throwing the error.

sorry forgot to add:

size_t size_u_device;

The calculation for nr blows up.

You may have better luck with:

  size_t lm,ndim1,ndim2,ndim3,log_p,log2_size, nm,nv,nr,dx,dy;`

instead of:

And if you are on windows, I would change all usages of long to long long, unsigned long long, size_t, or similar.

OK -will try that, thanks!

That worked! Thanks again

I adapted my code to run on four GPUs, but I hit this error again on the first GPU, but on the subsequent cudaMemcpy.

ERROR: CUDA RT call "cudaMemcpy(v_device[i]+offset, v+offset+(i*nelem_init_device), size_init_device, cudaMemcpyHostToDevice)" in line 2436 of file mg.cu failed with invalid argument (1).

I’m running the 256^3 problem now so *n1_dev is 258, as is *n2_dev, and *n3_dev is 66. My code is above, but as a reminder, I’m doing:

        /* copy cpu arrays to gpus */
        unsigned long long nelem_init_device = (*n1_dev)*(*n2_dev)*(*n3_dev);
        unsigned long long offset = (*n1_dev)*(*n2_dev);
        size_t size_init_device = nelem_init_device * sizeof(double);
        for (int i=0;i<ngpus;i++) {
                CUDA_RT_CALL(cudaSetDevice(i));
                CUDA_RT_CALL(cudaMemcpy(a_device[i], a, size_a_device, cudaMemcpyHostToDevice));
                CUDA_RT_CALL(cudaMemcpy(c_device[i], c, size_c_device, cudaMemcpyHostToDevice));

                /* only copy interior points on partition dimension */
                CUDA_RT_CALL(cudaMemcpy(u_device[i]+offset, u+offset+(i*nelem_init_device), size_init_device,
                        cudaMemcpyHostToDevice));
                CUDA_RT_CALL(cudaMemcpy(v_device[i]+offset, v+offset+(i*nelem_init_device), size_init_device,
                        cudaMemcpyHostToDevice));
                CUDA_RT_CALL(cudaMemcpy(r_device[i]+offset, r+offset+(i*nelem_init_device), size_init_device,
                        cudaMemcpyHostToDevice));
        }

Any idea what’s wrong this time? Thanks.

I think I found the problem with my address and size math - sorry for the noise